What is version control?
Version control is a crucial part of almost any development process. Most people already practice some form of it, knowingly or not, whenever they work with documents, source code, or almost any other digital file on their file system.
Some methods of version control are stronger than others. For example, saving files with different names to manage their respective versions is a form of version control, albeit a highly error-prone and messy one. The CMS used to write this blog post, and the word processor it was drafted in, both have version control systems, too.
A better approach is to implement tooling to version files and keep things organized, without interrupting your workflow. But when it comes to tooling, there are a few considerations to keep in mind.
What do you actually want out of version control?
1. Version Control Is an Archive
The first thing that we want is an archive that tracks all the versions and edits made to our files.
- This gives us the ability to refer to or go back to a previous version of a file if something has gone wrong, or if we want to know the history of how something has changed over time. The full archive of your data versions is often called lineage.
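To make the archive idea concrete, here is a minimal sketch in Python. The `VersionArchive` class and its method names are hypothetical, invented for this illustration; real version control tools are far more capable, but the core idea of saving every version and being able to go back is the same.

```python
from dataclasses import dataclass, field


@dataclass
class VersionArchive:
    """Toy archive (hypothetical): keeps every saved version of one file's contents."""

    versions: list[str] = field(default_factory=list)

    def save(self, content: str) -> int:
        """Record a new version and return its version number (starting at 1)."""
        self.versions.append(content)
        return len(self.versions)

    def restore(self, version: int) -> str:
        """Return the content exactly as it was at the given version number."""
        if not 1 <= version <= len(self.versions):
            raise IndexError(f"no such version: {version}")
        return self.versions[version - 1]

    def lineage(self) -> list[tuple[int, str]]:
        """Return the full history as (version number, content) pairs."""
        return list(enumerate(self.versions, start=1))


archive = VersionArchive()
archive.save("draft")
archive.save("draft with typo fixd")
archive.save("draft with typo fixed")

# Go back to the very first version to see how the file started out.
first_version = archive.restore(1)
```

Note that this naive sketch keeps a full copy of every version, which is exactly the storage problem the next point addresses.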
2. Version Control Simplifies Scalability
If we have to save a full copy of a file every time we make an edit, the storage we need grows with every version. This may not seem like a big deal when working with documents or source code, but if we’re working with large binary files or machine learning datasets, storage costs can grow rapidly, and scattered copies raise the risk of fragmenting your data. So the second thing we want is a system that stores versions efficiently as the project grows.
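One common way to avoid paying for the same bytes twice is content-addressed storage: each file is stored under a hash of its contents, so a file that did not change between versions is stored only once. The sketch below is a simplified illustration, not the implementation of any particular tool; the `ContentStore` class and the file names are assumptions made up for the example.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store (hypothetical): identical content is kept once."""

    def __init__(self) -> None:
        self.blobs: dict[str, bytes] = {}         # content hash -> content
        self.versions: list[dict[str, str]] = []  # per version: file name -> content hash

    def commit(self, files: dict[str, bytes]) -> int:
        """Save a version of all files and return its index."""
        snapshot = {}
        for name, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(digest, data)  # stored only if not seen before
            snapshot[name] = digest
        self.versions.append(snapshot)
        return len(self.versions) - 1

    def stored_bytes(self) -> int:
        """Total bytes actually held by the store."""
        return sum(len(blob) for blob in self.blobs.values())


dataset = b"x" * 1_000_000  # stand-in for a large binary file that never changes
commits = [
    {"data.bin": dataset, "notes.txt": b"first pass"},
    {"data.bin": dataset, "notes.txt": b"second pass"},
]

store = ContentStore()
for files in commits:
    store.commit(files)

# Copy-everything approach: every file in every version is stored in full.
naive_bytes = sum(len(data) for files in commits for data in files.values())
# Content-addressed approach: the unchanged dataset is stored a single time.
deduplicated_bytes = store.stored_bytes()
```

With two versions that share the same 1 MB file, the copy-everything approach needs about 2 MB, while the deduplicated store holds about 1 MB. The gap widens with every additional version that leaves the large file untouched.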
3. Version Control Shouldn’t Block Your Workflow
We also want our version control system to facilitate a reasonable workflow.
- A developer’s time is one of the most challenging resources to manage, and the more time you spend worrying about versioning issues, the less time you’re able to spend on development. In general, we want a tool that lets us make incremental changes to our project without having to constantly worry about whether our version control system is doing what we intended.
4. Version Control Enhances Collaboration
Finally, we want to be able to collaborate with others.
- Developers almost never work on a project by themselves. Therefore, we want something that can support multiple users and provide a single source of truth to keep all the team members in sync.
Version Control for Software in Practice
Let’s take a look at an example of a version control tool to see how these features work together in action.
By far the most popular version control tool for software development is Git, because it delivers almost all of the benefits we want from version control for software projects.
- Git provides us with an archive for our project, by storing all the changes made to our files in something it calls a code repository.
- The changes made to files inside this code repository are efficiently stored as snapshots and diffs, so you can always restore a specific version when you need to.
- Git also introduces the concept of branches, which facilitate a developer-friendly workflow. These branches allow us to ‘branch off’ the main development path of a project to work on an idea or feature without disturbing the main state of development. When our work is ready, we can then merge it back into the main branch, moving our project along safely.
- From the collaboration perspective, Git is best used with additional services like GitHub or GitLab. These services manage the complexities of bringing together work from different developers and maintaining the single source of truth that we need. You can see the version-controlled history of Pachyderm in our GitHub repository.
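The branch-and-merge workflow described above can be sketched with a toy model in Python. This is not how Git is implemented; `ToyRepo` and `Commit` are hypothetical names invented for this illustration, and the sketch only handles the simplest kind of merge (a fast-forward, where the main branch has not changed since the feature branch was created).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Commit:
    """A snapshot of the project plus a pointer to the commit it was based on."""

    message: str
    files: dict
    parent: Optional["Commit"] = None


class ToyRepo:
    """Toy repository (hypothetical): branches are just names pointing at commits."""

    def __init__(self) -> None:
        self.branches = {"main": Commit("initial commit", {})}
        self.current = "main"

    def branch(self, name: str) -> None:
        """Create a branch at the current commit and switch to it."""
        self.branches[name] = self.branches[self.current]
        self.current = name

    def switch(self, name: str) -> None:
        """Switch to an existing branch."""
        if name not in self.branches:
            raise KeyError(f"no such branch: {name}")
        self.current = name

    def commit(self, message: str, changes: dict) -> None:
        """Record a new snapshot on the current branch."""
        head = self.branches[self.current]
        self.branches[self.current] = Commit(message, {**head.files, **changes}, head)

    def merge(self, source: str) -> None:
        """Fast-forward the current branch to `source`, if its history allows it."""
        target_head = self.branches[self.current]
        node = self.branches[source]
        # Walk back through the source history looking for the current branch head.
        while node is not None and node is not target_head:
            node = node.parent
        if node is None:
            raise ValueError("branches have diverged; this toy model cannot merge them")
        self.branches[self.current] = self.branches[source]

    def log(self) -> list[str]:
        """Commit messages on the current branch, newest first."""
        messages, node = [], self.branches[self.current]
        while node is not None:
            messages.append(node.message)
            node = node.parent
        return messages


repo = ToyRepo()
repo.commit("add README", {"README.md": "hello"})

repo.branch("feature")  # branch off the main development path
repo.commit("add parser", {"parser.py": "def parse(): ..."})

repo.switch("main")  # main is untouched by the work on the feature branch
files_on_main_before_merge = sorted(repo.branches["main"].files)

repo.merge("feature")  # bring the finished work back into main
files_on_main_after_merge = sorted(repo.branches["main"].files)
```

Before the merge, `main` contains only `README.md`; afterwards it also contains `parser.py`. Real tools go further than this sketch and can combine branches that have both changed since they split.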
Despite all of the amazing things that Git does for software development, it does have its limitations. In particular, what’s good for managing code and documents isn’t always good for managing other types of data.
In our next blog post, we’ll talk more about the limitations of Git and what version control for data science looks like.