What is Version Control?


Version control is a crucial part of almost everyone’s workflow. Whether they realize it or not, most people already practice some form of version control, whether they’re working with documents, source code, or almost any other digital file on their file system.

Some methods of version control are obviously stronger than others. For example, saving files under different names to keep track of their versions is a form of version control, albeit a highly error-prone and messy one.

A better way is to use tooling to version our files and keep things organized, without interrupting our workflow. But when it comes to tooling, there are a few considerations to keep in mind. 

So, first let’s take a step back and think about what we actually want out of version control.

1. Archive

The first thing that we want is an archive that tracks all the versions and edits made to our files.

2. Scalability

If we have to save a full copy of a file every time we make an edit, then the storage we need grows with every version, even when most of the content has not changed. This may not seem like a big deal when we’re working with documents or source code, but if we’re working with large binary files or machine learning datasets, then storage costs can grow rapidly. So we want versions to be stored efficiently.
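
To make the storage concern concrete, here is a minimal Python sketch comparing two hypothetical strategies: keeping a full copy of every file for every version, versus storing each distinct file content only once, keyed by its hash. The file names, sizes, and function names are made up for illustration, and this is not meant to describe how any particular tool stores data.

```python
import hashlib


def full_copy_bytes(versions):
    """Storage used if every version keeps its own copy of every file."""
    return sum(len(data) for files in versions for data in files.values())


def deduplicated_bytes(versions):
    """Storage used if identical file contents are stored only once."""
    store = {}  # content hash -> content
    for files in versions:
        for data in files.values():
            store[hashlib.sha256(data).hexdigest()] = data
    return sum(len(data) for data in store.values())


# Hypothetical project: a large dataset that never changes and a small
# script that is edited in each of three versions.
dataset = b"x" * 1_000_000
versions = [
    {"data.bin": dataset, "train.py": b"print('v1')"},
    {"data.bin": dataset, "train.py": b"print('v2')"},
    {"data.bin": dataset, "train.py": b"print('v3')"},
]

print(full_copy_bytes(versions))     # 3000033
print(deduplicated_bytes(versions))  # 1000033
```

With full copies, the unchanged dataset is paid for three times. With deduplication, only the small script adds to the total on each new version.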

3. Workflow

We also want our version control system to support a reasonable workflow, one that lets us try out changes without disrupting work that is already in good shape.

4. Collaboration

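Finally, we want to be able to collaborate with others, bringing together work from different people while keeping a single source of truth for the project.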

Version Control in Practice

Now, let’s take a look at an example of a version control tool to see how these features work together in action.

By far the most popular version control tool for software development is Git. This is because it delivers almost all of the benefits we want from version control for software projects.

  1. Git provides us with an archive for our project by storing all the changes made to our files in what it calls a repository.
  2. Git stores each commit as a snapshot of the project, reusing the stored copy of any file that has not changed and compressing what it stores, so the history stays compact and you can always restore a specific version when you need to.
  3. Git also introduces the concept of branches, which facilitate a developer-friendly workflow. These branches allow us to ‘branch off’ the main development path of a project to work on an idea or feature without disturbing the main state of development. When our work is ready, we can then merge it back into the main branch, moving our project along safely. 
  4. From the collaboration perspective, Git is best used with additional services like GitHub or GitLab. These services manage the complexities of bringing work together from different developers and maintaining that single source of truth that we need. 
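
The archive and branching ideas in the list above can be sketched with a toy, in-memory model. This is a minimal Python illustration, not Git's actual storage format or API: the `ToyRepo` class and all of its methods are hypothetical, and its `merge` only handles the simple case where the target branch has not moved since the branch was created (what Git calls a fast-forward).

```python
import hashlib
import json


class ToyRepo:
    """A toy model of snapshots and branches (not Git's real format)."""

    def __init__(self):
        self.commits = {}               # commit id -> commit record
        self.branches = {"main": None}  # branch name -> latest commit id
        self.current = "main"

    def commit(self, files, message):
        """Record a snapshot of `files` on the current branch."""
        parent = self.branches[self.current]
        record = {"files": dict(files), "parent": parent, "message": message}
        payload = json.dumps(record, sort_keys=True).encode()
        commit_id = hashlib.sha1(payload).hexdigest()[:8]
        self.commits[commit_id] = record
        self.branches[self.current] = commit_id
        return commit_id

    def branch(self, name):
        """Create a branch at the current commit and switch to it."""
        self.branches[name] = self.branches[self.current]
        self.current = name

    def checkout(self, name):
        """Switch to an existing branch."""
        self.current = name

    def files(self):
        """Return the files in the latest snapshot of the current branch."""
        head = self.branches[self.current]
        return dict(self.commits[head]["files"]) if head else {}

    def _is_ancestor(self, ancestor, commit_id):
        """Walk parent links from `commit_id` looking for `ancestor`."""
        while commit_id is not None:
            if commit_id == ancestor:
                return True
            commit_id = self.commits[commit_id]["parent"]
        return ancestor is None

    def merge(self, source):
        """Fast-forward the current branch to `source`, if possible."""
        target_head = self.branches[self.current]
        source_head = self.branches[source]
        if not self._is_ancestor(target_head, source_head):
            raise ValueError("branches diverged; this toy only fast-forwards")
        self.branches[self.current] = source_head


repo = ToyRepo()
repo.commit({"app.py": "print('hello')"}, "initial version")

repo.branch("feature")  # branch off the main line of development
repo.commit({"app.py": "print('hello')", "util.py": "# helper"}, "add helper")

repo.checkout("main")
main_before = repo.files()  # main is untouched by the feature work
repo.merge("feature")       # bring the finished work back into main
main_after = repo.files()

print(sorted(main_before))  # ['app.py']
print(sorted(main_after))   # ['app.py', 'util.py']
```

Real Git also handles the case where both branches have new commits, which requires combining changes from both sides. The toy deliberately leaves that out.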

Concluding Thoughts

Despite all of the amazing things that Git does for software development, it does have its limitations. In particular, what’s good for managing code and documents isn’t always good for managing other types of data.

In our next blog post, we’ll talk more about the limitations of Git and what version control for data science looks like.

Check out our corresponding video to learn more about Data Versioning.