Version Control for Data Science

In our last blog, we talked about version control and how it focuses on tracking and managing changes in software development. 

But what about version control in data science? 

In many industries, such as healthcare and finance, reproducibility and model compliance are requirements for machine learning. But version control can do a lot more for data science: it can make your experimentation faster and more reliable, and it helps organize changes when working with a team.

Version Control for Machine Learning

The main difference between version control for machine learning and version control for software development is that there are more moving pieces. The two main pieces are our machine learning code and our data.

When applying machine learning in the real world, it’s common to see development happening on both of these. We constantly incorporate new observations into our dataset to keep our model relevant, while also applying new techniques and training methods in our model development.

So what we end up with is a symbiotic relationship between our code and our data. 

But managing the interactions between our code and our data can become unwieldy. For example, if our model makes an incorrect prediction, the issue could lie in our code or in our data. To trace the source of the error, we need to be able to inspect the history of changes to our code and configuration, as well as any changes to our data.

This is a problem when we want something to be reproducible, because we must be able to recover the correct versions of each component that contributed to the outcome. 

In data science, this means versioning all the different parts that came together to create our model. This includes our code, configuration (such as hyperparameters), and our data. Even the versions of the libraries we’re using can have a huge impact on the outcome of our model.
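To make this concrete, here is a minimal sketch of recording everything that went into a model in a single manifest. It is only an illustration, not how any particular tool works: the function names (`build_run_manifest`, `run_id`) and the manifest fields are our own assumptions, and the code version is simply passed in as a string (for example, a Git commit hash).

```python
import hashlib
import json
import sys


def sha256_bytes(payload: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(payload).hexdigest()


def build_run_manifest(code_version, config, data, library_versions):
    """Collect the versions of every component that contributed to a run.

    All field names are hypothetical. `code_version` is assumed to be
    something like a Git commit hash supplied by the caller.
    """
    # Sorting the keys makes the config hash independent of key order.
    config_bytes = json.dumps(config, sort_keys=True).encode("utf-8")
    return {
        "code_version": code_version,
        "config_hash": sha256_bytes(config_bytes),
        "data_hash": sha256_bytes(data),
        "libraries": dict(sorted(library_versions.items())),
        "python": f"{sys.version_info.major}.{sys.version_info.minor}",
    }


def run_id(manifest) -> str:
    """Derive one identifier for the run from the whole manifest."""
    encoded = json.dumps(manifest, sort_keys=True).encode("utf-8")
    return sha256_bytes(encoded)
```

If any one component changes (a new row in the data, a different hyperparameter, an upgraded library), the identifier changes too, so two models with the same identifier were built from the same recorded inputs.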

Simply put, there are 3 areas that need to be considered when applying version control to data science. 

  1. Code
  2. Data
  3. Pipelines (Code + Data)

Code

As we saw in our previous video, version control for code is largely solved with tools like Git and GitHub. We can reliably version our code and collaborate on it with a team.

Data

In the case of data, we also need something that can track and version our changes, but this is a very different problem. Git is designed primarily for code in the form of text files, so when it comes to data, there is a real mismatch.

And this mismatch occurs for a few reasons. Our datasets are typically: 

In order to find tooling that can version and manage our data, we need something that satisfies all of these requirements.

Most tooling tends to rely on object storage at some level. But when evaluating tooling, we must also consider the other aspects of version control, such as storage costs, workflow, and collaboration.
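As a rough illustration of why object storage fits data versioning, here is a small sketch of a content-addressed store. It is a toy under stated assumptions, not how any specific tool is implemented: a local directory stands in for an object store, and `ContentStore` and `commit_dataset` are hypothetical names.

```python
import hashlib
from pathlib import Path


class ContentStore:
    """Toy content-addressed store; a local directory stands in for object storage."""

    def __init__(self, root):
        self.root = Path(root)
        self.root.mkdir(parents=True, exist_ok=True)

    def put(self, payload: bytes) -> str:
        """Store the bytes under their SHA-256 digest and return the digest."""
        digest = hashlib.sha256(payload).hexdigest()
        target = self.root / digest
        # Identical content maps to the same key, so it is written only once.
        if not target.exists():
            target.write_bytes(payload)
        return digest

    def get(self, digest: str) -> bytes:
        """Fetch the bytes previously stored under a digest."""
        return (self.root / digest).read_bytes()


def commit_dataset(store, files):
    """Record one dataset version as a mapping of file name to content digest.

    `files` is a dict of file name to bytes.
    """
    return {name: store.put(payload) for name, payload in sorted(files.items())}
```

Each dataset version is just a small mapping of names to digests. Files that did not change between versions point to the same stored object, which is one way to keep storage costs down while still being able to recover any earlier version.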

Pipelines

The third piece of the puzzle for reproducibility is tracking how we’re combining our data with our code. The simplest example of this is the model training process, but there are usually many stages before we get there, where we’re preprocessing or formatting our datasets to be ready for training.

These pipelines need to contain all of our code and configuration along with any library or system requirements that may be needed to transform data into the expected output.
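One way to picture this is to give every pipeline stage a fingerprint built from its code version, configuration, requirements, and inputs. The sketch below is an assumption-laden example rather than a real tool's format: the `Stage` class, its fields, the stage names, and the pinned library versions are all hypothetical.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One pipeline step and everything needed to reproduce it (hypothetical fields)."""

    name: str
    code_version: str    # e.g. a Git commit hash
    config: dict         # e.g. hyperparameters
    requirements: tuple  # pinned library or system requirements
    inputs: tuple        # hashes of input data or fingerprints of upstream stages

    def fingerprint(self) -> str:
        """Hash every component that can change this stage's output."""
        record = {
            "name": self.name,
            "code_version": self.code_version,
            "config": self.config,
            "requirements": sorted(self.requirements),
            "inputs": list(self.inputs),
        }
        encoded = json.dumps(record, sort_keys=True).encode("utf-8")
        return hashlib.sha256(encoded).hexdigest()


def build_pipeline(raw_data_hash, clean_config, train_config):
    """Chain two example stages: the training stage consumes the cleaning stage."""
    clean = Stage("clean", "abc123", clean_config,
                  ("pandas==2.2.0",), (raw_data_hash,))
    train = Stage("train", "abc123", train_config,
                  ("scikit-learn==1.4.0",), (clean.fingerprint(),))
    return clean, train
```

Because the training stage lists the cleaning stage's fingerprint as an input, any change upstream (new raw data, a different preprocessing setting) also changes the training fingerprint. That chain of fingerprints is a simple form of lineage.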

But tracking each of these components is where the complexity begins to creep in. With most tools, you usually have to sacrifice one or more of them, or commit to building glue software to manage the interactions yourself.

As we saw, Git is well suited to managing our code locally, but it requires something more to manage collaboration across a team.

So the real question is: how do we version our data while also ensuring that, when code is applied to it, we can track what has happened in a reliable and reproducible way?

In the next blog, we'll talk about how we're trying to solve these problems from the bottom up with Pachyderm: how we version and manage data, and how you can create reproducible pipelines that give you a full lineage of everything that's happened.

Check out our corresponding video to learn more about Data Versioning for Data Science.