A Guide to Data Lineage

Data Lineage means knowing, with certainty, the complete journey of your data, code, models, and the relationships between them.

What is Data Lineage?

Data lineage describes the entire life cycle of your data, from start to finish: where your data came from, how it’s processed, and where it goes. It captures what happens to your data as it moves through each transformation and change along the way.

At every step of your pipeline, you need to understand where your data came from and how it got to its current state. While reproducibility allows you to uniquely point at a particular state of your data, lineage (sometimes also called provenance) allows you to understand the entire journey your data took to give you a specific result. In other words, it shows you context. It allows you to trace any surprising result back to the beginning, by letting you examine each step of your analysis in tremendous detail.

Why is Data Lineage Important?

Understanding the history of your data gives you deep and highly valuable insights into why your model behaves the way it does. It also simplifies training and makes reproducing experiments much easier by letting you track the root cause of any anomalies or strange results.

How does Pachyderm Accomplish Data Lineage?

Data Lineage means the ability to track any result all the way back to its raw input, including all analysis, parameters, code, and intermediate results.

1. Establish an Origin

In the first stage of the data lineage pipeline, we ingest all of our data into an object store like S3 or MinIO, and the Pachyderm File System (PFS) takes control, labeling and tagging it.
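For example, a minimal sketch of that ingest step using the pachctl CLI might look like the following (the repo name and file paths are hypothetical, and exact flags may vary with your Pachyderm version):

# Create a versioned repo to act as the system of record for raw data
pachctl create repo raw-data

# Ingest a local file; Pachyderm stores it as a new commit on the master branch
pachctl put file raw-data@master:/sales/2021-01.csv -f 2021-01.csv

# Confirm what landed, and where
pachctl list file raw-data@master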

2. Track and Version

Pachyderm tracks any and all changes to your data, keeping immutable versions of each step. This allows you to see any change at will as your data moves through various pipeline stages.
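As a rough illustration, that version history is visible directly from the CLI (repo name hypothetical):

# Every put file or pipeline run produces a new, immutable commit
pachctl list commit raw-data@master

# Browse the repo exactly as it existed at any earlier commit
pachctl list file raw-data@<commit-id>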

3. Audit and Rollback

Pachyderm allows you to quickly audit which change made a difference in your model, or deal with compliance issues with ease. Roll backwards and forwards to different points in time to ensure you can always reproduce any result.
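One way to sketch a rollback, assuming a master branch and a known earlier commit ID:

# Inspect the commit you want to audit or return to
pachctl inspect commit raw-data@<commit-id>

# Roll the branch back by pointing its head at that earlier commit;
# pipelines subscribed to the branch then re-run against the restored data
pachctl create branch raw-data@master --head <commit-id>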

Reproducibility is the Key to Better Data Science

When your data, your models and your code are all changing at the same time, how do you keep track of all the versions?

Changing data changes your experiments. If your data changes after you’ve run an experiment, you can’t reproduce that experiment. Reproducibility is the essential bedrock of every data science project. For continually updated models, new data can change the performance of an algorithm as it retrains. Perhaps that new influx of information has outliers, inconsistencies, or corruptions that your team couldn’t see at the outset. Suddenly a production fraud detection model is showing too many false positives, and customers are calling in upset as their accounts get suspended.

Even a simple change to the underlying data can wreak havoc on reproducible data science.

Choosing a Data Lineage Tool

Which data lineage tool is right for you? There are a few leading platforms to choose from, but you want to make sure your tool has everything you need to find success. When choosing a data lineage tool, keep an eye out for systems that offer immutability, data versioning, and team collaboration.

Immutability

Most data lineage platforms of the past failed for a few simple reasons. The biggest one is a lack of immutability. If your system logs changes to a dataset but you can alter that dataset without keeping old versions of the data, then your logging is worthless: you’ve got a log that points to a snapshot in time that no longer exists.
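With immutable versioning, a log entry always has real data behind it. A small sketch (hypothetical repo and path):

# Commits are immutable: even after the branch has moved on,
# the exact bytes behind an old commit can still be read back
pachctl get file raw-data@<old-commit-id>:/sales/2021-01.csv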

Versioning

Metadata logging systems, like a database, keep an audit trail, but they don’t keep the deltas between changed versions of your data. The data can change out from under you, and now all you have is an entry in the database that no longer reflects the real world.
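A versioned store, by contrast, can show you the actual delta between two states of the data rather than just a log entry about it. For example (hypothetical repo and paths):

# Show what changed in the latest commit relative to its parent
pachctl diff file raw-data@master:/sales/

# Or compare two specific commits directly
pachctl diff file raw-data@<new-commit>:/sales/ raw-data@<old-commit>:/sales/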

Collaboration

As your data science teams grow, it’s more crucial than ever for your teams to know who made what change and when. If one data scientist can change your data or a model and the rest of the team doesn’t know why that data changed, it can set a project back by days or months. Pachyderm allows you to scale teams simply, with robust role-based access control and smooth collaboration across the board.
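As a rough sketch of what that looks like in practice, access is managed per repo with role bindings (the user, repo, and role names below are illustrative, and exact role names depend on your Pachyderm version):

# Grant a teammate read-only access to the raw data repo
pachctl auth set repo raw-data repoReader user:alice@example.com

# Review who can touch the repo, and with which role
pachctl auth get repo raw-data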

Data Lineage & Pachyderm

In Pachyderm, lineage & metadata are ironclad. They go hand in hand every step of the way. Every change to your data gets tracked automatically behind the scenes. You can’t go around the system and make a change that renders all your logs and commits worthless, and that makes all the difference in choosing the right data lineage platform.

Pachyderm keeps track of every single piece of data, whether it's an input, output, parameter, or model binary. All of it gets fully versioned and tracked. Lineage is an inherent property of the data, not a metadata add-on. Even if a change derives from another commit, Pachyderm captures that information to create a provenance/lineage chain that adds up to a powerful “stacktrace” for your data.
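For instance, you can walk that chain from any output commit back to its raw inputs (repo names hypothetical; the exact output format varies by Pachyderm version):

# Inspect an output commit; its provenance ties it to the input
# commits and the pipeline version that produced it
pachctl inspect commit model-output@<commit-id>

# Repeat for each upstream commit, all the way back to the raw data
pachctl inspect commit raw-data@<commit-id>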

If you are ready to take advantage of the latest in data lineage, schedule a free demo with the experts at Pachyderm today!

Request a Demo