How Does Pachyderm Accomplish Data Lineage

  1. Establish an Origin.

    As raw data pours in from different sources it gets processed by many teams and transformed a hundred different ways. To recreate any model or result you must be able to trace every change to your data to really understand why that model works or fails.

  2. Track and Version

    Unlike most AI/ML platforms which force you into the error prone process of manually tracking changes, Pachyderm automatically tracks every change to your data, keeping immutable versions of code, intermediate results, models and the metadata that ties it all together.

  3. Audit and Rollback

    Pachyderm allows you to quickly audit differences in your model or to deal with complex compliance issues with ease. Roll backwards and forwards to different points in time to ensure you can always reproduce any result or correct any anomalies.

Why is Data Lineage Important to Your Business?

Collaboration

Data science teams leverage shared data resources and to build on each other’s work and insights. Pachyderm lets you work seamlessly across your team using a Git-style collaboration system.

Reproducibility

The Pachyderm File System’s (PFS) iron clad immutability guarantees your models work the same way in development, testing, debugging, and production, whether you’re working locally, in your data centers, or the cloud.

Audit Trail

Pachyderm gives you the audit trail you need to comply with regulations like the GDPR. If you’re working with a customer that has the “right to be forgotten” you can go back to the point in time that their data flowed into the system and scrub it easily.

How Does Pachyderm Accomplish Data Lineage?

Pachyderm’s data lineage platform tracks the machine learning development lifecycle from start to finish. As a model makes its way through training and into production, flowing along a DAG (Directed Acyclic Graph), Pachyderm keeps a complete record of the model’s every move. It does that with commits, similar to Git, but for data, models and code. Pachyderm’s versioning file system keeps every change no matter how small. That means no data can change after you’ve logged its metadata, unlike other platforms, which aren’t immutable. Data scientists and data engineers can track any result all the way back to its raw input, including all analysis, parameters, code, and intermediate results.

What is Data Lineage & Why is it Important

Data lineage means the entire life cycle of your data from start to finish. It’s knowing the complete journey your data takes over time. It describes what happens as your data goes through various transformations and changes.

In AI/ML that means tracking changes to data, your models, your results and your code, as well as how all those changes link together. Data science teams may track 100s of models and do 1000s of training sessions and experiments. Pachyderm’s data lineage system lets them reproduce any of those training results perfectly so they can see what went right or what went wrong.

Reproducibility is the Key to Better Data Science

When your data, models, parameters, and code are all changing at the same time, how do you keep track of all the versions and permutations?

Changing data changes your experiments. If your data changes after you’ve run an experiment you can’t reproduce that experiment. For continually updated models, new data can change the performance of an algorithm as it retrains. Maybe that new influx of information has outliers, inconsistencies, or corruptions that your team couldn’t see at the outset. Suddenly a production fraud detection model is showing too many false positives and customers are calling in upset as their accounts get suspended.

Even a simple change to the underlying data can wreak havoc on reproducible data science.

Industries That Can Benefit From Data Lineage

Automotive

Auto makers increasingly turn to AI/ML for everything from designing new cars, to teaching cars to drive themselves. When you’re building a machine that works in the real world, safety and security are essential. You need the ability to deliver explainable and repeatable data science at scale to create a safe and top selling new car. Whether you’re designing new cars for increased aerodynamics, or Advanced Driver Assistance Systems (ADAS) Pachyderm can form the foundation for your data science teams to create the smart cars of today and tomorrow.

Learn more
Oil & Gas

Oil and Gas companies use ML to unlock insights from deposits hidden deep in their data. Whether you’re running through seismic and geographical reports or tracking fleets of sensors all through the energy delivery pipeline you need strong data lineage. Anomalies in that data can cause tremendous damage to systems in the real world. Oil & Gas companies leverage Pachyderm’s data lineage to make sure they can track any problem back to its source.

Choosing a Data Lineage Tool

Immutability

Most data lineage platforms fail because they lack immutability. If your system logs changes to a dataset but you can alter that data without keeping older versions, then your metadata logging is worthless because your log points to a snapshot that no longer exists. Without immutability you can’t trust your data.

Versioning

A good ML platform keeps all versions of your data because data can and does change over time. It’s not enough to log what directory of files your model trained on. You have to know which version of the data was on that mount, and how to get to that version, or you can’t recreate your run.

Collaboration

As your data science teams grow it’s more crucial than ever for them to know who made what change and when. If one data scientist can change your data or a model and the rest of the team doesn’t know why that data changed it can set a project back by days or months. Pachyderm allows you to scale teams simply, with robust role based access control and smooth collaboration across the board.

Data Lineage & Pachyderm

In Pachyderm, lineage & metadata are intrinsically linked. They go hand in hand every step of the way. Every change to your data gets invisibly tracked behind the scenes. You can’t go around the system and make a change that makes all your logs and commits worthless.

Pachyderm fully versions and tracks every input, output, parameter, or model binary. Lineage is an inherent property of the data, not a metadata add-on.

Even if the change derives from another commit, Pachyderm captures that information to create a provenance/lineage chain that adds up to a powerful “stacktrace” for your data.

Request a Demo