The modern vehicle is incredibly intelligent. Computers continue to revolutionize our rides with collision warnings, blind spot detection and automatic breaking. Pachyderm helps the auto industry version control petabytes of data to power tomorrow's breakthroughs like cars that drive themselves and alert emergency services when they need help.
Data Lineage means knowing, with certainty, the complete journey of your data, code, models, and the relationships between them.
Data lineage describes the entire life cycle of your data from start to finish. It’s knowing the entire journey your data takes over time. That’s where your data came from, how it’s processed and where it goes. It describes what happens to your data as it goes through various transformations and changes.
At every step of your pipeline, you need to understand where your data came from and how it got to its current state. While reproducibility allows you to uniquely point at a particular state of your data, lineage (sometimes also called provenance) allows you to understand the entire journey your data took to give you a specific result. In other words, it shows you context. It allows you to backtrace any surprising result to the beginning, by letting you examine each step of your analysis in tremendous detail.
Data lineage is essential to every industry that uses data science tools and technology. It provides three elements necessary for successful data science: Collaboration, Reproducibility, and Audit Trails.Collaboration: Share Insights For Better Data Science
Data science teams leverage shared data resources to build on each other’s work and insights. Pachyderm lets you work seamlessly across your team using a Git-style collaboration system that keeps track of your data, models, code and the deeply interconnected relationships between them as they change over time.Reproducibility: Recreate Your Results Every Time
The Pachyderm File System (PFS) guarantees your models work the same way in development, testing, debugging, and production, whether you’re working locally, in your data centers, or the cloud. It’s Pachyderm’s iron-clad data immutability that makes all the difference. It lets you reproduce your data, development workflow, and results so you and your partners can collaborate with ease and know beyond a shadow of a doubt that you can recreate your experiments every time.Audit Trail: Regulation Compliance Made Easy
Pachyderm gives you the audit trail you need to comply with regulations like the GDPR. If you’re working with a customer that has the “right to be forgotten” you can go back to the point in time that his or her data flowed into the system and scrub it easily. You’ll also know every step that your customer’s information took if there is a problem that demands you show where that data came from and how it got to where it is right now.
Understanding the history of your data gives you deep and highly valuable insights into why your model behaves the way it does. It also simplifies training and makes reproducing experiments much easier by letting you track the root cause of any anomalies or strange results.
Data Lineage means the ability to track any result all the way back to its raw input, including all analysis, parameters, code, and intermediate results.
In the first stage of the data lineage pipeline we ingest all of our data into an object store like S3 or Minio and the Pachyderm File System (PFS) take control, labeling and tagging it.
Pachyderm tracks any and all changes to your data, keeping immutable versions of each step. This allows you to see any change at will as your data moves through various pipeline stages.
Pachyderm allows you to quickly audit which change made a difference in your model or deal with compliance issues with ease. Roll backwards and forwards in time to different points in time to ensure you can always reproduce any result.
When your data, your models and your code are all changing at the same time, how do you keep track of all the versions?
Changing data changes your experiments. If your data changes after you’ve run an experiment you can’t reproduce that experiment. Reproducibility is the essential bedrock of every data science project. For continually updated models, new data can change the performance of an algorithm as it retrains. Perhaps that new influx of information has outliers, inconsistencies, or corruptions that your team couldn’t see at the outset. Suddenly a production fraud detection model is showing too many false positives and customers are calling in upset as their accounts get suspended.
Banking thrives on algorithms that do everything from fraud detection, to high frequency trading. Now a new age of intelligent applications demands Pachyderm’s ability to easily chain together dozens of cutting-edge frameworks to give banks the edge they need to compete in the modern market.
The scientific method’s reproducibility gave birth to the scientific revolution. As biotech firms turn to AI/ML to drive next generation drug discovery, they need a new kind of reproducibility: Pachyderm’s data lineage delivers the scientific method data scientists need to create scalable, repeatable experiments now.
Most data lineage platforms of the past failed because of a few simple reasons. The biggest one is immutability. If your system logs changes to a dataset but you can alter that data set without keeping old versions of that data, then your logging is worthless because you’ve got a log that points to a snapshot in time that no longer exists.
Metadata logging systems, like a database, keep an audit trail, but they don’t keep the deltas between changed versions of your data. The data can change out under your nose and now all you have is an entry in the database that’s no longer reflective of the real world.
As your data science teams grow, it’s more crucial than ever for your teams to know who made what change and when. If one data scientist can change your data or a model and the rest of the team doesn’t know why that data changed it can set a project back by days or months. Pachyderm allows you to scale teams simply, with robust role base access control and smooth collaboration across the board.
In Pachyderm, lineage & metadata are ironclad. They go hand in hand every step of the way. Every change to your data gets invisibly tracked behind the scenes. You can’t go around the system and make a change that makes all your logs and commits worthless and that makes all the difference in choosing the right data lineage platform.
Pachyderm keeps track of every single piece of data -- whether it's an input, output, parameter, or model binary. All of it gets fully versioned and tracked. Lineage is an inherent property of the data, not a metadata add-on. Even if the change derives from another commit, Pachyderm captures that information to create a provenance/lineage chain that adds up to a powerful “stacktrace” for your data.
If you are ready to take advantage of the latest in data lineage, schedule a free demo with the experts at Pachyderm today!Request a Demo