Algorithms have a long and rich history in the financial world. Today’s financial institutions are intensely focused on using data science, automated trading, machine learning (ML), and artificial intelligence (AI) to drive customer experiences and the bottom line to new heights. They also face a long list of regulations, which makes compliance and the ability to audit their data essential to their legal standing.
Data Lineage means knowing, with certainty, the complete journey of your data, code, models, and the relationships between them.
Data lineage describes the entire life cycle of your data: where it came from, how it was processed, and where it goes. It captures the complete journey your data takes over time, including every transformation and change along the way.
At every step of your pipeline, you need to understand where your data came from and how it got to its current state. While reproducibility allows you to uniquely point at a particular state of your data, lineage (sometimes also called provenance) allows you to understand the entire journey your data took to give you a specific result. In other words, it shows you context. It allows you to backtrace any surprising result to the beginning, by letting you examine each step of your analysis in tremendous detail.
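The idea of backtracing a result can be sketched in a few lines of code. This is a conceptual illustration, not Pachyderm's actual API: each artifact records the step that produced it and the inputs that step consumed, so any output can be walked back to its raw inputs.

```python
# Conceptual sketch of a lineage backtrace (illustrative names, not
# Pachyderm's API): every artifact knows which step produced it and
# which parent artifacts that step consumed.

from dataclasses import dataclass, field

@dataclass
class Artifact:
    name: str
    step: str = "raw input"                       # step that produced it
    parents: list = field(default_factory=list)   # inputs that step consumed

def backtrace(artifact):
    """Walk the lineage graph from a result back to its raw inputs."""
    trail = [(artifact.name, artifact.step)]
    for parent in artifact.parents:
        trail.extend(backtrace(parent))
    return trail

raw = Artifact("transactions.csv")
cleaned = Artifact("cleaned.parquet", step="deduplicate", parents=[raw])
model = Artifact("fraud_model.bin", step="train", parents=[cleaned])

for name, step in backtrace(model):
    print(f"{name}  <-  produced by: {step}")
```

A surprising model result can then be examined step by step, from the trained binary all the way back to the raw input file.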
Data science teams leverage shared data resources to build on each other’s work and insights. Pachyderm lets you work seamlessly across your team using a Git-style collaboration system that keeps track of your data, models, code and the deeply interconnected relationships between them as they change over time.

Reproducibility
The Pachyderm File System (PFS) guarantees your models work the same way in development, testing, debugging, and production, whether you’re working locally, in your data centers, or in the cloud. It’s Pachyderm’s iron-clad data immutability that makes all the difference. It lets you reproduce your data, development workflow, and results so you and your partners can collaborate with ease and know beyond a shadow of a doubt that you can recreate your experiments every time.

Audit Trail
Pachyderm gives you the audit trail you need to comply with regulations like the GDPR. If you’re working with a customer who exercises the “right to be forgotten,” you can go back to the point in time when their data flowed into the system and scrub it easily. You’ll also know every step your customer’s information took if a problem demands that you show where that data came from and how it got to where it is right now.
Understanding the history of your data gives you deep and highly valuable insights into why your model behaves the way it does. It also simplifies training and makes reproducing experiments much easier by letting you track the root cause of any anomalies or strange results.
Data Lineage means the ability to track any result all the way back to its raw input, including all analysis, parameters, code, and intermediate results.
In the first stage of the data lineage pipeline, we ingest all of our data into an object store like S3 or MinIO, and the Pachyderm File System (PFS) takes control, labeling and tagging it.
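Content-addressed tagging at ingestion can be sketched as follows. This is a minimal illustration of the concept, not Pachyderm's actual storage format or API: each ingested file gets a tag derived from a hash of its contents, so identical data always maps to the same tag and any modification produces a new, distinct version.

```python
# Conceptual sketch of content-addressed ingestion (illustrative names,
# not Pachyderm's API): data is tagged by the hash of its contents.

import hashlib

def ingest(store, path, data: bytes):
    """Store data under a content-derived tag and record the version history."""
    tag = hashlib.sha256(data).hexdigest()[:12]          # content-derived tag
    store.setdefault("objects", {})[tag] = data          # immutable object
    store.setdefault("files", {}).setdefault(path, []).append(tag)
    return tag

store = {}
v1 = ingest(store, "sales/jan.csv", b"id,amount\n1,100\n")
v2 = ingest(store, "sales/jan.csv", b"id,amount\n1,100\n2,250\n")

print(v1 != v2)                            # changed contents -> new tag
print(store["files"]["sales/jan.csv"])     # both versions stay addressable
```

Because the tag is derived from the data itself, an old version can never be silently overwritten: changing the bytes changes the tag.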
Pachyderm tracks any and all changes to your data, keeping immutable versions of each step. This allows you to see any change at will as your data moves through various pipeline stages.
Pachyderm allows you to quickly audit which change made a difference in your model, or deal with compliance issues with ease. Roll backward and forward to different points in time to ensure you can always reproduce any result.
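Rolling backward and forward works because history is append-only. A minimal sketch of the idea, using illustrative names rather than Pachyderm's API: every write creates a new commit, old commits are never modified, and any past state can be checked out at will.

```python
# Conceptual sketch of immutable commits with rollback (illustrative,
# not Pachyderm's API): history is append-only, so every past state
# remains reproducible.

class VersionedRepo:
    def __init__(self):
        self.commits = []                      # append-only commit history

    def commit(self, files: dict):
        self.commits.append(dict(files))       # snapshot; never mutated later
        return len(self.commits) - 1           # commit id

    def checkout(self, commit_id):
        return dict(self.commits[commit_id])   # reproduce any past state

repo = VersionedRepo()
c0 = repo.commit({"model.pkl": "weights-v1"})
c1 = repo.commit({"model.pkl": "weights-v2"})

print(repo.checkout(c0))   # roll back: the v1 state is still intact
print(repo.checkout(c1))   # roll forward again
```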
When your data, your models and your code are all changing at the same time, how do you keep track of all the versions?
Changing data changes your experiments. If your data changes after you’ve run an experiment you can’t reproduce that experiment. Reproducibility is the essential bedrock of every data science project. For continually updated models, new data can change the performance of an algorithm as it retrains. Perhaps that new influx of information has outliers, inconsistencies, or corruptions that your team couldn’t see at the outset. Suddenly a production fraud detection model is showing too many false positives and customers are calling in upset as their accounts get suspended.
Life sciences companies always stand at the cutting edge of technology, whether that’s biotechnology, industrial gene sequencing, or computing. Today machine learning (ML) and artificial intelligence (AI) adoption has surged inside the halls of biotech companies big and small. Just as genetic engineers and lab techs need the repeatable, testable power of the scientific method, data science teams need reproducibility across their AI/ML pipelines so they can recreate all their results. Data versioning helps ensure the safety and efficacy of their findings as machine learning models discover new drugs, treatments, and promising experimental pathways.
Technology powers the cars of today and tomorrow. It wasn’t long ago that cars had only analog tech under the hood, but now cars are rolling computers, with everything from anti-lock brakes to adaptive cruise control systems powered by silicon and algorithms. Everything from the design process to the cars themselves employs machine learning from top to bottom. When you’re working with a machine that interacts with the real world, safety and security are paramount. You need the ability to deliver explainable, repeatable data science at scale to design a safe and top-selling new car. Whether you’re designing new cars for increased aerodynamics, creating next-gen autonomous vehicles, or building Advanced Driver Assistance Systems (ADAS), machine learning is now at the heart of the automotive industry.
Most data lineage platforms of the past failed for a few simple reasons. The biggest is a lack of immutability. If your system logs changes to a dataset, but you can alter that dataset without keeping old versions of the data, then your logging is worthless: you’ve got a log that points to a snapshot in time that no longer exists.
Metadata logging systems, like a database, keep an audit trail, but they don’t keep the deltas between changed versions of your data. The data can change out from under you, and all you have left is an entry in the database that no longer reflects the real world.
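The failure mode is easy to demonstrate. A hedged sketch, with illustrative code rather than any particular system's API: the audit log records a fingerprint of the data at write time, but because the underlying store is mutable, an out-of-band overwrite leaves the log pointing at a state that no longer exists.

```python
# Conceptual sketch of why metadata-only audit trails break: the log
# records a fingerprint at write time, but the mutable data store can
# change underneath it. (Illustrative code, not a real system's API.)

import hashlib

def fingerprint(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()

data_store = {"customers.csv": b"alice,100\n"}
audit_log = [("customers.csv", fingerprint(data_store["customers.csv"]))]

# Someone overwrites the file without going through the logging system:
data_store["customers.csv"] = b"alice,100\nbob,250\n"

path, logged_hash = audit_log[0]
# The logged snapshot no longer matches the data it claims to describe:
print(logged_hash == fingerprint(data_store[path]))
```

An immutable, content-addressed store avoids this by making the old version impossible to overwrite in place.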
As your data science teams grow, it’s more crucial than ever for them to know who made what change and when. If one data scientist can change your data or a model and the rest of the team doesn’t know why that data changed, it can set a project back by days or months. Pachyderm allows you to scale teams simply, with robust role-based access control and smooth collaboration across the board.
In Pachyderm, lineage and metadata are ironclad. They go hand in hand every step of the way. Every change to your data is tracked automatically behind the scenes. You can’t go around the system and make a change that renders your logs and commits worthless, and that makes all the difference in choosing the right data lineage platform.
Pachyderm keeps track of every single piece of data -- whether it’s an input, output, parameter, or model binary. All of it gets fully versioned and tracked. Lineage is an inherent property of the data, not a metadata add-on. Even if a change derives from another commit, Pachyderm captures that information to create a provenance/lineage chain that adds up to a powerful “stacktrace” for your data.