Pachyderm delivers the platform data scientists need to develop deep, rigorous data analysis at scale. These core tenets infuse every aspect of our design, so you can focus on data science instead of machine learning operations (MLOps).
Reproducibility provides the fundamental building block of strong data science by letting you consistently reconstruct any previous state of your data, model, and analysis.
The scientific method relies on the feedback loop of repeatedly testing hypotheses. To truly test your hypothesis, you need strict controls that let you evaluate and compare results with absolute consistency. At every step in the design and delivery of your model you need to flawlessly reproduce the input data, the output data, and the analysis in exactly the same way.
Your data is sacred. To protect that data, you need immutability.
Other data science platforms let people change data outside of your pipelines. When that happens, any references to that data become worthless. Your workflow now points to something that no longer exists and that means you can’t recreate an earlier result. That’s a major problem for data science teams because even minor changes to the input can have drastic results on the output model.
The Pachyderm File System (PFS) delivers iron-clad data immutability that lets you reproduce your data, development workflow, and results so you and your partners can collaborate with ease. We leverage existing object stores like Amazon S3 or MinIO to bring you incredibly scalable and dependable data storage, while adding immutability primitives so your data won’t change without you realizing it.
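To see why immutability protects your references, consider content-addressed storage, the general technique behind this kind of guarantee. The sketch below is illustrative only (not Pachyderm's actual implementation): every write produces a new address, so an old reference always points at exactly the bytes it was created with.

```python
import hashlib

class ImmutableStore:
    """A toy content-addressed store: entries are write-once."""

    def __init__(self):
        self._objects = {}  # address -> bytes, never overwritten

    def put(self, data: bytes) -> str:
        # The address is derived from the content itself, so different
        # data can never silently replace an existing entry.
        address = hashlib.sha256(data).hexdigest()
        self._objects.setdefault(address, data)
        return address

    def get(self, address: str) -> bytes:
        return self._objects[address]

store = ImmutableStore()
v1 = store.put(b"training data, version 1")
v2 = store.put(b"training data, version 2")  # a new address, not an overwrite
assert store.get(v1) == b"training data, version 1"  # v1 is still intact
```

Because an address can only ever resolve to one piece of content, any result you recorded against `v1` remains reproducible no matter how the dataset evolves afterward.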
It also guarantees your models work the same way in development, testing, debugging, and production, whether you’re working locally, in your data centers, or the cloud.
Of course, your data is more than just your inputs. Parameters, intermediate results, model binaries and metadata are all data to Pachyderm and all of it gets versioned perfectly.
Beyond your data, your execution environment needs reliable and consistent versioning too.
Pachyderm leverages Docker containers to package up your execution environment and code into a smooth containerized workflow.
Combine containers with highly reproducible data sets and you’ve got a reliable, single source of truth for all the essential parts of your pipeline. That means your data, your tools, and your code flow seamlessly across every one of your environments.
This allows your team to recreate the exact conditions that led to your results. Your colleagues can follow your work step by step and reproduce your work perfectly, every single time.
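In Pachyderm, this pairing of a container image with versioned input data is declared in a pipeline spec. A minimal sketch (the field names follow Pachyderm's pipeline-spec format; the image name, command, and repo name are hypothetical):

```json
{
  "pipeline": { "name": "train-model" },
  "transform": {
    "image": "example/trainer:1.0",
    "cmd": ["python3", "/code/train.py"]
  },
  "input": {
    "pfs": { "repo": "training-data", "glob": "/*" }
  }
}
```

Because the spec names both the exact container image and the versioned input repo, anyone on your team can rerun the same analysis against the same data.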
Data Provenance means the ability to track any result all the way back to its raw input, including all analysis, code, and intermediate results.
At every step of your data pipeline, you need to understand where the data came from and how it reached its current state. While Reproducibility allows you to uniquely point at a particular state of your data, Provenance allows you to understand the journey your data took to give you a specific result. In other words, it shows you context. It allows you to backtrace a surprising result to the beginning, by letting you examine each step of your analysis.
In Pachyderm, every output result is fundamentally linked to its input data and analysis code so you can fully trace its origins as if following beads on a string.
Without Data Provenance, you can’t tell the difference between a meaningful new result and a flaw in the analysis.
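The "beads on a string" idea can be sketched as a simple backward walk: each derived result records the code and inputs it came from, so any output can be traced to its raw data. This is a toy illustration, not Pachyderm's internals; all names and versions are hypothetical.

```python
# result -> (step that produced it, inputs it was produced from)
provenance = {
    "model-v3":    ("train.py@abc123", ["features-v2"]),
    "features-v2": ("clean.py@def456", ["raw-data-v1"]),
    "raw-data-v1": (None, []),  # raw input: no upstream provenance
}

def trace(result: str) -> list:
    """Return the chain of artifacts from a result back to raw inputs."""
    _step, inputs = provenance[result]
    chain = [result]
    for upstream in inputs:
        chain.extend(trace(upstream))
    return chain

assert trace("model-v3") == ["model-v3", "features-v2", "raw-data-v1"]
```

A surprising `model-v3` can now be investigated step by step: was the change in the training code, the feature extraction, or the raw data itself?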
Collaboration means the ability for a data science team to leverage shared data resources and build on each other’s work.
Reproducibility and Provenance provide the building blocks for effective collaboration across your team. When you share your analysis, you know for a fact that everyone else on your team can recreate your results. When your team can retrace the steps of your process, they can fully understand how you got from point A to point B, build on top of your work, and let you build on theirs.
When you have a team developing models in parallel, resolving changes can get difficult fast. Changes to the data, its format, the analysis code, and execution environment can all cause breakdowns in your pipeline.
Pachyderm lets you work seamlessly across your team using a Git-style collaboration system that lets you keep track of your data, your models, your code and the deeply interconnected relationships between them as they change over time.
Incrementality keeps your results synchronized with your input data and prevents redundant processing.
In most large-scale data infrastructures, jobs form a DAG (Directed Acyclic Graph) of transformation steps, where each step depends on the outputs of the steps before it. Typical dependency schedulers, such as Airflow, Luigi, or even Cron, all share one major flaw: they lack incrementality.
Incrementality enables your analysis to stay in sync with your input data. By keeping track of exactly what data has been updated, Pachyderm’s scheduler leverages our powerful data versioning features to only process new or changed data, which eliminates redundant processing, allowing your analysis to finish much faster.
In other words, Pachyderm processes data only when there is new work to do. That means your DAG of pipelines stays in sync with your data automatically without needing constant manual intervention and care.
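The core idea behind this kind of incremental scheduling can be sketched in a few lines: cache each result under a hash of its input, so unchanged data is never reprocessed on the next run. This is an illustrative toy, not Pachyderm's scheduler; the processing step is a hypothetical stand-in.

```python
import hashlib

cache = {}
runs = 0  # counts how many times real processing happened

def process(name: str, data: bytes) -> str:
    """Process an input, skipping work if this exact data was seen before."""
    global runs
    key = (name, hashlib.sha256(data).hexdigest())
    if key not in cache:
        runs += 1  # only new or changed data triggers real work
        cache[key] = f"processed({name})"  # stand-in for an expensive step
    return cache[key]

process("file-a", b"1,2,3")  # new data: processed
process("file-b", b"4,5,6")  # new data: processed
process("file-a", b"1,2,3")  # unchanged data: served from cache
assert runs == 2
```

Scaled up across a DAG of pipelines, this is what keeps results in sync with inputs: when one file changes, only the steps downstream of that file run again.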