Pachyderm Data Science Bill of Rights

Pachyderm delivers the key platform data scientists need to develop deep and rigorous data analysis at scale. These core tenets infuse every aspect of our design, so you can focus on data science instead of machine learning operations (MLOps).

Reproducibility

Reproducibility provides the fundamental building block of strong data science by letting you consistently reconstruct any previous state of your data, model, and analysis.

The scientific method relies on the feedback loop of repeatedly testing hypotheses. To truly test your hypothesis, you need strict controls that let you evaluate and compare results with absolute consistency. At every step in the design and delivery of your model you need to flawlessly reproduce the input data, the output data, and the analysis in exactly the same way.

Reproducible Data

Your data is sacred. To protect that data, you need immutability.

Other data science platforms let people change data outside of your pipelines. When that happens, any references to that data become worthless. Your workflow now points to something that no longer exists, and that means you can’t recreate an earlier result. That’s a major problem for data science teams, because even minor changes to the input can have drastic effects on the output model.

The Pachyderm File System (PFS) delivers ironclad data immutability that lets you reproduce your data, development workflow, and results, so you and your partners can collaborate with ease. We leverage existing object stores like Amazon S3 or MinIO to bring you incredibly scalable and dependable data storage, while adding immutability primitives that guarantee your data won’t change without you realizing it.
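
As a minimal sketch of what this looks like in practice (the repo and file names here are hypothetical, and exact commands vary slightly across Pachyderm versions), every write to PFS lands in an immutable commit that you can reference forever:

    # Create a repo and commit some training data.
    pachctl create repo training-data
    pachctl put file training-data@master:/images/cat.png -f cat.png

    # Every write produces an immutable commit; list them to see the history.
    pachctl list commit training-data

    # Read the file exactly as it existed at any earlier commit.
    pachctl get file training-data@<commit-id>:/images/cat.png > cat-then.png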

Immutability also guarantees your models work the same way in development, testing, debugging, and production, whether you’re working locally, in your data center, or in the cloud.

Of course, your data is more than just your inputs. Parameters, intermediate results, model binaries, and metadata are all data to Pachyderm, and all of it gets versioned perfectly.

Reproducible Execution

Beyond your data, your execution environment needs reliable and consistent versioning too.

Pachyderm leverages Docker containers to package your execution environment and code into a smooth containerized workflow.

Combine containers with highly reproducible data sets and you’ve got a reliable, single source of truth for all the essential parts of your pipeline. That means your data, your tools, and your code flow seamlessly across every one of your environments.

This allows your team to recreate the exact conditions that led to your results. Your colleagues can follow your work step by step and reproduce your work perfectly, every single time.
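
As a brief sketch of how those pieces fit together (the repo, image, and script names below are placeholders): a pipeline spec pins both the environment, via the container image, and the data, via the input repo, so the spec fully describes the run:

    # Hypothetical pipeline spec pairing a pinned image with a versioned repo.
    cat > edges.json <<'EOF'
    {
      "pipeline": { "name": "edges" },
      "transform": {
        "image": "pachyderm/opencv",
        "cmd": ["python3", "/edges.py"]
      },
      "input": {
        "pfs": { "repo": "images", "glob": "/*" }
      }
    }
    EOF
    pachctl create pipeline -f edges.json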

Data Provenance

Data Provenance means the ability to track any result all the way back to its raw input, including all analysis, code, and intermediate results.

At every step of your data pipeline, you need to understand where the data came from and how it reached its current state. While Reproducibility allows you to uniquely point at a particular state of your data, Provenance allows you to understand the journey your data took to give you a specific result. In other words, it shows you context. It allows you to trace a surprising result back to the beginning by letting you examine each step of your analysis.

In Pachyderm, every output result is fundamentally linked to its input data and analysis code so you can fully trace its origins as if following beads on a string.

Without Data Provenance, you can’t tell the difference between a meaningful new result and a flaw in the analysis.
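
To sketch what that tracing looks like from the command line (the commit ID below is a placeholder, and output details vary by Pachyderm version), you can interrogate any output commit for its origins:

    # Inspect an output commit; its provenance ties it back to the exact
    # input commits and pipeline version that produced it.
    pachctl inspect commit edges@<commit-id>

    # List the jobs that turned inputs into outputs along the way.
    pachctl list job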

Collaboration

Collaboration means the ability of a data science team to leverage shared data resources and build on each other’s work.

Reproducibility and Provenance provide the building blocks for effective collaboration across your team. When you share your analysis, you know for a fact that everyone else on your team can recreate your results. And when your team can retrace the steps of your process, they can fully understand how you got from point A to point B. They can build on top of your work, and you can build on theirs.

When you have a team developing models in parallel, resolving changes can get difficult fast. Changes to the data, its format, the analysis code, and execution environment can all cause breakdowns in your pipeline.

Pachyderm lets you work seamlessly across your team using a Git-style collaboration system that lets you keep track of your data, your models, your code and the deeply interconnected relationships between them as they change over time.
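
As a rough sketch of that Git-style workflow (the repo and branch names are hypothetical):

    # Fork the data at the current head of master into your own branch.
    pachctl create branch training-data@experiment --head master

    # Commit changes on your branch without touching anyone else's work.
    pachctl put file training-data@experiment:/labels.csv -f labels.csv

    # Compare the two lines of work before folding one into the other.
    pachctl diff file training-data@experiment:/ training-data@master:/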

Incrementality

Incrementality keeps your results synchronized with your input data and prevents redundant processing.

In most large-scale data infrastructures, jobs form a DAG (Directed Acyclic Graph) of transformation steps, where each step depends on the outputs of the steps before it. Typical dependency schedulers, such as Airflow, Luigi, or even cron, all have one major flaw: they lack incrementality.

Incrementality enables your analysis to stay in sync with your input data. By keeping track of exactly which data has been updated, Pachyderm’s scheduler leverages our powerful data versioning features to process only new or changed data. That eliminates redundant processing and lets your analysis finish much faster.

In other words, Pachyderm processes data only when there is new work to do. That means your DAG of pipelines stays in sync with your data automatically without needing constant manual intervention and care.
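
A short sketch of what that means day to day (the repo and pipeline names are placeholders): a pipeline’s glob pattern splits its input into datums, and only new or changed datums are processed on each commit:

    # With glob "/*" in the pipeline spec, each top-level file is its own
    # datum. Adding one file triggers work on that datum alone; everything
    # already processed is skipped rather than recomputed.
    pachctl put file images@master:/new-photo.png -f new-photo.png

    # The resulting job reports how many datums ran and how many were skipped.
    pachctl list job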

Autonomy and Infrastructure Abstraction

Autonomy means the ability of data scientists to have complete control over their toolchain and deployment.

Your tools are your choice.

Whether you’re picking tools for productivity and convenience, or for advanced features and optimizations, your choice makes all the difference in the world for your projects and your team. If you have to wait for a cloud-based system to port the tool you need, you’ve already lost. Even worse, if the tool you need never gets support, you may need to move to a different platform, which crushes productivity as your team struggles to reinvent the wheel.

Pachyderm is built on Kubernetes (K8s) and containers, which allows for fully automated and scalable workflows while remaining totally agnostic to your choice of tools. Because Pachyderm provides a perfectly clean abstraction between data science and the infrastructure underneath, it puts data scientists in charge.

Bring whatever tools you need to your pipeline. If you can deploy the tool in a container, you can use it in concert with Pachyderm.
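
For example (the registry, image, and command below are placeholders), packaging your own tool for Pachyderm is just an ordinary container build:

    # Build and publish an image with whatever tools your analysis needs.
    docker build -t my-registry/my-tool:v1 .
    docker push my-registry/my-tool:v1

    # Then point a pipeline's transform at it, exactly as you would with
    # any off-the-shelf image:
    #   "transform": { "image": "my-registry/my-tool:v1", "cmd": ["/run.sh"] }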

That guarantees not just freedom for your tools but control over deployment too. It allows you to know with confidence that the same libraries and frameworks you used on your laptop flow into your training environments and into production with perfect consistency. Turn the exact same container over to your infrastructure team and they can run it at scale without having to worry about what’s inside the box after they’ve put it through their security and scanning systems.

We made Pachyderm simple to get into production, regardless of your current data storage, cloud, or hosted/on-prem infrastructure. Run Pachyderm on your existing infrastructure in your local data center, or move your workloads to the cloud and back again.

By abstracting out your tools, code, models, and the references to your data, we make your data science workflows supremely portable. The only requirements for deploying Pachyderm in production are K8s and Docker.

If you can deploy and run containers, you can use Pachyderm.