Pachyderm Hub 2 Delivers the Data Foundation for Your Machine Learning Lifecycle



Our most cutting-edge release to date lets teams rapidly productionize and scale their ML projects, with free data storage that helps reduce costs

Today we’re announcing the release of Pachyderm Hub 2, the latest version of Pachyderm’s premium cloud offering. Hub 2 gives data science teams the data layer they need to deliver total reproducibility throughout the entire machine learning lifecycle. Data science teams already know that Pachyderm’s unique combination of data versioning and data lineage lets them track every change to their models, code, and data, and shows how they all relate to each other from ingestion to serving. Hub 2 builds on this foundation and delivers a range of essential new features: Global Identifiers, Jupyter notebook support, a dynamic new web console, and a pricing plan that includes free data storage, along with improvements to data versioning and pipeline speed.

Data science and MLOps teams face slow and painful product releases, where debugging a model in production is hard, if not impossible. How can you know exactly which data set was used to train the model? Has it changed? Can you be sure it’s the same data when you have to retrain? Was anything added or deleted by another team? Teams quickly realize they need automated data versioning and immutable data lineage to iterate at scale.

Pachyderm Hub 2 introduces key new features, like Global Identifiers (Global IDs) and a new storage layer, that allow teams to scale their machine learning lifecycle and deliver faster. A Global ID brings together all the data, jobs, parameters, and code that your team used to deliver a final result, giving data scientists, data engineers, and operations teams improved visibility, complete reproducibility, and easier data debugging. Global IDs make every aspect of lineage tracking much easier to understand and much more cost-effective. With one Global ID you can discover exactly what dataset was used, what state it was in, what version of the code ran against it, and more.
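To make the idea concrete, here is a minimal sketch of how a single shared identifier can tie data states and jobs together. It is a toy model, not Pachyderm's implementation: the `LineageIndex` class, the `Record` fields, and the example ID and names are all hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    kind: str    # "commit" (a data state) or "job" (a code run); illustrative labels
    name: str    # hypothetical repo or pipeline name
    detail: str  # free-form description of what was recorded


class LineageIndex:
    """Toy index that groups every record sharing one global ID."""

    def __init__(self):
        self._by_id = defaultdict(list)

    def record(self, global_id, kind, name, detail):
        self._by_id[global_id].append(Record(kind, name, detail))

    def lookup(self, global_id):
        # .get avoids creating an empty entry for unknown IDs
        return list(self._by_id.get(global_id, []))


index = LineageIndex()
gid = "run-001"  # hypothetical ID
index.record(gid, "commit", "raw-images", "input data state")
index.record(gid, "job", "train-model", "code version abc123")
index.record(gid, "commit", "model", "output artifacts")

related = index.lookup(gid)
for r in related:
    print(r.kind, r.name, r.detail)
```

One lookup returns the input data state, the job that ran, and the resulting output together, which is the kind of single-ID view the paragraph above describes.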

Pachyderm Hub 2 also introduces a brand new storage layer that uses FileSets and chunk-based deduplication to increase file processing performance while lowering costs. FileSets are precomputed sets of 64-byte chunks of files used to rapidly assemble a filesystem to reproduce a result, including the knowledge of which files don’t need to be reprocessed. When you have to continually train a model, knowing exactly what data to skip can lead to massive improvements in speed, because retraining can skip data it has already processed and train only on new information. In fact, Pachyderm customers like LivePerson have seen up to tenfold improvements in speed.
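The "knowing what to skip" idea can be sketched in a few lines. The example below is an assumption-laden illustration, not how FileSets work internally: it remembers a SHA-256 digest per file path and reprocesses a file only when its content is new or changed. The function name `process_incrementally` and the file names are hypothetical.

```python
import hashlib


def digest(data: bytes) -> str:
    return hashlib.sha256(data).hexdigest()


def process_incrementally(files, seen, process):
    """Run `process` only on files whose content digest is new or changed.

    files: dict mapping path -> bytes
    seen:  dict mapping path -> digest from earlier runs (updated in place)
    Returns the list of paths that were actually processed.
    """
    processed = []
    for path, data in sorted(files.items()):
        h = digest(data)
        if seen.get(path) == h:
            continue  # unchanged since the last run, so skip it
        process(path, data)
        seen[path] = h
        processed.append(path)
    return processed


seen = {}
calls = []
files = {"a.csv": b"1", "b.csv": b"2"}

first = process_incrementally(files, seen, lambda p, d: calls.append(p))
files["c.csv"] = b"3"          # a new file arrives
second = process_incrementally(files, seen, lambda p, d: calls.append(p))
files["a.csv"] = b"changed"    # an existing file is modified
third = process_incrementally(files, seen, lambda p, d: calls.append(p))

print(first, second, third)
```

The first run processes everything, the second touches only the newly added file, and the third touches only the modified one.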

Beyond knowing what to skip, chunk-based deduplication allows compact storage of all data based on 64-byte segments of data from all files. A given “chunk” is stored once and only once in object storage. “Knowing what to skip” means that computation costs are saved because data isn’t reprocessed unnecessarily. “Chunk-based deduplication” means that when a file is needed for actual processing, it can be rapidly recreated from ultra-efficient storage. Both approaches ensure that costs are driven down throughout the machine learning lifecycle.
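As a rough illustration of chunk-based deduplication, the sketch below splits files into fixed-size chunks, stores each distinct chunk once under its SHA-256 digest, and rebuilds a file from its list of chunk digests. This is a simplified model under stated assumptions: the `ChunkStore` class is hypothetical, the fixed-size split is chosen for simplicity, and the 64-byte size simply mirrors the figure quoted above. It is not a description of Pachyderm's actual storage format.

```python
import hashlib

CHUNK_SIZE = 64  # bytes; mirrors the figure in the text, purely illustrative


class ChunkStore:
    """Toy content-addressed store: each distinct chunk is kept exactly once."""

    def __init__(self):
        self.chunks = {}  # digest -> chunk bytes
        self.files = {}   # path -> ordered list of chunk digests

    def put(self, path, data):
        ids = []
        for i in range(0, len(data), CHUNK_SIZE):
            chunk = data[i:i + CHUNK_SIZE]
            cid = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(cid, chunk)  # store only if not seen before
            ids.append(cid)
        self.files[path] = ids

    def get(self, path):
        # Reassemble the file from its chunks, in order
        return b"".join(self.chunks[cid] for cid in self.files[path])


store = ChunkStore()
block_a = b"A" * 64
block_b = b"B" * 64
block_c = b"C" * 64

store.put("v1.bin", block_a + block_b)
store.put("v2.bin", block_a + block_b + block_c)  # shares two chunks with v1

total_refs = sum(len(ids) for ids in store.files.values())
print(len(store.chunks), "chunks stored for", total_refs, "chunk references")
```

The second file reuses the two chunks it shares with the first, so only one additional chunk is stored, and either file can still be rebuilt in full with `store.get`.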

“Pachyderm Hub 2 is so efficient at storing data that we’re able to offer a pricing model that isn’t metered on storage,” says Joe Doliner, CEO of Pachyderm. “Users can gain complete reproducibility at scale without manually versioning their data, creating complex infrastructure and code, or overspending on processing and storage. Pachyderm Hub 2 makes reproducibility at scale easy and cost-effective.”

Fast and flexible pipelines have always been one of the platform’s strengths. With Pachyderm Hub 2, pipelines deliver up to four times faster job performance thanks to the new storage layer, and lower costs thanks to the new autoscaling settings.

We’ve also included a beta of Pachyderm Notebooks with this latest release, allowing for much faster prototyping and easier iteration. If you’re already familiar with Jupyter, you’ll love having a brand new way to interact with Pachyderm through an interface that data scientists everywhere already know.


Lastly, the new Pachyderm Console makes it easy to debug complex directed acyclic graphs (DAGs) by visualizing them and letting you troubleshoot them in a more intuitive way.

Data lineage

Try Pachyderm Hub 2 for Free at

About the Author

Joey Zwicker

Joey is Co-Founder and COO of Pachyderm. He is a multitasking wizard and leverages his technical background to be better at the enormous breadth of tasks actually on his plate.