Today we are proud to announce that Pachyderm has closed a Series A financing of $10M led by Benchmark, with participation from existing investors Ace & Co, Blumberg Capital, Data Collective, Foundation Capital, Susa Ventures, Tuesday Capital, and Y Combinator. As part of the funding, Benchmark’s Chetan Puttagunta will be joining our board. We’re also announcing our next major version of Pachyderm, v1.8. We founded Pachyderm over four years ago with the understanding that data science would be critical to businesses over the next decade, and that Docker® containers were the perfect technology with which to build the next generation of tools. Those beliefs continue to strengthen with each day.
Pachyderm allows data scientists to use the tools they already know and love on unified infrastructure while maintaining complete data lineage and provenance. We believe it’s the future of how individuals and teams develop, deploy, collaborate on and communicate about all things data.
Putting The “Science” Back Into Data Science
Data science can be an incredibly powerful driver of business value, but in the “data science gold rush” under way at many organizations, basic best practices devolve into a free-for-all. As data is transformed and fed into one algorithm after another, important scientific foundations such as reproducibility are forgotten.
Pachyderm was built to solve several common data science challenges and provide organizations with a scalable, enterprise-grade solution where everything is 100% reproducible, debuggable, and easily understood. Our approach is what makes Pachyderm unique, and it stems from three primary principles:
1. Manage Data The Same Way We Manage Code
Over the last decade, software developers have built a rich stack of tools for managing and collaborating on a shared codebase: version control systems, CI/CD, testing frameworks, and best practices for moving new code through the dev → staging → production lifecycle.
Data scientists and data engineers need to deal with both code and data, but lack the tools to work with both on equal terms. Code can be forked, tested, tracked, diffed, and collaborated on. Data should be no different.
Pachyderm is the first technology to offer petabyte-scale version control for data. Pachyderm serves as a cornerstone for data management comparable to what Git has been for code. Every change, output, and result is seamlessly tracked, diffed, versioned and immediately referenceable. With Pachyderm, your entire data science or machine learning pipeline operates with the same production practices as code.
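To make the idea of tracking and diffing data concrete, here is a minimal Python sketch of content-addressed snapshots. It is an illustration only and not how Pachyderm is implemented: the `DataRepo` class, its `commit` and `diff` methods, and the sample file names are all hypothetical, and everything lives in memory.

```python
import hashlib
import json


class DataRepo:
    """Toy in-memory repository (hypothetical) that versions files by content hash."""

    def __init__(self):
        # commit id -> {"parent": commit id or None, "files": {path: content hash}}
        self.commits = {}
        self.head = None

    def commit(self, files):
        """Record a snapshot of `files` (path -> bytes) and return its commit id."""
        snapshot = {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}
        # The commit id is derived from the parent and the file hashes,
        # so identical history always yields the same id.
        payload = json.dumps({"parent": self.head, "files": snapshot}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()
        self.commits[commit_id] = {"parent": self.head, "files": snapshot}
        self.head = commit_id
        return commit_id

    def diff(self, old_id, new_id):
        """List the paths added, removed, or changed between two commits."""
        old = self.commits[old_id]["files"]
        new = self.commits[new_id]["files"]
        return {
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
        }


repo = DataRepo()
v1 = repo.commit({"sales.csv": b"id,amount\n1,10\n", "users.csv": b"id\n1\n"})
v2 = repo.commit({"sales.csv": b"id,amount\n1,10\n2,25\n", "regions.csv": b"id\n7\n"})
changes = repo.diff(v1, v2)
print(changes)
# {'added': ['regions.csv'], 'removed': ['users.csv'], 'changed': ['sales.csv']}
```

Because each commit stores only hashes and a pointer to its parent, any earlier state of the data stays referenceable by its id, which is the property Git provides for code.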
“Pachyderm helps us convert our existing data science pipelines from manually managed scripts to scalable, repeatable end-to-end workflows. They enable us to focus more on developing transformative technology to drive agriculture forward instead of wrangling infrastructure.”
- Mauricio Borgen, AgBiome’s Director of IT & Scientific Compute.
Read More about AgBiome’s use case here
2. Data Provenance & Data Lineage Are A Must
One of the trickiest parts of data science is that if you change anything, you change everything. A data science workflow is a multi-stage pipeline of data cleaning, feature extraction, transformations, and visualization. Without insight into the complete data lineage for the entire pipeline, getting concrete and repeatable results can be nearly impossible.
Data lineage not only helps you understand the changes within a data set, but also captures the relationships and dependencies between data sets and results. Pachyderm offers a complete audit trail from any conclusion through each piece of code and data that produced that result.
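As a rough illustration of what such an audit trail involves, the Python sketch below records, for each result, the code version and the inputs that produced it, then walks those links backwards. It is a simplified model under stated assumptions, not Pachyderm's API: the `Artifact` class, the `trace` function, and the dataset and script names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Artifact:
    """One versioned dataset or result (hypothetical model)."""
    name: str                      # e.g. "features@v1"
    code: Optional[str] = None     # code version that produced it; None for raw inputs
    inputs: Tuple[str, ...] = ()   # names of the artifacts it was computed from


def trace(name, catalog):
    """Return (artifact, code) pairs for `name` and everything upstream, nearest first."""
    trail, seen, queue = [], set(), [name]
    while queue:
        current = queue.pop(0)
        if current in seen:
            continue  # an artifact shared by several stages is reported once
        seen.add(current)
        artifact = catalog[current]
        trail.append((artifact.name, artifact.code))
        queue.extend(artifact.inputs)
    return trail


catalog = {a.name: a for a in [
    Artifact("raw@v1"),
    Artifact("clean@v1", code="clean.py@abc123", inputs=("raw@v1",)),
    Artifact("features@v1", code="features.py@def456", inputs=("clean@v1",)),
    Artifact("report@v1", code="plot.py@789aaa", inputs=("features@v1", "clean@v1")),
]}

audit_trail = trace("report@v1", catalog)
for artifact_name, code_version in audit_trail:
    print(artifact_name, "<-", code_version)
```

Starting from `report@v1`, the walk visits the report, the features, the cleaned data, and finally the raw data, which mirrors the "if you change anything, you change everything" dependency chain: a new version of any upstream artifact would invalidate every result that traces back to it.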
“Our goal was to have a verifiable history of calculations including code versions, model parameters and input and output data, in order to ensure traceability, accountability and reproducibility. We use Pachyderm to run, compose and trace the execution of these calculations to reproduce the results and fully trace the chain of data provenance. It’s saved us a lot of time from having to implement our own traceability platform.”
- Anna Magdalena Kosek, Scientist and Software Integrator at The Netherlands Organization for Applied Scientific Research.
3. Data Scientists Should Be Empowered By Infrastructure
After working with countless data science teams, we have found that the most common misstep companies make is to hire several data scientists and provide them access to data, but not the infrastructure to actually deliver business value and get their models to production.
Data scientists need to be able to easily transition from their local development setup into staging and production environments. Often, the most common and useful data science tools such as Jupyter, RShiny, and Pandas aren’t supported in production infrastructures like Hadoop. Pachyderm uses Docker® containers and Kubernetes to give data scientists a safe way to adopt new technologies across different cloud or on-prem infrastructures. Containers also make an analysis easily portable and shareable, so others can build additional pipelines on top of existing ones.
The Next Phase For Pachyderm
We’re incredibly proud of what we’ve built to date, but we’re just getting started. To accelerate the adoption of data science and ML processes in the enterprise, our team continues to push the boundaries of the Pachyderm platform.
To that end, Pachyderm is excited to publicly announce our ongoing collaboration with BCG GAMMA to incorporate Pachyderm’s OSS product as a foundational piece of infrastructure for their newly released Source.ai machine learning platform. We’re thrilled to work with such an innovative organization to push the boundaries of data science within the enterprise.
With our enterprise partners, our latest round of funding, and Pachyderm v1.8 released, we will continue to strengthen our core open source technology, enhance our enterprise product, and pursue new and even more ambitious initiatives.
Chief among them is the introduction of a Pachyderm hub, similar to what GitHub is for Git. We want to give users a way to use Pachyderm without having to deploy their own infrastructure. Our long-term vision has always been to enable open collaboration on data science, as GitHub did for open source software development, and we look forward to making that vision a reality.
If you want to help us fulfill this vision and be at the forefront of distributed data science, join our team.