4 Reasons to Get Excited About Pachyderm 2019

2018 was an exciting year for Pachyderm. Growth in the project, the user base, and the company surpassed our expectations. To give you an idea, Pachyderm’s codebase saw more than 1,800 commits, 1.5 million lines changed, and 24 new external contributors. We saw a spike in community-generated content as well. Key examples include an excellent, in-depth blog post written by Dr. Samantha Zeitlin, the lead data scientist at Denali publishing, and a great talk on modern data science delivered at Go Northwest by Sam Kreter, a software development engineer at Microsoft.

Furthermore, more than one thousand new users joined our support channels to ask questions and offer advice to others. It’s very rewarding for us to see the momentum build around Pachyderm, and based on what we observed in 2018, we seem well on our way to building a sustainable open source community.

Community momentum also fueled Pachyderm’s commercial growth in 2018. While helping numerous customers reshape data science within their organizations, Pachyderm released two major versions of the product, created a few case studies, and welcomed five new team members. We even managed to raise some money. Needless to say, we’re thrilled. But that was sooo last year…

Here’s a quick look at what you can expect from the Pachyderm project this year:

1. Expanded Pipelines and Native Kubernetes

Pachyderm pipelines will be extended to support even more complex workflows and heavier workloads. Along the way, we will tighten our integration with Kubernetes and the surrounding ecosystem:

  • Pachyderm pipelines will be implemented as K8s CRDs, essentially making Pachyderm pipelines behave like any other native Kubernetes object.
  • Better integration with Kubeflow and its components (TFJobs, PyTorch training, etc.).
  • Support for more sophisticated data flows such as loops for iterative training.
  • Templating of pipelines.
  • One-off pipeline jobs.
  • Greater control over which pipeline steps get persisted for faster experimentation and more granular storage control for versioned data.

You can find more information here: https://github.com/pachyderm/pachyderm/issues/3345

2. New Ways to Access and Manage Data

This year we want to expand the ways you can move data into and out of Pachyderm:

  • The Pachyderm filesystem (PFS) will be accessible directly through an S3-compatible API, making versioned data in PFS just as accessible as if it were in an S3 bucket, but still with the powerful versioning semantics of Pachyderm.
  • Kubernetes CSI (Container Storage Interface) support to mount PFS data into any K8s pod. This one is already underway, and you can follow along here: https://github.com/pachyderm/pachyderm/pull/3432

3. Improved Storage Engine

Performance is paramount when it comes to applied data science, full stop. In 2018, we made drastic performance improvements to Pachyderm, and 2019 will be no different. Users will continue to see performance improvements for nearly every type of workload throughout the year.

4. An Entirely New Way to Interact with Pachyderm

This one we’re keeping a bit close to the chest, as it’s too early to provide details. What we will say is that it’s a natural continuation of our mission to enable reproducible data science and facilitate collaboration, but on a global scale. We plan to do that by making Pachyderm more accessible and eliminating the infrastructure obstacles that stand in the way of real-world ML/AI.