Pachyderm 2.5 is generally available, and this release focuses on accelerating collaboration. We’re making it easier to build sophisticated data pipelines and share Directed Acyclic Graphs (DAGs). This is the first release of 2023 and marks the platform’s debut as part of Hewlett Packard Enterprise, underscoring Pachyderm’s unwavering commitment to scaling machine learning workloads.
Reduce barriers to collaboration with Projects
When teams build data pipelines, they often need the freedom to experiment and explore without affecting the work of others. Isolation is crucial so that each team can work independently and securely.
Introducing Projects – a logical grouping of repositories and pipeline steps that makes sharing and collaboration effortless. Now, individuals or teams can access and view specific repos and pipeline steps, creating, in effect, modular building blocks. Projects also offer greater naming flexibility: repository and pipeline names only need to be unique within a Project rather than across the entire cluster, providing a simpler way to manage and share your work.
More importantly, Projects enable better isolation and organization of data and development workflows. Previously, every team could view and access every DAG’s pipelines and repos. Now, a team can organize and create its own DAGs and experiment freely without fear of impacting other teams, with full control over who can access and edit each project. With Projects, multiple teams have the option to work independently of each other and collaborate more effectively.
Projects are fully supported in Pachctl and our python-pachyderm client. Users can also interact with Project-scoped repos and pipelines from the JupyterLab extension and the S3 Gateway, and can view and access projects from the Console web UI.
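As a quick sketch of the Pachctl side of this workflow (the project and repo names below are placeholders, not part of the release):

```shell
# Create a project and make it the active context for this session.
pachctl create project video-analysis
pachctl config update context --project video-analysis

# Repos and pipelines created now are scoped to the project, so the
# name "raw-frames" only has to be unique within video-analysis,
# not across the whole cluster.
pachctl create repo raw-frames
pachctl list repo
```

Setting the project on the context means subsequent commands operate inside that project by default, which keeps day-to-day commands short.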
Increasing Developer Productivity
Pachyderm 2.5 also introduces a number of enhancements to boost developer productivity, simplifying the creation and debugging of intricate data or ML pipelines. Four improvements worth calling out are Kubernetes events in Pachyderm, datum debugging in Console, visualization of complex DAGs, and Pachctl logging.
Access Kubernetes Events in Pachyderm
Because Pachyderm is built on Kubernetes, it relies on Kubernetes to orchestrate its containers and pods. However, pipeline failures caused by underlying infrastructure issues can be challenging to diagnose quickly. Starting with Pachyderm 2.5, users can access Kubernetes pod events (informational, warnings, and errors) directly from Pachctl. This expedites debugging by eliminating the need to scroll through logs and the disjointed experience of jumping between two different platforms: Kubernetes and Pachyderm. See the documentation for the command syntax.
Improved Datum Debugging in Console
Console highlights the high-level status of datums (success, skipped, failed, recovered). When a job fails, you need to zero in on the failed datums with as few clicks as possible; on large jobs, scrolling through logs to surface them can take hours. In release 2.5, users can navigate directly to any datum by clicking on the pipeline step, applying the filter in the right-side panel, and then drilling down into a specific datum from the filtered list. Users can easily and intuitively diagnose failing jobs and identify the root cause, resulting in faster time to resolution and a better understanding of processing performance.
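The same drill-down is available from the CLI. This sketch assumes a pipeline named `edges`; the job ID is a placeholder, and `<datum-id>` stands for an ID copied from the list output:

```shell
# List the datums for a specific job; the state column shows which failed.
pachctl list datum edges@5ba8e7ae5ff4442ab2a95b1c45bba6d5

# Inspect one datum, then pull the logs for just that datum.
pachctl inspect datum edges@5ba8e7ae5ff4442ab2a95b1c45bba6d5 <datum-id>
pachctl logs --pipeline=edges --datum=<datum-id>
```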
Improved Visualization of Complex DAGs
Console provides a great way for users to visualize DAGs. New updates to Console allow for highly performant rendering of DAGs encompassing hundreds of pipeline steps and repos: users can zoom in and out quickly without waiting for the system to re-render the entire DAG.
Audit Log of Pachctl Commands
Every command executed from a Pachyderm client (such as Pachctl or the python-pachyderm API) is now logged, along with the time it was executed and, if auth is enabled, the ID of the user who executed it. This creates a minimal but valuable audit trail, which can be extremely useful when, for example, a user or colleague is troubleshooting an issue and needs to verify what actions were performed in the past. Learn how to view the audit logs in our documentation.
Dynamically Scale Storage Tasks with ‘pachw’ Workers
New in release 2.5, users can create a pool of ‘pachw’ workers to dynamically scale storage tasks and specify where those tasks run. The net benefits are faster data ingestion and reduced costs.
Faster Data Ingestion
A common requirement for Pachyderm users is ingesting large quantities of data into Pachyderm for processing. Often that data sits in an S3 bucket, an internal object store, or similar, and ingesting it can be extremely time-consuming. Release 2.5 automatically breaks any request to load data into many smaller requests that are executed in parallel across multiple worker pods. This approach yields significantly faster performance and can be scaled horizontally to accommodate even the largest ingestion scenarios. For example, a storage operation that took 4 hours on a single worker can complete in about 1 hour across 4 pods, a 75% reduction in wall-clock time, assuming near-linear scaling.
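The arithmetic behind that example is simple fan-out. The numbers below are the hypothetical ones from the paragraph above, assuming near-linear scaling and negligible coordination overhead:

```shell
# A 4-hour single-worker ingest, split evenly across 4 worker pods.
baseline_min=$((4 * 60))                # 240 minutes serially
pods=4
parallel_min=$((baseline_min / pods))   # 60 minutes in parallel
saved_pct=$(( (baseline_min - parallel_min) * 100 / baseline_min ))
echo "${parallel_min} minutes of wall-clock time, a ${saved_pct}% reduction"
```

Real workloads won’t divide perfectly evenly, so treat this as an upper bound on the speedup rather than a guarantee.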
Cost Savings by Optimizing GPU Resources
Many Pachyderm users have large-scale model training pipelines that require the heavy-duty processing power of GPUs. Because running on GPU nodes is significantly more expensive than running on CPU-only nodes, it is advantageous to be able to shut these nodes down as soon as possible after executing the pipeline’s user code. Pachyderm’s workers, however, perform several optimization tasks against a pipeline’s output after the user code is complete; the challenge in the past has been that these tasks don’t require GPUs, but will be run there nevertheless because that’s where the pipeline’s workers are.
Release 2.5 can now stop a pipeline’s workers as soon as they finish executing the user code. The output optimization tasks are instead performed by separate, pipeline-agnostic workers that can run on CPU-only nodes, freeing up expensive GPU resources more quickly and saving money. These ‘pachw’ workers are also auto-scaled according to the number of tasks to be performed, further minimizing overall resource utilization.
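As a rough sketch, the size of the ‘pachw’ pool is tunable at deploy time. The Helm value names below are assumptions based on the 2.5 chart’s pachw section; verify them against your chart version before applying:

```shell
# Bound the pachw worker pool so storage tasks auto-scale between
# zero and eight CPU-side replicas (key names are assumptions; check
# your chart's values.yaml before running).
helm upgrade pachyderm pachyderm/pachyderm --reuse-values \
  --set pachw.minReplicas=0 \
  --set pachw.maxReplicas=8
```

Setting the minimum to zero lets the pool scale to nothing when no storage tasks are pending, which is where most of the cost savings come from.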
For data and ML engineers working with large and complex datasets, Pachyderm Enterprise is a flexible data management solution that automates and quickly scales data pipelines. Our 2.5 release brings a host of new features and improvements to Pachyderm, making it easier than ever to collaborate and build data pipelines. For more details, please read the documentation.
Are you new to Pachyderm and searching for a data pipelining or MLOps solution? Take the next step and schedule a demo tailored to your environment.