Introducing Pachyderm 2.6

Jimmy Whitaker

May 17, 2023

Apache 2.0, Datum Batching, Squash Commits and More!

We’re thrilled to announce the release of Pachyderm 2.6!

A few months ago, we shared the news of Pachyderm’s acquisition by HPE. The 2.6 release marks the first step in integrating HPE branding into the enterprise version of our product. Pachyderm Enterprise will be referred to as HPE ML Data Management. You’ll begin to notice these changes across our enterprise documentation and within the enterprise version of Console. Go to HPE to learn about the ML Data Management product.

However, these changes don’t mean that the open-source version is disappearing! In fact, we’re excited to reveal that as of 2.6, Pachyderm is now licensed under the Apache 2.0 License, making it even more accessible for the open-source community. This update also brings a host of new features and improvements, such as datum batching, squash commits, enhanced Role-Based Access Control (RBAC), and much more. Keep reading to discover how these features can help you streamline your data management and collaboration processes.

Datum Batching

Pachyderm 2.6 brings datum batching, a performance optimization that lets a single run of your user code process multiple datums sequentially. This improves performance when handling a large number of small datums or slow-to-start user code. To enable datum batching, add the following to the transform section of your pipeline spec:

datum_batching: true

In your pipeline code, you can use this feature by requesting a new batch directly from your job. Here’s a simple example in Bash. Full Python support is coming soon!

#!/bin/bash
transformation() {
  # Your transformation code goes here
  echo "Transformation function executed"
}

echo "Starting while loop"
while true; do
  # Request the next datum before running the transformation on it
  pachctl next datum
  echo "Next datum called"
  transformation
done

For more information on datum batching and setting up your pipeline, refer to the detailed examples and documentation in the release notes.
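
As a slightly fuller sketch of the loop above, the script below processes every file in the current datum and writes one result per file. The input path /pfs/data (a hypothetical input repo named data), the process_datum and run_batch_loop helpers, and the line-counting transformation are all illustrative assumptions. It also assumes that pachctl next datum exits with a non-zero status on failure; check the release notes for the exact contract.

```shell
#!/bin/sh
# Sketch only: paths and the transformation are assumptions for illustration.
INPUT_DIR="${INPUT_DIR:-/pfs/data}"   # hypothetical input repo mounted at /pfs/data
OUTPUT_DIR="${OUTPUT_DIR:-/pfs/out}"  # Pachyderm's conventional output directory

# Count the lines of every regular file in the current datum and
# write each count to <name>.count in the output directory.
process_datum() {
  in_dir="$1"
  out_dir="$2"
  for f in "$in_dir"/*; do
    [ -f "$f" ] || continue
    wc -l < "$f" | tr -d ' ' > "$out_dir/$(basename "$f").count"
  done
}

# Ask for the next datum, then process it, until the request fails.
run_batch_loop() {
  while pachctl next datum; do
    process_datum "$INPUT_DIR" "$OUTPUT_DIR"
  done
}

# Start the loop only where pachctl exists (for example, inside a pipeline worker).
if command -v pachctl >/dev/null 2>&1; then
  run_batch_loop
else
  echo "pachctl not found; skipping the batch loop" >&2
fi
```

Keeping the per-datum work in its own function makes it possible to exercise the transformation against a local directory before deploying the pipeline.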

Squash Non-head Commits

Pachyderm 2.6 introduces the squash commit command. This feature is inspired by the squash option in git rebase and can help clean up and simplify your commit history before sharing your work with team members. Squash commit allows you to combine all file changes in the commits of a global commit into their children and remove the global commit.

To squash a commit, use the following command: pachctl squash commit <commit-ID>. Please note there are some limitations and considerations when using the squash commit feature. For more details, refer to the examples and documentation in the release notes.
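
Because squashing removes a commit from history, it can help to wrap the command in a small guard. The sketch below is one possible approach, not part of Pachyderm: the squash_commit and is_commit_id helpers, the DRY_RUN switch, and the assumption that a commit ID is 32 lowercase hexadecimal characters are all illustrative.

```shell
#!/bin/sh
# Sketch only: the 32-hex-character ID check and the DRY_RUN switch are assumptions.

# Succeed when the argument looks like a 32-character hexadecimal ID.
is_commit_id() {
  printf '%s' "$1" | grep -Eq '^[0-9a-f]{32}$'
}

# Print the squash command by default; run it only when DRY_RUN=0.
squash_commit() {
  commit_id="$1"
  if ! is_commit_id "$commit_id"; then
    echo "not a commit ID: $commit_id" >&2
    return 1
  fi
  if [ "${DRY_RUN:-1}" = "0" ]; then
    pachctl squash commit "$commit_id"
  else
    echo "pachctl squash commit $commit_id"
  fi
}
```

Calling squash_commit with an ID prints the command it would run; setting DRY_RUN=0 in the environment executes it.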

Mandatory Proxy

Pachyderm 2.6 introduces the Mandatory Proxy feature, which routes all traffic through a single port that is safe to expose to the internet. This improves the security and manageability of your Pachyderm deployment. The Mandatory Proxy uses the Envoy proxy and offers customizable configuration options such as replicas, image, resources, labels, annotations, and service settings.

One significant advantage of the Mandatory Proxy is that grpc:// and grpcs:// prefixes are no longer needed. Everything can now be accessed through http://pachyderm.your-site.com or https://pachyderm.your-site.com. This simplification also applies when connecting via the JupyterLab Mount extension and when using the new datum batching capability.

Each replica can handle up to 50,000 concurrent connections, with an affinity rule preferring to schedule proxy pods on the same node as pachd. The service configuration allows you to set the type of service (ClusterIP, NodePort, or LoadBalancer), load balancer IP, and the ports for serving plain HTTP and HTTPS traffic.
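
As a rough sizing aid, the per-replica limit above turns into a simple ceiling division. The replicas_needed helper below is a hypothetical sketch that assumes connections spread evenly across replicas and leaves no headroom, so treat the result as a lower bound.

```shell
#!/bin/sh
# Per-replica connection limit quoted in this post.
CONNECTIONS_PER_REPLICA=50000

# Print the smallest replica count whose combined limit covers $1 connections
# (integer ceiling division, with a minimum of one replica).
replicas_needed() {
  expected="$1"
  count=$(( (expected + CONNECTIONS_PER_REPLICA - 1) / CONNECTIONS_PER_REPLICA ))
  [ "$count" -lt 1 ] && count=1
  echo "$count"
}

replicas_needed 120000   # prints 3
```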

For HTTPS traffic, enable TLS by providing a secret containing “tls.key” and “tls.crt” keys. Note that this option is incompatible with legacy ports.
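
To make the options above concrete, here is a sketch that writes a Helm values fragment for the proxy. The key names (host, httpPort, httpsPort, secretName, and so on) are assumptions modeled on the options named in this post, and the write_proxy_values helper, the replica count, and the ports are illustrative; verify every key against the values.yaml of the chart version you deploy.

```shell
#!/bin/sh
# Sketch only: the key names below are assumptions modeled on the options
# named in this post; check them against your chart's values.yaml.

# Write a proxy values fragment to the file given as $1.
# $2 = public hostname, $3 = name of the TLS secret (both hypothetical).
write_proxy_values() {
  cat > "$1" <<EOF
proxy:
  enabled: true
  host: $2
  replicas: 2
  service:
    type: LoadBalancer
    httpPort: 80
    httpsPort: 443
  tls:
    enabled: true
    secretName: $3
EOF
}
```

For example, write_proxy_values proxy-values.yaml pachyderm.your-site.com pachyderm-tls writes the fragment to proxy-values.yaml. A secret with the expected tls.key and tls.crt keys can be created with kubectl create secret tls pachyderm-tls --cert=tls.crt --key=tls.key, where pachyderm-tls is a placeholder name.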

New in RBAC: Project Roles

In Pachyderm 2.6, we’ve updated the Projects RBAC with new roles, offering more fine-grained access control. All users now have the PROJECT_LIST_REPO and PROJECT_CREATE_REPO permissions by default. To view your access level, run the command pachctl list project and check the ACCESS_LEVEL column.

The new project roles include:

ProjectViewerRole:
1. Permission: PROJECT_LIST_REPO
ProjectWriterRole:
1. Inherits all permissions from ProjectViewerRole
2. Permission: PROJECT_CREATE_REPO
ProjectOwnerRole:
1. Permissions: PROJECT_DELETE, PROJECT_MODIFY_BINDINGS
ProjectCreatorRole:
1. Permission: PROJECT_CREATE
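
The role list above can be modeled as a small lookup, which is handy for reasoning about who can do what. The role_permissions and role_allows helpers below are hypothetical and mirror only the permissions listed in this post; they do not query a Pachyderm cluster.

```shell
#!/bin/sh
# Print the permissions this post lists for a project role, one per line.
role_permissions() {
  case "$1" in
    ProjectViewerRole)
      echo PROJECT_LIST_REPO ;;
    ProjectWriterRole)
      role_permissions ProjectViewerRole   # inherits the viewer permissions
      echo PROJECT_CREATE_REPO ;;
    ProjectOwnerRole)
      echo PROJECT_DELETE
      echo PROJECT_MODIFY_BINDINGS ;;
    ProjectCreatorRole)
      echo PROJECT_CREATE ;;
    *)
      echo "unknown role: $1" >&2
      return 1 ;;
  esac
}

# Succeed when role $1 includes permission $2.
role_allows() {
  role_permissions "$1" | grep -qx "$2"
}
```

For instance, role_allows ProjectWriterRole PROJECT_LIST_REPO succeeds because the writer role inherits the viewer permission, while role_allows ProjectViewerRole PROJECT_CREATE_REPO fails.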

Console Updates

At Pachyderm, we continually strive to improve the user experience, and with version 2.6, we have made several enhancements to the Console UI. These updates aim to make it even more user-friendly and efficient for managing and understanding your data pipelines and repositories. You’ll also see these improvements in the ML Data Management product.

View previous versions of a specific file.

One of the key improvements in the Console UI is the revamped file browser. We have made it more intuitive and easier to navigate, allowing you to quickly browse through your files and efficiently manage your data. The enhanced file browser provides better visibility into your repository structure, making it simpler to locate and access the files you need.

Additionally, we have introduced detailed information for the jobs view in the Console UI, giving you valuable insights into the performance of your pipelines. This added visibility enables you to optimize your data processing and ensure the smooth operation of your Pachyderm cluster.

JupyterLab Pipeline Extension (Alpha)

The Pachyderm JupyterLab Pipeline Extension allows you to push notebook code directly to a pipeline.

A year ago, we introduced the JupyterLab Pachyderm Mount Extension, which significantly improved the pipeline development experience for Pachyderm users by allowing them to access their Pachyderm repositories directly within JupyterLab. Building on the success of the Mount Extension, we are excited to announce the introduction of the PPS Extension in Pachyderm 2.6, taking the data science workflow to a whole new level.

The PPS Extension enables you to create pipelines directly from the Jupyter notebooks where you are exploring and analyzing your data. By integrating pipeline creation with your Jupyter Notebooks, you can seamlessly transition from data exploration to pipeline deployment, streamlining your workflow and enhancing productivity.

Although the PPS Extension is currently in its early stages, we are excited to share it with our users. One of the main benefits of the PPS Extension is the ability to perfect your code using local data that’s been laid out in the same way as it would be inside a Pachyderm pipeline. This allows for a more accurate development environment and smoother transition to deploying your pipeline.

As the PPS Extension continues to evolve, we look forward to refining its capabilities and user experience. We encourage you to try it out and share your feedback, as your input is invaluable in helping us improve this feature and make it an essential part of your data science workflow.

Docs Information Architecture

We’ve reorganized our documentation site’s information architecture around categories based on a user’s intention, ordered in a natural progression that reflects how they might learn about and experience Pachyderm. That progression looks like this:

Getting Started > Set Up > Manage > Prepare Data > Build Pipelines & DAGs > Export Data […]

Transitioning from a content-type based organization system (How Tos, Reference, Concepts) to one that prioritizes user intention should add clarity, improve discoverability, and make the documentation more accessible to users with varying levels of familiarity with Pachyderm.

Similar changes are also reflected in the HPE ML Data Management documentation.

Conclusion

We hope these new features and improvements in Pachyderm 2.6 will help you manage your data more efficiently and collaborate more effectively. For more details on any of these items and other improvements, check our release notes. As always, we appreciate your feedback and look forward to hearing about your experiences with Pachyderm 2.6!

Are you new to Pachyderm and seeking a data pipelining or data versioning solution? Take the next step and schedule a demo tailored to you, and discover how Pachyderm can revolutionize your data management and collaboration experience.
