Everything you need to know to get started with the newest version of Pachyderm
The Pachyderm 2 Release Candidate is now available for installation on Enterprise and Community Edition clusters. Pachyderm 2 is our biggest release ever. It increases performance, reduces resource consumption, allows for uniform and secure cluster management policies, and extends Pachyderm’s lead as the data foundation for machine learning.
Pachyderm 2 has many new features that provide those benefits, but the most important for you, as a person running existing Pachyderm 1 clusters, are the new storage layer, the enterprise management options, and the changes to default Pachyderm 1 behaviors. Migrating your existing Pachyderm 1 cluster to Pachyderm 2 will take planning and testing. Running your pipelines and maintaining your clusters will also require learning some new commands and concepts. Enterprise customers should contact their customer engineering representative with any questions or requests for assistance, while Community Edition users should use the Pachyderm Slack channel.
New Storage Architecture and FileSets
The most important difference between Pachyderm 2 and Pachyderm 1 is the storage architecture. Files are broken into 64-byte chunks and the chunks are deduplicated across all files. Commits rely on precomputed FileSets that make skipping datums much faster and more efficient.
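To make the deduplication idea concrete, here is a minimal sketch of a content-addressed chunk store. It is not Pachyderm's implementation: the fixed-size splitting, the SHA-256 hashing, and the names put_file, get_file, and CHUNK_SIZE are all assumptions made for illustration.

```python
import hashlib

# Illustrative fixed chunk size; how Pachyderm actually chooses chunk
# boundaries is not described here.
CHUNK_SIZE = 64


def put_file(store, data):
    """Split data into chunks, keep each unique chunk once, return the chunk hashes."""
    refs = []
    for start in range(0, len(data), CHUNK_SIZE):
        chunk = data[start:start + CHUNK_SIZE]
        digest = hashlib.sha256(chunk).hexdigest()
        store.setdefault(digest, chunk)  # a chunk already in the store is not stored again
        refs.append(digest)
    return refs


def get_file(store, refs):
    """Reassemble a file from its list of chunk hashes."""
    return b"".join(store[digest] for digest in refs)


store = {}
file_a = b"a" * 64 + b"b" * 64
file_b = b"a" * 64 + b"c" * 64  # shares its first chunk with file_a
refs_a = put_file(store, file_a)
refs_b = put_file(store, file_b)
print(len(store))  # 3 unique chunks stored for 4 chunks written
```

Because the two files share their first chunk, the store holds three chunks rather than four, and each file is still fully recoverable from its list of hashes.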
To take advantage of this new architecture, your object storage needs to be migrated from the Pachyderm 1 format to the Pachyderm 2 format. This is accomplished using a migration tool that will non-destructively migrate your object storage from your Pachyderm 1 bucket into a Pachyderm 2 bucket for use by your Pachyderm 2 cluster.
Pachyderm Enterprise Management
Pachyderm 2 includes new Enterprise Management options which allow for site-wide configuration of licensing, authentication, and access control, as well as single-point Pachyderm configuration synchronization. With one command, your users can now gain access to every cluster in your enterprise, with the appropriate level of access control in each cluster. Enterprise Management also allows for authentication against any OIDC provider.
If you wish to manage more than one Pachyderm cluster, this feature may require a dedicated, separate namespace in one of your Kubernetes clusters where the enterprise management server will run. The server will configure its own database resources.
Pachyderm 2 introduces a new web UI, the Pachyderm Console, that replaces the Dashboard in Pachyderm 1. Not all Dashboard functionality is supported in the initial release of Console. The Pachyderm Console’s initial release is focused on Pachyderm operations, allowing for easy access to job and log information in your pipelines.
The Pachyderm Console’s single ingress configuration is easier to set up than the Pachyderm Dashboard’s. Each Pachyderm Console requires some configuration of Enterprise Management to allow for secure access to the pachd service.
In Pachyderm 1, tracing data lineage across multiple pipelines involved using job IDs and commit IDs that varied within and across pipelines in your directed acyclic graphs (DAGs). This required commands like pachctl flush job or flush commit, given an input commit’s identifier, to trace the subvenance hierarchy down through the processing graph.
In Pachyderm 2, there is a single, Global Identifier for all the commits and jobs related to a transaction. Using the pipeline@id syntax, the pachctl inspect command allows you to resolve a Global Identifier for a commit or job to a particular pipeline. The Pachyderm Console makes extensive use of Global Identifiers to make debugging and operations easier than ever.
In Pachyderm 1, the embedded pachctl deploy command was used to deploy, configure, and update Pachyderm components in Kubernetes, with customization via command-line flags or editing of the generated manifests.
In Pachyderm 2, the pachctl deploy command is replaced by a fully-supported Helm 3 chart with a deep level of customization.
Changes to Pachyderm 1 Behavior
Empty directories will not be present in repos
In Pachyderm 2, directories are implied from the paths of the files. They are no longer explicit objects in the file system. A side effect of this is that empty directories will not be created in input or output repos. Any pipeline code you may have that relies on checks for an empty directory will need to be modified.
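As a rough illustration of what "implied" means, the sketch below derives the set of directories from a list of file paths, and shows one way pipeline code could treat a missing directory the same as an empty one. Both function names are hypothetical, and this is only a model of the behavior described above, not Pachyderm code.

```python
import os
import posixpath


def implied_directories(file_paths):
    """Return every directory that exists only because some file path passes through it."""
    dirs = set()
    for path in file_paths:
        parent = posixpath.dirname(path)
        while parent not in ("", "/"):
            dirs.add(parent)
            parent = posixpath.dirname(parent)
    return dirs


def is_empty_or_missing(path):
    """Treat a directory that does not exist the same as one with no entries."""
    return not os.path.isdir(path) or not os.listdir(path)


files = ["/data/2021/a.csv", "/data/2021/b.csv", "/models/m.bin"]
print(sorted(implied_directories(files)))  # ['/data', '/data/2021', '/models']
```

A directory with no files under it never appears in the result, which is why a check such as "does this directory exist and is it empty" may need to become "is this directory empty or missing".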
Default upload behavior changes from append to overwrite
pachctl put file and associated APIs now overwrite files by default.
You will need to modify any code that depends on the Pachyderm 1 default of “append” to explicitly append to files. Similarly, any scripts you may have which depend on pachctl put file defaulting to “append” will need to be modified to include the -a flag for Pachyderm 2. Finally, any scripts you may have which use pachctl put file -o to force overwrites on Pachyderm 1 will need to be modified, as the -o flag is no longer supported.
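One way to find affected scripts is to scan them for pachctl put file invocations. The helper below is a hypothetical sketch that relies only on the -a and -o flags mentioned above; it checks for those flags as standalone tokens and will miss flags that are combined with others or built up in shell variables.

```python
import shlex


def audit_put_file(line):
    """Return a migration hint for a script line, or None if nothing stands out."""
    try:
        tokens = shlex.split(line, comments=True)
    except ValueError:
        return None  # unbalanced quotes; leave the line for manual review
    for i in range(len(tokens) - 2):
        if tokens[i:i + 3] == ["pachctl", "put", "file"]:
            args = tokens[i + 3:]
            if "-o" in args:
                return "remove -o: overwrite is now the default and the flag is unsupported"
            if "-a" not in args:
                return "add -a if this upload relied on the old append default"
            return None
    return None


for line in [
    "pachctl put file myrepo@master:/log.txt -f log.txt",
    "pachctl put file myrepo@master:/log.txt -o -f log.txt",
    "pachctl put file myrepo@master:/log.txt -a -f log.txt",
]:
    print(line, "->", audit_put_file(line))
```

The hints are prompts for review rather than automatic fixes, since only you know whether a given upload actually depended on appending.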
Full paths in repos must be specified when uploading
In Pachyderm 1, the command pachctl put file myrepo@master -r -f ./mydir/ would result in myrepo containing a mydir directory at the top, with all of the files from that directory placed underneath.
In Pachyderm 2, this same command results in the files being placed at the root of the repo. To achieve the same results as in Pachyderm 1, you must specify any such commands as pachctl put file myrepo@master:/mydir/ -r -f ./mydir/ instead, explicitly listing the path within the repo.
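The difference can be modeled with two small path-mapping functions. This is only a sketch of the behavior described above, with hypothetical function names; it does not call Pachyderm.

```python
import posixpath


def repo_path_v1(local_root, local_file):
    """Pachyderm 1 model: the uploaded directory's own name becomes the top-level directory."""
    rel = posixpath.relpath(local_file, local_root)
    top = posixpath.basename(posixpath.normpath(local_root))
    return posixpath.join("/", top, rel)


def repo_path_v2(local_root, local_file, dest="/"):
    """Pachyderm 2 model: files land under the destination path given in the command."""
    rel = posixpath.relpath(local_file, local_root)
    return posixpath.join(dest, rel)


print(repo_path_v1("./mydir/", "./mydir/a.txt"))             # /mydir/a.txt
print(repo_path_v2("./mydir/", "./mydir/a.txt"))             # /a.txt
print(repo_path_v2("./mydir/", "./mydir/a.txt", "/mydir/"))  # /mydir/a.txt
```

The third call corresponds to writing myrepo@master:/mydir/ explicitly, which restores the Pachyderm 1 layout.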
Automatic file splitting no longer supported
In Pachyderm 1, record-based files could be automatically split into records using a pachctl flag or appropriate APIs.
Pachyderm 2 no longer supports the flags or APIs to split a file into multiple files based on content. If you are using this flag, you will need to implement a mechanism to perform this splitting via user code.
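As a starting point for that user code, here is a minimal sketch that splits a line-delimited file into numbered part files. The function name, the part-NNNNN naming scheme, and the choice of newline-delimited records are all assumptions; adapt them to your own record format.

```python
import os


def split_lines(src_path, out_dir, lines_per_file):
    """Split a line-delimited file into numbered part files and return their paths."""
    os.makedirs(out_dir, exist_ok=True)
    parts = []
    out = None
    with open(src_path, "r", encoding="utf-8") as src:
        for index, line in enumerate(src):
            if index % lines_per_file == 0:
                # Start a new part file every lines_per_file records.
                if out is not None:
                    out.close()
                part_path = os.path.join(out_dir, f"part-{len(parts):05d}")
                out = open(part_path, "w", encoding="utf-8")
                parts.append(part_path)
            out.write(line)
    if out is not None:
        out.close()
    return parts
```

You could run something like this before uploading, or as the first pipeline in a DAG, so that downstream pipelines see one file per batch of records.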
New Spouts architecture
The Pachyderm 1 Spouts architecture has reached end of life. Pachyderm 2 Spouts rely on Pachyderm APIs or the pachctl command to create commits in the Spout’s output repo.
Existing Pachyderm 1 Spouts will need to be ported to Pachyderm 2 before migration.
End of life for build pipelines
Build pipelines are no longer supported in Pachyderm 2.
Standby option replaced by autoscaling
The standby option in the pipeline spec has been superseded by the autoscaling option. The specs for pipelines which use the standby option must be edited to replace it with a value for autoscaling.
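If you keep pipeline specs as JSON, the edit can be scripted. The sketch below assumes that autoscaling is a top-level boolean field that takes the place of standby; confirm the field's exact shape against the Pachyderm 2 pipeline specification before relying on it. The function name and the sample spec are hypothetical.

```python
import copy
import json


def standby_to_autoscaling(spec):
    """Return a copy of a pipeline spec with 'standby' replaced by 'autoscaling'."""
    new_spec = copy.deepcopy(spec)
    if "standby" in new_spec:
        # Assumption: autoscaling is a boolean, like the standby field it replaces.
        new_spec["autoscaling"] = bool(new_spec.pop("standby"))
    return new_spec


old_spec = {"pipeline": {"name": "example"}, "standby": True}
print(json.dumps(standby_to_autoscaling(old_spec), indent=2))
```

The original dictionary is left untouched, so you can compare the old and new specs before applying the change.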
Elimination of merges and new single-datum provenance
In Pachyderm 1, multiple datums from the same input repo in the same job in a pipeline could write to the same output file and the results would be merged, with indeterminate ordering of results in the merged file. In Pachyderm 2, if two datums from the same repo write to the same output file, Pachyderm raises an error.
If your pipelines rely on the Pachyderm 1 merge behavior, you will need to rewrite them to output to separate files. You may use filename metadata to group them for downstream use. If you need the files merged into a single file, you will need to add a pipeline that groups the files into single datums using that metadata and merges them using your code.
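The merging step in such a pipeline might look like the sketch below. It assumes a hypothetical naming convention in which each upstream file is written as <group>__<part>, so that the text before the double underscore identifies the merged file it belongs to. The function names and the convention are illustrative only.

```python
import os
from collections import defaultdict


def group_key(filename):
    """Hypothetical convention: '<group>__<part>' maps to merged file '<group>'."""
    return filename.split("__", 1)[0]


def merge_groups(in_dir, out_dir):
    """Concatenate the files of each group, in sorted name order, into one file per group."""
    os.makedirs(out_dir, exist_ok=True)
    groups = defaultdict(list)
    for name in sorted(os.listdir(in_dir)):
        groups[group_key(name)].append(name)
    for key, names in groups.items():
        with open(os.path.join(out_dir, key), "wb") as merged:
            for name in names:
                with open(os.path.join(in_dir, name), "rb") as part:
                    merged.write(part.read())
    return dict(groups)
```

Sorting the part names gives the merged file a deterministic order, which the Pachyderm 1 merge behavior did not guarantee.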
Helm charts now used for deployment
The pachctl deploy command is no longer supported in Pachyderm 2. It has been replaced by a Helm 3 chart.
Getting Ready for Pachyderm 2
Try out Pachyderm 2
Pachyderm 2 is ready for testing in your own Kubernetes clusters. We encourage you to set up a non-production test cluster today to get used to some of these new features and concepts.
Talk to us about migrating your Pachyderm 1 clusters
Enterprise customers should contact their Customer Engineering representative to put together a migration plan for getting their Pachyderm 1 clusters running on Pachyderm 2. Community Edition users should use the Pachyderm Slack channel for any questions or requests for assistance.