Pachyderm 1.10 - S3 Gateway Expansion & Kubeflow Support

Pachyderm now supports Kubeflow

Today we’re excited to announce that with Pachyderm 1.10, you can now integrate Pachyderm repos with Kubeflow Pipelines. Pachyderm’s S3 Gateway feature lets you directly leverage Pachyderm’s data lineage capabilities right inside your Kubeflow environment.

Along with the rest of 1.10, this feature signals our commitment to integrate more deeply with the vast and thriving community of data science tools. Ever since we started collaborating with Kubeflow back in 2017, we’ve seen the great potential of the project and we’ve enjoyed working with them in the community and the customers we share.

Kubeflow and Pachyderm, Better Together

Platforms like Kubeflow run their own set of Kubernetes pods. In previous versions of Pachyderm, the only way to deploy the Pachyderm S3 gateway was as a standalone service. It allowed users to read and write data, but it didn’t manage data lineage.

In 1.10, we’ve added the ability to deploy the Pachyderm S3 gateway as a sidecar to Kubeflow pipelines. Running as a sidecar, the S3 gateway can directly access and track the Pachyderm pipeline’s data lineage and history, while running jobs read and write data through an S3-compatible interface.

Pachyderm and Kubeflow Sidecar Diagram

If you’re building pipelines in Kubeflow, you can leverage Pachyderm’s powerful data versioning and lineage capabilities directly from Kubeflow, just like you would with any other object storage endpoint. Even better, it doesn’t matter how your Kubeflow pipelines are written or what they’re doing; as long as they talk to that S3 endpoint, the two platforms should work smoothly together.

Let’s dive in and get to work applying what we’ve talked about so you can see it for yourself. We’ll show you first hand just how easy it is to deliver true data lineage to your Kubeflow pipelines.

Step 1 - Install and deploy Kubeflow and Pachyderm

Part of what makes Pachyderm and Kubeflow work so well together is that they’re both built on Kubernetes, which means they can run virtually anywhere. While each has its own deployment instructions for various infrastructures, this tutorial is based on Google Kubernetes Engine (GKE). Before continuing, make sure you have the following installed on your local machine:

Prerequisites:

Deploy:

To keep things simple, we created a bash script specifically for this post; you can use it to deploy Pachyderm and Kubeflow together on GKE in no time. However, if you prefer to do this on your local machine, or on any other infrastructure, please refer to the links below:

Once everything is deployed, the easiest way to connect to Kubeflow is via port-forward:

kubectl port-forward svc/istio-ingressgateway -n istio-system 8080:80

Working setup check:

  1. kubectl get pods -n kubeflow returns running pods.
  2. pachctl version returns both pachctl and pachd versions.

Step 2 - Checking in your data

With everything configured and working, it’s time to grab some data and check it into Pachyderm. To do so, download the mnist.npz dataset to your local machine and check it in as follows.

Download the mnist.npz file to a blank directory on your local machine:

curl -O https://storage.googleapis.com/tensorflow/tf-keras-datasets/mnist.npz
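Before checking the file in, you may want a quick sanity check that the download succeeded. Here’s a minimal sketch in Python (assuming numpy is installed; the helper name summarize_npz is ours, not part of the example repo):

```python
import numpy as np

def summarize_npz(path):
    """Return {array_name: shape} for every array stored in an .npz file."""
    with np.load(path) as data:
        return {name: data[name].shape for name in data.files}

# Usage (after the curl above):
#   summarize_npz("mnist.npz")
# The TensorFlow-hosted file contains x_train, y_train, x_test, and y_test.
```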

  1. Create a Pachyderm repo called input-repo:

pachctl create repo input-repo

  2. Check the mnist.npz file into input-repo:

pachctl put file input-repo@master:/mnist.npz -f mnist.npz

This command copies the mnist dataset from your local machine to the Pachyderm repo input-repo, and Pachyderm assigns it a commit ID. Congratulations! Your data now has a HEAD commit, and Pachyderm has begun version-controlling it!

Confirm that the data is checked-in by running the following command:

➜ pachctl list file input-repo@master
NAME       TYPE SIZE
/mnist.npz file 10.96MiB

Step 3 - Deploying code to work with MNIST

Now that the data is checked in, it’s time to deploy some code. In the same directory run the following:

git clone https://github.com/pachyderm/pachyderm.git && cd pachyderm/examples/kubeflow/mnist/

Next, let’s take a look at the Pachyderm pipeline spec file pipeline.yaml as this is the start of our explainable, repeatable, scalable mnist-pipeline.

pipeline:
  name: mnist
transform:
  image: pachyderm/mnist_pachyderm_pipeline:v1.0.0
  cmd: [ /app/pipeline.py ]
s3_out: true  # Must be set
input:
  pfs:
    name: input      # Name of the input bucket
    repo: input-repo # Pachyderm repo accessed via this bucket
    glob: /          # Must be exactly this
    s3: true         # Must be set

For the most part, it’s a standard Pachyderm pipeline spec. We have our usual transform step, which declares a Docker image and tells it to run pipeline.py. Below that is where you’ll see a few of the new S3 gateway features in use, primarily s3_out: true and s3: true. Setting s3_out: true allows your pipeline code to write results out to an S3 gateway bucket instead of the typical pfs/out directory. Similarly, s3: true tells Pachyderm to mount the given input as an S3 gateway bucket.
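To make the effect of those two flags concrete, here’s a hedged sketch of how pipeline code might talk to the sidecar’s buckets. The assumptions are ours, not from the example: the endpoint URL is supplied by the environment, the input bucket takes the pfs name above (input), and s3_out exposes an output bucket named out. The object keys model.h5 and the helper names are illustrative.

```python
def gateway_locations(endpoint, input_name="input", output_name="out"):
    """Compose the S3 locations a pipeline would read from and write to
    through the sidecar gateway (bucket names here are assumptions)."""
    return {
        "endpoint": endpoint,
        "input": f"s3://{input_name}/mnist.npz",
        "output": f"s3://{output_name}/model.h5",
    }

def copy_through_gateway(endpoint):
    # boto3 is imported lazily so the pure helper above has no dependencies;
    # this function only works inside a pod with the sidecar running.
    import boto3
    s3 = boto3.client("s3", endpoint_url=endpoint)
    s3.download_file("input", "mnist.npz", "/tmp/mnist.npz")
    # ... train a model on /tmp/mnist.npz here ...
    s3.upload_file("/tmp/model.h5", "out", "model.h5")
```

The key point is that nothing here is Pachyderm-specific: any S3 client pointed at the sidecar’s endpoint sees ordinary buckets, while Pachyderm records lineage behind the scenes.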

Next, open up the pipeline.py file, and you’ll see that apart from a few Kubeflow-specific bits, lines 1-71 are a pretty standard MNIST training example using TensorFlow.

Kubeflow users will notice that from line 73 down, we’re just declaring a Kubeflow pipeline (KFP) using the standard Kubeflow Pipelines SDK. Then, on line 88, we call create_run_from_pipeline_func to run the KFP with a couple of additional arguments that declare the S3 endpoints provided by the Pachyderm S3 gateway. In our case, the input will be the input-repo bucket that contains our MNIST training data, and the KFP will write the trained model back out through the Pachyderm S3 gateway to our output repo.
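The wiring described above can be sketched roughly as follows. This is not the exact contents of the example’s pipeline.py; the argument names, default bucket names, and helper functions are our assumptions, and running it requires the kfp SDK and a reachable Kubeflow Pipelines host.

```python
def s3_run_arguments(s3_endpoint, input_bucket="input", output_bucket="out"):
    """Build the extra pipeline arguments that point the KFP run at the
    Pachyderm S3 gateway sidecar (argument names are assumptions)."""
    return {
        "s3_endpoint": s3_endpoint,
        "input_bucket": input_bucket,
        "output_bucket": output_bucket,
    }

def run_pipeline(pipeline_func, s3_endpoint):
    # kfp is imported lazily; this call needs a live Kubeflow Pipelines host.
    import kfp
    client = kfp.Client()
    return client.create_run_from_pipeline_func(
        pipeline_func,
        arguments=s3_run_arguments(s3_endpoint),
    )
```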

Step 4 - Data Lineage in action

That takes care of the code. Next, let’s move on to deploying everything so we can train our model.

pachctl create pipeline -f pipeline.yaml

You can keep an eye on progress by either running pachctl list job or by looking at the Experiments tab in the Kubeflow Dashboard.

Kubeflow Dashboard

Once the job (or “run” in Kubeflow terms) is complete, you should see a model file in your Pachyderm mnist repo (created automatically when we ran pachctl create pipeline). You can check for yourself with:

pachctl list file mnist@master

➜ pachctl inspect commit mnist@master
Commit: mnist@a64bfc6da6714c23a44db5c984850db2
Original Branch: master
Parent: 60d23801e61f4b3c975b7f0be1f9f208
Started: 56 seconds ago
Finished: 24 seconds ago
Size: 4.684MiB
Provenance:  __spec__@bd9f2665811f42ea9b1e3a56dc70c0b1 (mnist)  input-repo@a8dbe24ea9c047129a170cae95aa292a (master)

Notice the Provenance line. That right there is proof you just version-controlled your data as well as the machine learning model that was created from it.

Because you incorporated data lineage into your workflow using Pachyderm, you can restore previous versions of your data and model at any time. That’s incredibly valuable when you consider how often you have to answer the question “What exact data was used to train that model?” Thanks to Pachyderm, you can answer it with a single command. And when an auditor asks, “What data was used to train that model three months ago?”, that’s just one Pachyderm command away too.

Interested in a private demo of what you just read? Let us know.

About the Author

Nick Harvey

Nick Harvey is the Head of Marketing at Pachyderm and a father of two. He's spent the last decade working on open source, machine learning, and all things Kubernetes.