Reproducible Data Science in Pachyderm Hub

Working with tabular data on Pachyderm Hub hero image

The Reproducibility Crisis. Many in the Machine Learning community know it well. You have a new idea and want to tackle the state-of-the-art, but first you need to baseline the current state-of-the-art model. You have the code that produced the model and even the original dataset used in the prior work, but for some reason you cannot seem to get the same results.

Suddenly, you find yourself on a different quest entirely - reproducing a model that was already built. Until recently, reproducibility was a research area all its own, and even worse, it was often an afterthought in the machine learning community. People who do think about it typically rely on combinations of libraries and manual processes, but these create a reproducibility crisis of their own as teams struggle to maintain hastily written scripts. There is a better way.

In this post, we’ll show you how to use Pachyderm to make sure your machine learning pipelines are reliable and repeatable. We’ll spin up a cluster on the brand new Pachyderm Hub which gives us a Kubernetes cluster with Pachyderm deployed in minutes, with no Kubernetes expertise needed.

Introduction

So why use Pachyderm?

Pachyderm’s platform delivers a strong foundation for versioned data science. It does this through its automatic, Git-like versioned file system and its powerful pipeline processing engine. The system runs on Kubernetes, making it highly scalable while giving you reproducibility throughout your entire data science workflow.

The hosted version, Pachyderm Hub, means that you don’t even need to deal with setting up Kubernetes to use it. You bring your data and code; Pachyderm Hub scales it for you!

So let’s get started. We’ll spin up our cluster and train a regression model to predict housing prices. We’ll include most of what you need to get moving right here, but for a more detailed walkthrough, check out the full example on GitHub.

Set up Pachyderm Hub

First, install the Pachyderm CLI on your computer, so you can communicate with the cluster. Verify it installed successfully by running:

$ pachctl version --client-only
COMPONENT	VERSION 
pachctl		1.11.0

Now we’ll create our cluster. Visit the Hub page and create a free account by signing in with either Google or GitHub. This handles any authentication key generation for you when creating a cluster.

After logging in you will see all running Hub workspaces:

Pachyderm Hub Workspaces

We’ll create a new Pachyderm cluster by clicking the “Create New Workspace” button. Once the cluster is ready, select “Connect” to reveal the 3 commands we need to run to interact with our Pachyderm cluster.

Copy these into your terminal, and everything is ready to run.

Add Data to Pachyderm

The Pachyderm File System (PFS) makes it easy to add data to Pachyderm. We will:

  1. Create a data repository
  2. Push data to that repository

Behind the scenes, files are uploaded to a key-value store (think S3 or MinIO), meaning we can put essentially any type of file or data into a data repository.

The Dataset

We’ll use a standard dataset, the Boston Housing Dataset. It’s made up of data captured by the U.S. Census Service in the 1970s and is used to predict housing prices based on features of each area.

I use a reduced version of this dataset (only 3 features + the target) to keep the example quick, but everything described here would run exactly the same on the full 14-feature dataset. I’m also only using 100 examples initially (you’ll see why later on).

Here’s a quick description of the dataset we’re using:

Feature    Description
RM         Average number of rooms per dwelling
LSTAT      A measurement of the socioeconomic status of people living in the area
PTRATIO    Pupil-teacher ratio by town - an approximation of the local education system’s quality
MEDV       Median value of owner-occupied homes in $1000’s

Sample:

RM      LSTAT   PTRATIO   MEDV
6.575   4.98    15.3      504000.0
6.421   9.14    17.8      453600.0
7.185   4.03    17.8      728700.0
6.998   2.94    18.7      701400.0

Create a Data Repository

In order to add the dataset to Pachyderm, we first need to create a data repository to put it in. We can think of this as a directory or an S3 bucket with versioning information on every file in it. Here, we’ll use the Pachyderm CLI to create a data repository called housing_data to hold the dataset.

$ pachctl create repo housing_data
$ pachctl list repo
NAME 		 CREATED 	   SIZE
housing_data 3 seconds ago 0 B

Add data to the Data Repository

Adding data to the repository is equally simple. Pachyderm uses the convention <repo>@<branch>:<file_name> to know where to place the file. Here our branch will be master, which is the main branch. There are many parallels with Git, such as the ability to branch your data.

$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-1.csv

We can confirm the data made it in by listing the files in the repository.

$ pachctl list file housing_data@master
NAME 					TYPE SIZE
/housing-simplified.csv file 12.14KiB

Create a Processing Pipeline

Processing pipelines are what we’re typically used to writing as data scientists. Whether it’s data pre-processing, data cleaning, or training and testing, this is where the code meets our data. Here, we’ll create one simple pipeline that does three things: data analysis, model training, and evaluation.

Data Analysis

There are plenty of blog posts on exploratory data analysis, so I won’t spend too much time here, but essentially, machine learning is very much a garbage-in, garbage-out game. There’s no substitute for knowing your data, and I usually spend a lot of time in this step. Here we’ll use pandas data frames and seaborn (a sketch of this step follows the figures below).

For the housing prices data, we’ll look at 2 things:

  • A correlation matrix that shows what features correlate with the target class and also which ones correlate with each other.

  • A pair plot visualizing this correlation. I like the pair plot because it allows me to see if there are a large number of outliers and inspect them if needed.

Correlation Matrix and Pair Plot
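
Below is a minimal sketch of this step, assuming the simplified CSV we uploaded earlier. The output file names mirror the artifacts you’ll see later in the pipeline’s output repo, but the exact plotting code lives in the example repo on GitHub.

# Sketch of the exploratory analysis step (illustrative, not the repo's exact code)
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Load the simplified housing data (columns: RM, LSTAT, PTRATIO, MEDV)
df = pd.read_csv("data/housing-simplified-1.csv")

# Correlation matrix: which features track the target (MEDV) and each other
sns.heatmap(df.corr(), annot=True, cmap="coolwarm")
plt.savefig("housing-simplified_corr_matrix.png")
plt.close()

# Pair plot: pairwise scatter plots, useful for spotting outliers
sns.pairplot(df).savefig("housing-simplified_pairplot.png")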

Train a Regression Model

To predict home value in dollars, we’re going to be training a regression model. For this example, we’ll use a Random Forest Regressor Ensemble from sklearn and train it with 10-fold cross-validation. The code used to train the model is shown below.

# Train a Random Forest Regression model with 10-fold cross-validation
from sklearn import ensemble
from sklearn.model_selection import cross_val_score

reg = ensemble.RandomForestRegressor(random_state=1)
scores = cross_val_score(reg, features, targets, cv=10)
print("Score: {:.2f} (+/- {:.2f})".format(scores.mean(), scores.std() * 2))

Evaluate the Model

Once our regression model is trained, we need to know how well it will generalize. A learning curve is a useful graph showing the relationship between the training set size and the performance of the model. It can help us understand if adding more data is useful for improving the model. See the GitHub code for how to create the learning curve.
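
As a rough illustration (not the example repo’s exact code), scikit-learn’s learning_curve helper can produce the same kind of plot:

# Sketch of a learning curve, assuming the simplified CSV described above
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn import ensemble
from sklearn.model_selection import learning_curve

df = pd.read_csv("data/housing-simplified-1.csv")
features = df.drop(columns=["MEDV"])
targets = df["MEDV"]

# Re-train on increasing fractions of the data and record cross-validated scores
reg = ensemble.RandomForestRegressor(random_state=1)
train_sizes, train_scores, test_scores = learning_curve(
    reg, features, targets, cv=10, train_sizes=np.linspace(0.1, 1.0, 5)
)

plt.plot(train_sizes, train_scores.mean(axis=1), "o-", label="Training score")
plt.plot(train_sizes, test_scores.mean(axis=1), "o-", label="Cross-validation score")
plt.xlabel("Training set size")
plt.ylabel("Score")
plt.legend()
plt.savefig("learning_curve.png")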

Create a Pachyderm Pipeline from Our Code

Pachyderm pipelines connect code built into Docker images with the data contained in PFS repositories. Because the code is containerized, work can be scaled and distributed across a cluster with little overhead. A Pachyderm pipeline is defined in a JSON specification file. The one we will use for our regression pipeline is shown below.

{
  "pipeline": {
    "name": "regression"
  },
  "description": "A pipeline that trains produces a regression model for housing prices.",
  "input": {
    "pfs": {
      "glob": "/*",
      "repo": "housing_data"
    }
  },
  "transform": {
    "cmd": [
      "python", "regression.py",
      "--input", "/pfs/housing_data/",
      "--target-col", "MEDV",
      "--output", "/pfs/out/"
    ],
    "image": "pachyderm/housing-prices:1.11.0"
  }
}

For the input field in the pipeline definition, we define the input data repo(s) and a glob pattern, which tells the pipeline how to map data into jobs (here, /* treats each top-level file or directory in the repo as its own unit of work). The image defines which Docker image will be used, and the transform’s cmd is the command that runs when the pipeline’s job executes.

We can deploy the pipeline by running:

$ pachctl create pipeline -f regression.json

Once this pipeline is created, it starts a new job to execute our three steps and writes the output to a new PFS repo named after the pipeline (regression).

Note: /pfs/out/ maps to regression outside the container.
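
To make that mapping concrete, here is a hedged sketch of how a script like regression.py might consume the arguments from transform.cmd above; the argument names match the spec, but the body is purely illustrative.

# Sketch of how a pipeline script can read from /pfs/<repo> and write to /pfs/out
import argparse
import os
import pandas as pd

parser = argparse.ArgumentParser()
parser.add_argument("--input", help="directory with input CSVs, e.g. /pfs/housing_data/")
parser.add_argument("--target-col", help="name of the target column, e.g. MEDV")
parser.add_argument("--output", help="directory for artifacts, e.g. /pfs/out/")
args = parser.parse_args()

# Everything under /pfs/<repo> is the versioned input; anything written to
# /pfs/out/ becomes part of the pipeline's output commit.
for name in os.listdir(args.input):
    df = pd.read_csv(os.path.join(args.input, name))
    features = df.drop(columns=[args.target_col])
    targets = df[args.target_col]
    # ... analysis, training, and evaluation would go here ...
    df.describe().to_csv(os.path.join(args.output, name + "_summary.csv"))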

We can inspect the status of jobs created by pipelines by running:

$ pachctl list job
ID 								 PIPELINE 	STARTED 	   DURATION   RESTART PROGRESS DL  UL 	   STATE
299b4f36535e47e399e7df7fc6ee2f7f regression 23 seconds ago 18 seconds 0 1 + 0 / 1 2.482KiB 1002KiB success

Note that while we could have split each of the steps into a separate pipeline, for simplicity in this example, we left them together.

Initial Run

When the initial run of our pipeline is completed, we can view the artifacts created and download them.

$ pachctl list file regression@master
NAME                                   TYPE SIZE
/housing-simplified_corr_matrix.png    file 18.66KiB
/housing-simplified_cv_reg_output.png  file 62.19KiB
/housing-simplified_final_model.sav    file 1.007KiB
/housing-simplified_pairplot.png       file 207.5KiB

$ pachctl get file regression@master:/ --recursive --output .

Updating our Dataset

Now, let’s inspect the output of our model. When we look at the learning curve, we can see that there is a large variance in the cross-validation scores. This indicates that our model could benefit from the addition of more data.

If you recall, I only used 100 examples from the Housing Dataset. Let’s see what happens if we add more examples.

Since we already have our processing pipeline connected to the data repository, I only need to update the data and Pachyderm will detect that the pipeline needs to rerun. This shows one of the most powerful aspects of Pachyderm - data-driven pipelines.

We’ll update our dataset using the following command:

$ pachctl put file housing_data@master:housing-simplified.csv -f data/housing-simplified-2.csv --overwrite

This creates a new commit to the housing_data repository, which in turn, automatically starts a regression job. We don’t have to resubmit the pipeline, because Pachyderm’s event-driven architecture notifies the pipeline that the input data has changed.

Note: We don’t actually lose our old data. It is in a previous commit of the housing_data repository, which we can roll back to at any time.

When the job is complete, we can download the new files, which show that our model improved thanks to the additional data.

Model Lineage

In our simple example, we only have one pipeline and two experiments, which isn’t difficult to keep track of. However, in production applications there may be many iterations on both data and pipelines, and that is where things get very messy. Sooner or later someone loses the version of the dataset a model was trained on, or cannot reproduce the results that someone else achieved on their own machine.

Since Pachyderm versions data as well as the pipelines associated with it, it can tell us exactly which version of the dataset and pipeline created a model or any other artifact. We can use the automatically created commit IDs to inspect any job and see how its artifacts were produced.

$ pachctl inspect commit regression@f59a6663073b4e81a2d2ab3b4b7c68fc
Commit: regression@f59a6663073b4e81a2d2ab3b4b7c68fc
Original Branch: master
Parent: bc0ecea5a2cd43349a9db3e89933fb42
Started: 7 minutes ago
Finished: 7 minutes ago
Size: 4.028MiB

Provenance: __spec__@5b17c425a8d54026a6daaeaf8721707a (regression) housing_data@a186886de0bf430ebf6fce4d538d4db7 (master)

Conclusion

We hope that this served as a useful introduction to Pachyderm. It is an incredibly powerful platform that can be used to achieve reproducibility, construct data-driven pipelines, and understand the lineage between all of the components as pipelines and datasets grow over time. Moreover, using Pachyderm has never been easier with Pachyderm Hub. By removing the complexity of Kubernetes, Pachyderm Hub significantly lowers the barrier to entry for those who need scalable, reliable, and reproducible data science.