How to Build a Robust ML Workflow With Pachyderm and Seldon

Enrico Rotundo

Winder AI

This webinar presents how to integrate Pachyderm with Seldon Deploy for a repeatable, monitored machine learning operations workflow. Enrico Rotundo at Winder AI presents this tech demo of an integration between Pachyderm and Seldon Deploy that is available on GitHub.

This example shows how a repeatable machine learning deployment cycle can be built with version-controlled data pipelines and model monitoring for a fully auditable machine learning workflow.

Huge thanks to MLOps consultancy Winder.AI for this great demo, and be sure to check out the Pachyderm - Seldon Deploy Integration on GitHub.

Webinar Transcript


Hi. My name is Enrico Rotundo, and in this video, I'll show you how to integrate and operate Pachyderm along with Seldon Deploy. Just a quick summary of these two systems first. Pachyderm is a data layer that brings together data version control and automated pipelines, providing lineage. You can build data processing workflows where each step is tracked and version controlled automatically for you. Seldon Deploy is the enterprise product that extends the open-source Seldon Core functionality for deploying machine learning graphs. It comes with a dashboard, a REST API, and even a Python library for easy use. On top of that, the Seldon Alibi package has a set of tools to closely monitor your deployments for unwanted behaviours. Both of these systems are built on top of Kubernetes. That means they can scale out with your workload and can run natively in the cloud or in on-premise environments.

The purpose of this demo is to highlight how Pachyderm and Seldon Deploy can be integrated to create an end-to-end machine learning deployment, from model training to monitoring a live instance. Dealing with live data is no trivial task. Unlike in a controlled lab, live data is always growing and changing at a very fast pace. That means you always want to keep an eye on what your models are doing and detect any unexpected situation early. Pachyderm takes a data-centric approach. It keeps datasets, models, and pipelines close to each other, so it is easy for your team to manage them. The benefits of integrating Pachyderm with Seldon Deploy are multiple. With lineage, you can always track a model down to its source dataset. With automation, your entire pipeline reacts to input changes and trains fresh models as soon as new data becomes available. And with monitoring, your live ML models are constantly watched over so that you can detect problems early.

In the first part of this video, I'll explain how to create a Kubernetes cluster and install the prerequisites. The second part is a walkthrough of an example credit institute that employs machine learning for its services. They're going to face real-life challenges in live machine learning. But thanks to Pachyderm and Seldon, they will manage them successfully. I'll show you how and why that is possible.

Before we dive into the installation process, I would like to give you a glimpse of the demo repository. There are three folders, the first two being related to the set of options we provide for running this demo. In fact, you can do this on Google Cloud or on a local machine. The third folder contains the tutorial, or demo notebook. This will be shown in the second part of the video. The Docker containers are in these other folders; they specialise in training a specific set of models and deploying them into Seldon.

Part 1: Provision & Install the Integration

Let's start with the installation of the prerequisites for the demo. Head over to the GKE install folder, where you'll find a README with step-by-step instructions. Here we'll install Pachyderm and Seldon Deploy with their related dependencies. Now, Seldon Deploy requires you to install a set of packages for a number of use cases that I'm going to detail soon. Before stepping into that, though, make sure you have the Google Cloud SDK client installed on your local machine and that you have Docker running to extract the Seldon Deploy installation files. Your account should also have enough quota to create the cloud resources we're going to need. For creating a cluster, you can just run this command, here specifying the number of nodes as 3. We need three 8-vCPU machines to run the full demo seamlessly.
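As a rough sketch of the cluster-creation step, the snippet below assembles (without running) a gcloud command for a three-node GKE cluster of 8-vCPU machines. The cluster name, the zone, and the choice of n1-standard-8 as the 8-vCPU machine type are assumptions for illustration; the README in the GKE install folder has the exact command used in the demo.

```python
import shlex


def gke_create_command(name, zone, num_nodes=3, machine_type="n1-standard-8"):
    """Build (but do not run) a `gcloud container clusters create` command."""
    return [
        "gcloud", "container", "clusters", "create", name,
        "--zone", zone,
        "--num-nodes", str(num_nodes),     # three nodes, as stated in the video
        "--machine-type", machine_type,    # assumed 8-vCPU machine type
    ]


# Hypothetical cluster name and zone, chosen only for this example.
cmd = gke_create_command("pach-seldon-demo", "europe-west1-b")
print(shlex.join(cmd))
```

Building the command as a list keeps each flag and its value explicit, and the list can be handed to subprocess.run once the values have been checked against your own project and quota.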

Installing Seldon

Seldon Deploy needs Istio and Knative Serving. Now, Istio is the system in charge of routing inbound requests to Seldon pods. And Knative Serving is going to be used to spawn stateless workloads such as Seldon models. We're going to install these via Anthos, which comes along with the gcloud suite. This is also going to create an HTTP load balancer.

Right after that, we can create an Istio gateway for Seldon. This is going to expose the Seldon API on port 80 to the external world, which is the entry point we'll be using to deploy our models. Now, side note, if you're running on a private cluster, you may need to open a specific port. Here are the instructions if those are needed.

Now, Seldon pods need a way to communicate with each other, and Knative Eventing provides a message bus that does exactly that. So we're also going to create a Knative event broker, which will allow this communication among pods. This is going to be created in the seldon-logs namespace. Right after that, we can just install the Seldon Core open-source component. We now download the installation files for Seldon Deploy. We install Elasticsearch, Fluentd, and Kibana to allow Seldon Deploy to make logs searchable and presentable on the UI. Then, because we have already extracted the Seldon Deploy installation files, we can just run the installation. We need to make sure that the Seldon namespace is visible so that any container deployed within that namespace is going to be visible on the Seldon Deploy UI as well.

As the last dependency for Seldon, we need to install the Analytics package. This is necessary to collect and process monitoring logs from components such as the drift detector. Now, in order to reach the Seldon Deploy UI from our local machine, we'll have to run this port forward in a separate terminal. Right after that, just visit this URL, where you will be able to activate your Seldon Deploy instance with a licence key. And then just use the UI.

Installing Pachyderm

It's almost time to install Pachyderm. First, we install the client locally. And then we're going to use Pachyderm's native deploy google installer. This requires you to create a storage bucket first. You might also want to use a volume storage size of 10 gigabytes, which is going to be enough for this demo. It's going to be necessary to create custom cluster role bindings for your user account so that Pachyderm can proceed with its deployment. Once that is done, Pachyderm is going to deploy with a single command.

Pachyderm also comes with a dashboard. To make it accessible from our machine, we again have to run a port forward in a separate terminal. When that is running, we can visit http://localhost:30080 and activate and use our Pachyderm dashboard. The last step is the creation of a cluster role binding for Pachyderm workers. This is necessary so that pipelines can create Kubernetes Secrets that Seldon containers are going to use to pull models at deployment time. That's all for the installation. We can now move on to the demo notebook.

Part 2: An Example MLOps Lifecycle for Finance

In what follows, I'll guide you through the story of CreditCo, a hypothetical credit institution that employs machine learning to predict individuals' creditworthiness. CreditCo is a fast-growing company, rolling out its services first in the US market, with the intention of scaling out globally. However, as the company expands, its data changes as well, and so it needs a flexible ML infrastructure capable of monitoring live deployments and quickly adapting to changes by retraining models as needed. What I'm going to show is a typical release, update, rollback, fix, and re-release cycle in the live system. If you followed the first part of the video, you should now have a Kubernetes cluster fully set up. To run through this notebook, you need a Kubernetes client and the Pachyderm client installed, and the ability to pull Docker images from the public Docker Hub into your cluster.

This demo works for Pachyderm 1.13, but it should also work with subsequent versions. We're dealing with data in this notebook, so I'll just start by populating the input data repository with the first CSV dataset. As you can see, Pachyderm automatically tracks this operation and creates a commit. The Pachyderm dashboard is also displaying this newly populated repository along with some other details. All right, it's almost time to create our machine learning pipelines. Essentially, they will train a variety of models based on the input dataset. The result is a set of artefacts that Seldon will deploy later on. Here's the income model training pipeline. The input to this pipeline is the income dataset hosted on the repository that I populated a few moments ago. This pipeline is based on a prebuilt container. It runs scikit-learn, a popular Python library for machine learning, and it trains a logistic regression forecaster.
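To make the shape of such a training pipeline concrete, here is a minimal sketch that builds a Pachyderm 1.x style pipeline spec as a Python dictionary and prints it as JSON. The pipeline name, repository name, image, and command are hypothetical placeholders, not the ones in the demo notebook; only the overall layout (a pfs input mounted under /pfs/<repo>, a transform, and results written to /pfs/out) follows the Pachyderm pipeline specification.

```python
import json


def training_pipeline_spec(name, input_repo, image, cmd):
    """Return a Pachyderm 1.x style pipeline spec as a plain dictionary."""
    return {
        "pipeline": {"name": name},
        "description": "Train a model whenever the input repo gets a new commit.",
        # glob "/" treats the whole repo as a single datum.
        "input": {"pfs": {"repo": input_repo, "glob": "/"}},
        "transform": {"image": image, "cmd": cmd},
    }


# All names below are hypothetical and only illustrate the structure.
spec = training_pipeline_spec(
    name="income-training",
    input_repo="income_data",
    image="example/income-training:latest",
    cmd=["python", "train.py", "/pfs/income_data", "/pfs/out"],
)
print(json.dumps(spec, indent=2))
```

Because the input is a repository rather than a fixed file, a pipeline defined this way reruns on every new commit, which is what makes the retraining later in the demo automatic.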

Income Model Forecasting

Now, without going into the details, this is a statistical model that takes inputs such as age, occupation, and others to predict whether an individual is eligible for credit. The next training pipeline generates a different type of model. This time, I use Seldon Alibi to create a model explainer. This is the first of the three models that we're going to create and then use as monitoring tools. The way it works is that it takes a forecast and points out the features that impacted it the most. In other words, given a prediction, it studies its vicinity in a mathematical space, working out how different features combine to result in that specific forecast. Follow the Alibi documentation link for further details on this step.

Next, I create a drift detector training pipeline. Drift is a phenomenon that affects live machine learning systems, in which the relationship between the model input and the predicted variable changes over time. A drifted model is likely to be unstable, so it is best to use a drift detector to check for that. It will raise an alarm if it detects something wrong, which usually indicates it is a good time to work on a new model. The last pipeline is for training an outlier detector. Now, this is another monitoring tool based on Seldon Alibi. It is in charge of inspecting incoming requests to the deployment, comparing their features with a trained threshold, and flagging those that hold out-of-the-ordinary values. Now, if we go check the Pachyderm dashboard, we can see a more structured dependency graph.
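To give an intuition for what a drift check does, here is a small standard-library sketch that compares a feature's distribution in the training data with its distribution in live requests, using the two-sample Kolmogorov-Smirnov statistic. This is a stand-in for illustration and not the detector trained in the demo; the sample values and the 0.5 threshold are made up, and a real detector would typically work from a p-value rather than a fixed cut-off.

```python
from bisect import bisect_right


def ks_statistic(reference, live):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between empirical CDFs."""
    ref, cur = sorted(reference), sorted(live)
    gap = 0.0
    # The empirical CDFs only change at observed values, so checking those suffices.
    for x in set(ref) | set(cur):
        cdf_ref = bisect_right(ref, x) / len(ref)
        cdf_cur = bisect_right(cur, x) / len(cur)
        gap = max(gap, abs(cdf_ref - cdf_cur))
    return gap


def drift_detected(reference, live, threshold=0.5):
    """Flag drift when the distributions differ by more than an (arbitrary) threshold."""
    return ks_statistic(reference, live) > threshold


# Made-up ages: one live batch resembles training data, the other has shifted.
training_ages = [23, 31, 35, 42, 47, 51, 58, 63]
similar_ages = [25, 30, 38, 41, 49, 52, 57, 60]
shifted_ages = [66, 68, 70, 71, 73, 75, 78, 80]

print(drift_detected(training_ages, similar_ages))
print(drift_detected(training_ages, shifted_ages))
```

Comparing input distributions like this is only a proxy for the drift described above, but it needs no labels, which is why it suits live traffic where the true outcomes arrive late or never.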

Thanks to Pachyderm's data-driven pipeline automation, as soon as a new dataset is added upstream, the training pipelines will fire and create fresh artefacts. Finally, I'm adding one more pipeline that collects all artefacts into a single location, which I call copy models. Instead of copying files physically, Pachyderm allows me to use symlinks, which are substitutes that act as file pointers. You may want to read more about the empty_files spec in the Pachyderm documentation to understand how to keep these efficient. The current state is that I have all models stored in one single location, but I want to decouple training and deployment because, for deployment, I do want to have a human in the loop who decides which version of the model should go live.

Here, I create a deployment service. A service is a special type of pipeline. It is designed to expose data to the external world rather than transform it. In this case, I use it with S3 inputs enabled. This means that the deploy production branch of copy models exposes an S3 endpoint. I pass the S3 endpoint to the deployer script as a parameter. With this environment variable, Seldon will simply fetch the models from this S3 location and deploy them.
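The deployment service described above could be sketched as a pipeline spec along these lines. Everything named here is an assumption for illustration: the pipeline, repository, and branch names, the image, the ports, the deployer script and its flag, and the use of S3_ENDPOINT as the environment variable that carries the endpoint. The demo repository holds the real spec.

```python
import json


def deploy_service_spec(models_repo, branch, image):
    """Sketch of a service pipeline that reads its input over an S3 endpoint."""
    return {
        "pipeline": {"name": "deploy-production"},
        "input": {
            "pfs": {
                "repo": models_repo,
                "branch": branch,  # only commits on this branch trigger a deployment
                "s3": True,        # expose the input via S3 instead of mounting files
                "glob": "/",
            }
        },
        "transform": {
            "image": image,
            # Hypothetical deployer script; it reads the endpoint from the environment.
            "cmd": ["python", "deploy.py", "--endpoint-env", "S3_ENDPOINT"],
        },
        # A service exposes a port instead of writing output data.
        "service": {"internal_port": 8080, "external_port": 30888},
    }


spec = deploy_service_spec("copy_models", "deploy-production", "example/deployer:latest")
print(json.dumps(spec, indent=2))
```

Pinning the input to a dedicated branch rather than master is what puts a human in the loop: new models can land on master at any time, but nothing is deployed until someone moves the deployment branch.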

Machine Learning Model Deployment

Okay, now that all pipelines are ready, CreditCo wants to go live. To deploy model V1 in production, all I need to do is some branching work. So I move this branch HEAD to master so that it points to the latest model version. Thanks to Pachyderm automation, the four new artefacts are going to be created automatically. And Seldon is now launching its pods. Once the Seldon pods are ready, we can check its user interface and visualise the deployment.
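The branching work can be pictured with a toy model in which commits are immutable and a branch is just a named pointer to one of them. This is an illustration of the idea only, not Pachyderm's implementation or API; the class, the commit IDs, and the branch names are invented for the example.

```python
class Repo:
    """Toy versioned repo: commits are immutable, branches are pointers to commits."""

    def __init__(self):
        self.commits = []   # commit IDs in creation order
        self.branches = {}  # branch name -> commit ID

    def commit(self, branch, commit_id):
        """Record a new commit and advance the given branch to it."""
        self.commits.append(commit_id)
        self.branches[branch] = commit_id

    def move_branch(self, branch, target):
        """Point `branch` at another branch's head or at a specific commit ID."""
        head = self.branches.get(target, target)
        if head not in self.commits:
            raise ValueError(f"unknown commit or branch: {target}")
        self.branches[branch] = head


models = Repo()
models.commit("master", "c1-model-v1")
models.move_branch("deploy-production", "master")  # go live with V1
released = models.branches["deploy-production"]

models.commit("master", "c2-model-v2")             # retraining adds a new commit
after_retrain = models.branches["deploy-production"]

print(released, after_retrain)
```

Note that the production branch stays on the first commit after retraining. A branch can also be pointed at an earlier commit ID, so the same operation that releases a model can restore a previous one.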

Now, thanks to Pachyderm's provenance as well, CreditCo can check at any time the model version that has been deployed, simply by matching the commit hashes from the model repository with the one stored in the Seldon container. The first model is live, and the company's customers are sending credit assessment requests. Now, for demo purposes, I simulate this with a test dataset that I prepared in advance. It contains a number of requests, and Seldon is going to return multiple predictions. So to do that, I use the predict upload JSON file functionality. And there you go. As part of the monitoring process, the company wants to check for model drift. Now, Seldon has a specific view for that under Monitor, in Drift Detection. And we see that this check returns negative.

CreditCo is committed to making sure its ML model is fair, specifically with respect to minority groups. So the company has included a model explanation check in its monitoring process. The data team inspects a number of predictions using the Seldon explainer. To do so, open the Requests tab and click on the explainer logo attached to each prediction. This page shows the most impactful variables for the given prediction. It seems that there is no reason for concern for now. In the meantime, the company has been growing a lot, its service has become popular, and the user base has grown as well, including more and more non-US-native individuals. This is truly a good sign for the business unit, but this new user group has now shifted the actual data distribution.

I demo this scenario by submitting a new test set to the model and then going back to the Monitor and Drift Detection tab to check. This time, there is a drift being detected by our drift detector model. This means the current model version is no longer suitable for its purpose. CreditCo is now in deep water and needs a new model. To react quickly, it acquires a dataset from a vendor. This dataset is specifically built to represent the new user group well. To train a new set of models, all I need to do is push the V2 dataset into the input repository.

Again, to create a new deployment with the latest model, all I need to do is point the deploy production branch to master and wait for the deployment to roll out. I've just deployed model V2, and let's say it's been running for a while. Now, the company gets more requests, and to simulate that scenario, I'll submit yet another dataset. As part of the monitoring procedure, the team again opens the Requests tab and checks incoming requests for problems such as outliers. And this time, well, they find that the outlier detector has thrown a red flag. So an investigation starts, and they use the model explainer to understand a little bit more about what's going on.

The investigation finds out that, well, this model version is now using gender to make predictions and is systematically denying credit to female clients. This is a ticking-bomb situation for CreditCo. If this went public, the company would have to deal with a serious scandal. CreditCo decides to roll back to model V1. Why? Because that version didn't show this wrong behaviour. V1 models are still available in their Pachyderm repository, and it is enough to move the deploy production branch HEAD to the target commit in order to roll back to that specific version. The emergency is over, but CreditCo still needs to upgrade its models. This time, they've decided to deploy the new model in a staging environment, validate it, and only then push it forward to production. Seldon Deploy supports a variety of rollout strategies. I'm going to use a shadow approach; that is, incoming traffic is routed to both the main and the shadow model. But while the main model operates as usual, the shadow predictions are kept private, for internal use such as validation.
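The shadow approach can be sketched in a few lines: every request goes to both models, only the main model's answer is returned, and the shadow's answer is recorded for later validation. This is a conceptual illustration only. In the demo the routing is handled by Seldon Deploy at the infrastructure level, and the two toy models, their income thresholds, and the sample applicants below are invented.

```python
def serve_with_shadow(request, main_model, shadow_model, shadow_log):
    """Send the request to both models; return only the main model's answer."""
    main_prediction = main_model(request)
    try:
        shadow_log.append((request, main_prediction, shadow_model(request)))
    except Exception as exc:  # a failing shadow must never affect the caller
        shadow_log.append((request, main_prediction, exc))
    return main_prediction


# Hypothetical stand-ins for the production and staging models.
def model_v1(applicant):
    return applicant["income"] >= 30000


def model_v3(applicant):
    return applicant["income"] >= 25000


applicants = [{"income": 20000}, {"income": 27000}, {"income": 40000}]
log = []
answers = [serve_with_shadow(a, model_v1, model_v3, log) for a in applicants]

# Validation happens offline, on the private log of shadow predictions.
disagreements = [entry for entry in log if entry[1] != entry[2]]
print(answers, len(disagreements))
```

Customers only ever see the main model's decisions, so a faulty candidate can be caught from the log before it affects anyone.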

So I create a specific deployment service for staging models. I'm going to reuse the container created for the initial deployment. But this time, I'm running a different script that is in charge of simply adding a shadow model to an existing Seldon deployment. I proceed just by running through the usual steps. So this time I'm pushing V3 through the Pachyderm graph. This will create new models. And then I deploy them with the staging pipeline. I can double-check the deployed versions. And so I can see that the staging version actually matches the latest models, while the main production model is, well, it's been rolled back to the very first model training, right?

To fast forward a little bit, let's assume that the data team has validated the shadow model's behaviour, and the company is happy with it and wants to promote it to production. Now, Seldon has a specific button for doing so. But hold on a second, I'm just going to use Pachyderm and preserve full provenance. How do I do it? Well, by simply pointing the production deployment branch to staging. And, well, this will promote the model automatically. As the last step, I'll just double-check the production model version against the model repository. The commit hashes do match, so I'm happy with it. And we've reached the end of the demo.

That's it for this demo. We've seen how to manage and monitor ML deployments with confidence by using Pachyderm and Seldon Deploy. This notebook is publicly available on GitHub. Just check out the notes for further details or if you're interested in learning more about MLOps and Pachyderm, in general, or talking to me. My name is Enrico Rotundo. I'm with Winder Research. Check out Thank you very much.