
How to Pick the Ideal Data Pipeline for Your AI/ML Workflow

As AI/ML moves into the mainstream, there’s no shortage of solutions trying to be the perfect, be-all and end-all machine learning toolkit. There is just one problem:

Almost all of them lack the flexibility that enterprise applications of machine learning need.

Data science is an emerging field. Most of its tools were created by researchers, not enterprise software architects, so they’re rough around the edges. You can download an amazing, cutting-edge open source library that rockets your model’s efficiency to new heights, but if your AI pipeline doesn’t support it, you’re out of luck. You’ll have to work around it or forget it.

Considering the Big End-to-End ML Platforms?

Cloud and SaaS-based solutions, like Databricks or Amazon’s SageMaker, have a fantastic array of underlying hardware and slick interfaces. They bring together a lot of know-how from tech unicorns that have some of the most highly skilled AI teams on the planet.

The biggest challenge with most SaaS solutions is that they’re inflexible, because the engineers writing their front and back ends have to make hard choices. What are the most popular frameworks? How fast can they support PyTorch and TensorFlow and all the little add-ons for both? When a new framework takes off, do they need to support that too? How about the 1000 or so libraries in Anaconda? What about an Anaconda alternative? Will all those get supported?

Catching Up with the ML Landscape at Light Speed

In the end, they’re playing the same losing game as the early search providers. Think Yahoo. Yahoo was the darling of the early Internet. While other search engines like AltaVista were overwhelmed with spam, Yahoo used human curators to pick the best of the web. That worked great when the web was a small place, but as it grew the system couldn’t scale. It’s no surprise that Yahoo was eventually outpaced by newer, hybrid approaches like Google’s PageRank, which let humans do what they do best (give meaning to information) and machines do what they do best (count links).

The same thing is happening in the AI/ML SaaS space. While those platforms have an early lead with developers, eventually they won’t scale. They’ll struggle to keep up with every new framework and library their users need. Forget about the small libraries that might be the cornerstone of your project. The SaaS coders won’t ever get to them.

Fragmentation in the Machine Learning Niche

Beyond the cloud and SaaS space, there are a number of early-stage MLOps tools that came out of the first efforts of tech companies to do AI/ML at scale. Those are tools like Luigi and Airflow. They have good mind-share and great documentation.

The biggest problem is that they’re convoluted to set up and incredibly rigid. If you don’t like Python, you’re in trouble with Airflow. It’s difficult to make the stages of the machine learning lifecycle work together seamlessly without a lot of coding.

There are lots of tools to choose from, and choosing the wrong one creates massive headaches over the long run. If you have to rip up your whole architecture and revamp it later, you’re in for a lot of pain as you try to port your data, your code and your models while keeping all the connections between them straight. Picking a tool that maximizes flexibility gives you the best shot at an MLOps engine that will stand the test of time.

Pachyderm Pipelines: the Elephant that Carries Your Data Science Workload with Ease

Inokyo builds checkout-less retail technology that helps brick and mortar retailers retrofit their existing stores with AI to compete with the likes of Amazon Go. Like many teams, they started writing their own pipelining framework before realizing they needed to focus on building machine learning models to detect people moving in the stores, not on writing AI infrastructure. They tried to switch to Airflow, but lost a month working with it before concluding it was too hard to make it work the way they wanted.

That is when Inokyo found Pachyderm. Pachyderm takes a different approach from both the SaaS providers and the more rigid, early-stage open source solutions. It focuses on one thing: flexibility.


Pachyderm’s team made good early decisions to leverage Docker and Kubernetes as the backbone of their data pipeline tools. Containers allow you to bring any framework, library or code you want to the table. It doesn’t matter if it’s a well-known framework or an obscure one just out of the research labs. If you can package it up in a container, you can use it in your data pipeline.

The pluggable nature of Pachyderm and its agnostic approach to tools also make all the difference in the world for teams trying to get a handle on complex design stages. In contrast to their month with Airflow, the Inokyo team had Pachyderm up and running in three days because of its ease of use, containerization, simplicity and code-agnostic approach to data science operations.

Pachyderm uses easy-to-learn YAML or JSON to define each stage of the pipeline. It uses a Git-like syntax, with repos, branches and all of the things coding and AI teams are familiar with already. All engineers need to do is create a repo, backed by an object store they’re already familiar with, like MinIO, Amazon S3, Google Cloud Storage, Azure Blob Storage or another provider’s object storage. Pachyderm automatically versions the data, keeps track of the entire history of changes, and lets users build multi-step, data-driven pipelines that leverage their versioned data.
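To make that concrete, here is a minimal sketch of what one stage definition might look like, built as a Python dictionary and serialized to JSON. The field names (pipeline, input.pfs, transform) follow the Pachyderm pipeline spec as we understand it, so check them against the documentation for your version. The repo name, pipeline name, image tag and script path are all made up for illustration.

```python
import json


def make_pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Build a single pipeline stage definition as a plain dict.

    Field names are assumed from the Pachyderm pipeline spec; verify
    them against the documentation for the version you run.
    """
    return {
        "pipeline": {"name": name},
        # The glob pattern controls how the input repo is split into units of work.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # The container image and the command to run inside it.
        "transform": {"image": image, "cmd": list(cmd)},
    }


# Hypothetical names, for illustration only.
spec = make_pipeline_spec(
    name="normalize-names",
    input_repo="raw-audio",
    image="ubuntu:22.04",
    cmd=["bash", "/scripts/normalize.sh"],
)
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

The printed JSON is the kind of file you would hand to Pachyderm when creating a pipeline. The same structure can be written as YAML if your team prefers it.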

Example: MP3 Audio Processing Pipeline

Let’s say an engineer wants to do something simple with a dataset of MP3 files. They want to standardize all the names, then convert the files to WAV so they can use WaveNet to study the audio for patterns. They could write a simple Bash script to strip spaces and convert uppercase to lowercase. They then put that Bash script in an Ubuntu container and tell the first part of the pipeline to call that script.

That’s it. They’re up and running.

Now the first stage runs and standardizes all the file names, moving them from an input repo with the raw data to an output repo.
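The example above uses a Bash script for this stage. As a rough equivalent, here is the same renaming logic sketched in Python. It assumes Pachyderm’s convention of mounting each input repo under /pfs/<repo name> and collecting output from /pfs/out; the repo name raw-audio is hypothetical.

```python
import shutil
from pathlib import Path


def normalize_name(filename: str) -> str:
    """Remove spaces and convert a file name to lowercase."""
    return filename.replace(" ", "").lower()


def normalize_directory(src: Path, dst: Path) -> list:
    """Copy every file in src into dst under its normalized name."""
    dst.mkdir(parents=True, exist_ok=True)
    written = []
    for path in sorted(src.iterdir()):
        if path.is_file():
            target = dst / normalize_name(path.name)
            shutil.copyfile(path, target)
            written.append(target.name)
    return written


if __name__ == "__main__":
    # Assumed mount points: /pfs/raw-audio is a hypothetical input repo,
    # /pfs/out is where the stage writes its results.
    src_dir = Path("/pfs/raw-audio")
    if src_dir.is_dir():
        normalize_directory(src_dir, Path("/pfs/out"))
```

One thing to watch for: two different source names can collapse to the same normalized name (for example "A b.mp3" and "ab.mp3"), and this sketch would silently overwrite the first. A production script would want to detect that.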

Next they roll out another Ubuntu container, this time with ffmpeg installed, and they write a little Python script to convert the MP3s to WAVs. Another bit of YAML and the second stage in their pipeline calls that script and the files get converted.

Now they’re ready to define stage three. They grab a PyTorch implementation of WaveNet and put it into a RHEL derivative container that’s got the latest Nvidia drivers baked in. They create a GPU node on Google Cloud for their cluster and tell Pachyderm to schedule the next stage on a GPU node. Now their model is training, crunching away at the WAV files and looking for patterns.
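That little Python script could look something like the sketch below. It shells out to the ffmpeg command-line tool, which must be installed in the container, and relies on ffmpeg picking the output format from the .wav extension. The input path /pfs/normalize-names is a hypothetical name for the first stage’s output repo.

```python
import subprocess
from pathlib import Path


def ffmpeg_command(src: Path, dst_dir: Path) -> list:
    """Build the ffmpeg invocation that converts one MP3 to a WAV file."""
    dst = dst_dir / (src.stem + ".wav")
    # -y overwrites an existing output file, -i names the input file.
    return ["ffmpeg", "-y", "-i", str(src), str(dst)]


def convert_all(src_dir: Path, dst_dir: Path) -> None:
    """Convert every .mp3 file in src_dir, stopping on the first failure."""
    dst_dir.mkdir(parents=True, exist_ok=True)
    for src in sorted(src_dir.glob("*.mp3")):
        subprocess.run(ffmpeg_command(src, dst_dir), check=True)


if __name__ == "__main__":
    # Hypothetical input repo name; /pfs/out is the stage's output location.
    src_dir = Path("/pfs/normalize-names")
    if src_dir.is_dir():
        convert_all(src_dir, Path("/pfs/out"))
```

Keeping the command construction in its own function makes the script easy to check without ffmpeg installed, since you can inspect the argument list before anything runs.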
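Here is one way to picture how the three stages hang together, sketched in Python. It assumes that each pipeline writes to an output repo named after the pipeline, so the next stage simply reads from the previous stage’s name. It also assumes that a GPU is requested through a resource_limits.gpu entry; that field shape is our recollection of the spec, so confirm it for your version. All stage names, image names and commands are hypothetical.

```python
def chain_stages(source_repo, stages):
    """Wire an ordered list of stages so each one reads the previous one's output."""
    specs = []
    upstream = source_repo
    for stage in stages:
        spec = {
            "pipeline": {"name": stage["name"]},
            "input": {"pfs": {"repo": upstream, "glob": stage.get("glob", "/*")}},
            "transform": {"image": stage["image"], "cmd": stage["cmd"]},
        }
        if stage.get("gpus"):
            # Assumed field shape for requesting GPUs; check the pipeline spec docs.
            spec["resource_limits"] = {
                "gpu": {"type": "nvidia.com/gpu", "number": stage["gpus"]}
            }
        specs.append(spec)
        # The next stage reads from the repo named after this pipeline.
        upstream = stage["name"]
    return specs


# Hypothetical stage names, images and commands.
audio_pipeline = chain_stages(
    "raw-audio",
    [
        {"name": "normalize-names", "image": "ubuntu:22.04",
         "cmd": ["bash", "/scripts/normalize.sh"]},
        {"name": "mp3-to-wav", "image": "example/ubuntu-ffmpeg",
         "cmd": ["python3", "/scripts/convert.py"]},
        {"name": "train-wavenet", "image": "example/wavenet-gpu",
         "cmd": ["python3", "/scripts/train.py"], "glob": "/", "gpus": 1},
    ],
)
for stage_spec in audio_pipeline:
    print(stage_spec["pipeline"]["name"], "reads from", stage_spec["input"]["pfs"]["repo"])
```

The training stage uses a glob of "/" in this sketch so that the whole dataset is treated as one unit of work, while the two file-by-file stages keep the default "/*".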

The Lego-brick approach of Pachyderm makes it simple to get up and running, and just as simple to use whatever tools the data scientists want in their workflow. You define the workflow with Pachyderm, not the other way around, unlike SaaS providers who rigidly dictate each step.

Enterprise NLP with Pachyderm

LogMeIn’s NLP team had a similar experience with Pachyderm. After struggling to build their own pipeline engine, they found it was taking seven weeks to process one chunk of audio with the biggest and most expensive instance they could spin up on AWS. With Pachyderm, they wrote some Python and Bash scripts to pre-process the data and Pachyderm did the heavy lifting, automatically splitting up the data and striping it across multiple containers to process in parallel.
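The split-and-process-in-parallel idea can be sketched in a few lines of Python. This is only an illustration of the pattern, not how Pachyderm implements it: Pachyderm distributes work across containers in a cluster, while this sketch uses threads on one machine, and the process_chunk function is a placeholder for real audio processing.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_workers):
    """Deal items round-robin into n_workers lists."""
    return [items[i::n_workers] for i in range(n_workers)]


def process_chunk(chunk):
    """Placeholder work; a real stage would transcribe or featurize audio."""
    return [f"processed:{name}" for name in chunk]


def process_in_parallel(items, n_workers=4):
    """Process each chunk on its own worker, then merge the results."""
    chunks = split_into_chunks(items, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        results = list(pool.map(process_chunk, chunks))
    return sorted(name for chunk in results for name in chunk)


print(process_in_parallel(["a.wav", "b.wav", "c.wav"], n_workers=2))
```

Because each chunk is independent, adding workers shortens the wall-clock time until something else (storage, network, the slowest chunk) becomes the bottleneck.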

LogMeIn cut their natural language processing time from seven weeks to seven hours, roughly a 168-fold reduction.

When it comes to picking the right AI/ML pipeline tool, the key is simplicity and flexibility. By choosing a tool-agnostic approach, Pachyderm lets data scientists and MLOps engineers build their own workflow exactly as they want to build it, with exactly the tools they want to use.

In the next decade, we’ll likely see more and more of the ML toolset get standardized. But that takes time, and in today’s Wild West of MLOps, with new and exciting software coming out almost daily, a data science team needs the choice to bring whatever tool they want to the battle.

That’s what separates the MLOps teams that push projects to production from the 87% of projects that never make it out of the testing phase, wasting time, money and precious resources.

But if you let flexibility define your data pipeline tools, you’re starting from a position of power that lets you build whatever your team can imagine.