Scaling ML Data Pipelines With Kubernetes: Case Study

The Challenge

At the LogMeIn AI Center of Excellence in Israel, the team handles large volumes of text, audio, and video that must be processed and labeled quickly so that the company’s data scientists can deliver machine learning capabilities across its product lines. “Our job at the AI hub is to bring the best-in-class ML models of, in our case, Speech Recognition and NLP,” said Eyal Heldenberg, Voice AI Product Manager at the LogMeIn AI Center of Excellence. “It became clearer that the ML cycle was not only training but also included lots of data preparation steps and iterations, and we were changing preparation logic quite often! That lack of parallelization and scale really hurt our ability to get datasets to our researchers so they could get to the real work of testing, training and building models for our products.”

Pachyderm reduced our data processing time from 7 weeks to just 7 hours.

“For example, one of our steps is a heavy processing of audio for sort of specific cleaning,” said Moshe Abramovitch, LogMeIn Data Science Engineer. “To process only one iteration of all our training data would sum up to seven weeks on the biggest compute machine AWS has to offer — and this is only one step. That means lots of unproductive time for the research team.”

“We had started to look for a parallel compute solution that would be friendly with our technology stack and knowledge: Docker and Kubernetes. We just wanted things to work without becoming experts in data pipelines.” That’s where Pachyderm came into the picture.

Why LogMeIn Chose Pachyderm

Speed and Parallelization

LogMeIn started with a small proof of concept (POC) and found that a data transformation that had taken seven to eight weeks finished in seven to ten hours with Pachyderm. LogMeIn’s research and business teams immediately saw the impact of Pachyderm’s speed and scale. “Our models are more accurate, and they are getting to production and to the customer’s hands much faster,” said Heldenberg. “Once you remove time-wasting, building block-like data preparation, the whole chain is affected by that. If we can go from weeks to hours processing data, it greatly affects everyone. This way we can focus on the fun stuff: the research, manipulating the models and making greater models and better models.”
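As a rough check on those figures, the quoted ranges can be converted to hours and compared. The sketch below is plain arithmetic on the numbers in the text, assuming wall-clock time and 24-hour days; the function name `speedup` is illustrative.

```python
HOURS_PER_WEEK = 7 * 24  # 168 hours of wall-clock time per week


def speedup(before_weeks: float, after_hours: float) -> float:
    """Return how many times faster the run is after parallelization."""
    return before_weeks * HOURS_PER_WEEK / after_hours


# Least favorable pairing of the quoted ranges: 7 weeks before, 10 hours after.
low = speedup(7, 10)
# Most favorable pairing: 8 weeks before, 7 hours after.
high = speedup(8, 7)

print(f"Speedup between {low:.0f}x and {high:.0f}x")
```

Under these assumptions the reduction works out to roughly 118x to 192x, which matches the "weeks to hours" description.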

With Pachyderm, LogMeIn has scaled its pipelines substantially: Pachyderm runs much of the work in parallel without the team having to rewrite its software to take advantage of that parallelism. Pachyderm handles the scaling and chunking for them.
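The idea of parallelizing without rewriting the processing code can be illustrated locally. The following is a minimal sketch, not Pachyderm's API and not LogMeIn's code: a hypothetical per-file function `clean_file` stays unaware of parallelism, while a separate runner splits the input into independent files and processes several at once. A thread pool stands in here for the containers that Pachyderm would schedule on Kubernetes.

```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path
import tempfile


def clean_file(src: Path, out_dir: Path) -> Path:
    """Hypothetical per-file step: stands in for the real audio cleaning."""
    dst = out_dir / src.name
    dst.write_text(src.read_text().strip().lower())
    return dst


def run_in_parallel(in_dir: Path, out_dir: Path, workers: int = 4) -> list[Path]:
    """Apply clean_file to every file in in_dir, several files at a time."""
    out_dir.mkdir(parents=True, exist_ok=True)
    files = sorted(p for p in in_dir.iterdir() if p.is_file())
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # Each file is an independent unit of work; map keeps the input order.
        return list(pool.map(lambda f: clean_file(f, out_dir), files))


# Small demonstration on throwaway files.
with tempfile.TemporaryDirectory() as tmp:
    src_dir, dst_dir = Path(tmp) / "in", Path(tmp) / "out"
    src_dir.mkdir()
    for i in range(8):
        (src_dir / f"clip_{i}.txt").write_text(f"  SAMPLE {i}  ")
    results = run_in_parallel(src_dir, dst_dir)
    cleaned = [p.read_text() for p in results]

print(cleaned[:2])
```

The design point is the separation: `clean_file` only knows how to handle one input, so the same function could run on one machine or be fanned out across many workers by whatever system does the chunking.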

“The largest pipeline that we ever ran is around 2,000 or 3,000 containers for a single pipeline,” said Abramovitch. “It’s something like 15 nodes, and each node has 96 CPUs.”
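For a sense of scale, the quoted cluster size can be turned into a back-of-the-envelope figure. This sketch uses only the approximate numbers from the quote and assumes, purely for illustration, that all containers run at the same time and share the CPUs evenly; the source does not state how they were actually scheduled.

```python
nodes = 15
cpus_per_node = 96
total_cpus = nodes * cpus_per_node  # 1,440 CPUs across the cluster

# Average CPU share per container if every container ran concurrently
# (an assumption made for this estimate only).
shares = {containers: total_cpus / containers for containers in (2000, 3000)}

for containers, share in shares.items():
    print(f"{containers} containers -> {share:.2f} CPUs each on average")
```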

Pachyderm’s parallelism helps us run the transformer at scale. Basically, there is no limit on how many datum transformers we can run at once: because Pachyderm runs on Kubernetes, we can scale up as much as we want.
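To make the notion of datums and parallel workers more concrete, here is a hedged sketch of what a Pachyderm pipeline specification might look like, built as a Python dictionary and printed as JSON. The field names follow Pachyderm's pipeline spec as commonly documented, but every value (pipeline name, repo name, image, script path, worker count) is a made-up placeholder and not LogMeIn's configuration.

```python
import json

# All names and values below are illustrative placeholders.
pipeline_spec = {
    "pipeline": {"name": "audio-clean"},
    "input": {
        "pfs": {
            "repo": "raw-audio",
            # "/*" treats each top-level file or directory as its own datum.
            "glob": "/*",
        }
    },
    "transform": {
        "image": "example.com/audio-clean:1.0",
        # Input data is mounted under /pfs/<repo>; results are written to /pfs/out.
        "cmd": ["python3", "/app/clean.py", "/pfs/raw-audio", "/pfs/out"],
    },
    # Number of workers that process datums concurrently.
    "parallelism_spec": {"constant": 8},
}

print(json.dumps(pipeline_spec, indent=2))
```

The glob pattern decides how the input is split into datums, and the parallelism setting decides how many workers share them, so scaling up is a configuration change and the command inside the container stays the same.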

Flexibility

Pachyderm also delivers tremendous flexibility because it’s agnostic to the tools data scientists need to get their work done right. LogMeIn uses different ML frameworks like TensorFlow and PyTorch, and also utilizes in-house and open-source toolkits like Kaldi. The LogMeIn team wrote its own pre-processing tools to adapt the data to the different frameworks.

Instead of spending months building a monstrous infrastructure, LogMeIn was up and running with Pachyderm in a few days and delivering real business impact within weeks, as the team reworked its pre-processing to take advantage of Pachyderm’s capabilities.

You need to work with your existing tools, your existing languages, your existing dependencies. You want to invest as little as possible in learning, right? You just need stuff to be processed. And since Pachyderm utilizes really flexible tools like Docker and Kubernetes, it’s very democratizing.

Business Impact

“Not everyone on the AI research team understands what Pachyderm does; they just know it’s fast and delivers what they need, when they need it,” he observed. “That’s a good thing because it lets the data science team focus on what it does best, doing research and training models, instead of focusing on the infrastructure. Everyone knows that Pachyderm is the processing framework, and it will just go fast.”

“The fact that we’re able to prepare our data so fast helped them to run a lot of training. Prior to using Pachyderm, we thought we’d never be able to execute those training sessions so fast. But because the data preparation process became so short, the research team was able to deliver much faster and create a lot of new models because of it.” When LogMeIn researchers come to the AI Center of Excellence team now, the team knows what to say: “We’ll just do it in Pachyderm.”

First of all, I would recommend you evaluate Pachyderm. I already recommend it to my friends.

LogMeIn uses Pachyderm to quickly process and label data for Data Science teams
