Accelerating NLP With Parallelization: Case Study

The Challenge

The biggest challenge for the Speech Technologies team was productionizing and scaling a working prototype. The initial prototype consisted of monolithic scripts that processed their clients’ data linearly. Each script bundled many discrete steps, so when something went wrong it was difficult to debug: tracking down which step had failed, and why, took time.

Today, our workload runs in under a day due to incremental processing, thanks to Pachyderm. We were able to push out more models by training and serving them in parallel.

The monolithic approach was also inefficient, because the team had to pre-allocate resources for the entire pipeline even though not all stages needed the same resources. Some steps were memory intensive, others were CPU heavy, and still others were disk intensive. Because everything ran as a single big script, the team was stuck reserving resources that could have been used for other tasks. They wanted a system that offered a quick path from prototype to production, scaled easily, rightsized containers for the workload each one was running, and released those resources as soon as the task finished.

Linear processing also limited how far the system could scale: reprocessing large amounts of data could take weeks to finish. The team needed a technology that would parallelize their data processing and let it scale well beyond those limits, and that’s where Pachyderm came into the picture.

Why Fraunhofer IAIS Chose Pachyderm

Simplicity

Pachyderm delivered a simple way to move from prototype into production, while providing mass parallelization to the Fraunhofer IAIS Speech Technologies team.

Pachyderm allocated only the resources each stage in the pipeline actually required. Because Pachyderm runs on Kubernetes, it can quickly scale up one stage in the pipeline, creating new containers with exactly the right resources, and then release them as soon as that stage finishes.
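As a rough illustration of per-stage sizing, the sketch below builds two pipeline specifications in Python and prints one as JSON. It assumes the field names used in Pachyderm's pipeline specification (`resource_requests`, `parallelism_spec`, `input.pfs.glob`). The helper `stage_spec`, the stage names, images, commands, and resource figures are all hypothetical and are not taken from the Fraunhofer IAIS setup.

```python
import json


def stage_spec(name, input_repo, image, cmd, memory, cpu, workers):
    """Build one pipeline specification as a plain dict (hypothetical helper)."""
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        # The glob pattern controls how input data is split into units of work.
        "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
        "parallelism_spec": {"constant": workers},
        "resource_requests": {"memory": memory, "cpu": cpu},
    }


# Hypothetical memory-heavy stage: few workers, plenty of RAM for each.
align_audio = stage_spec(
    name="align-audio",
    input_repo="raw-audio",
    image="example/aligner:1.0",
    cmd=["python3", "/code/align.py"],
    memory="16Gi",
    cpu=1,
    workers=2,
)

# Hypothetical CPU-heavy stage: many workers, little RAM for each.
tokenize_text = stage_spec(
    name="tokenize-text",
    input_repo="raw-text",
    image="example/tokenizer:1.0",
    cmd=["python3", "/code/tokenize.py"],
    memory="512Mi",
    cpu=4,
    workers=8,
)

print(json.dumps(align_audio, indent=2))
```

Because each stage carries its own request, a memory-hungry step no longer forces every other step to reserve the same amount.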

Pachyderm also parallelized the data processing itself, slicing the text and audio data into chunks. That let the scripts work on different parts of the data at the same time, which cut processing time from days or weeks to a few hours. Now some customers get updated models daily, as the system swiftly pulls in new data, trains on it, and delivers a newly updated model. The team doesn’t have to think about how to recreate all the steps correctly. They can put data in one end of the pipeline and get a model out the other end. If anything goes wrong, they can debug a single pipeline stage in isolation, because each stage’s inputs and outputs are well defined in its Pachyderm JSON pipeline specification.
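The chunking idea can be sketched in a few lines of plain Python. This is not Pachyderm's implementation: a thread pool stands in for the separate containers, and `normalize_chunk` is a hypothetical placeholder for a real processing step.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, chunk_size):
    """Slice a list into consecutive chunks of at most chunk_size items."""
    return [items[i:i + chunk_size] for i in range(0, len(items), chunk_size)]


def normalize_chunk(chunk):
    """Placeholder processing step: trim and lowercase every line."""
    return [line.strip().lower() for line in chunk]


def process_in_parallel(items, chunk_size=2, workers=4):
    """Process each chunk independently, then stitch the results together."""
    chunks = split_into_chunks(items, chunk_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() returns results in the same order as the input chunks.
        results = list(pool.map(normalize_chunk, chunks))
    return [line for chunk in results for line in chunk]


transcripts = ["  Hello World ", "Guten TAG", " Speech  ", "NLP"]
processed = process_in_parallel(transcripts)
print(processed)
```

Since no chunk depends on another, adding workers shortens the wall-clock time without changing the result.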

The Pachyderm platform also gave them tremendous flexibility. Too often, platforms lock you into the limited set of tools and languages they support, perhaps only Python or R. Pachyderm supports any tool, framework, or library you can package into a container. That let the Speech Technologies group build multistage pipelines in different languages, using Bash in one stage, Python in another, and C/C++ in a third, each language doing what it does best.
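A multistage, multi-language pipeline of this kind might be described as follows. The sketch assumes Pachyderm's convention that a pipeline writes to an output repo with the same name as the pipeline, so each stage can read from the previous one. The stage names, images, and commands are hypothetical examples, not the team's actual stages.

```python
import json

# Hypothetical stages: (pipeline name, container image, command to run).
STAGES = [
    ("clean-text", "example/bash-tools:1.0", ["bash", "/code/clean.sh"]),
    ("extract-features", "example/python-nlp:1.0", ["python3", "/code/features.py"]),
    ("train-model", "example/cpp-trainer:1.0", ["/code/train"]),
]


def chain_specs(source_repo, stages):
    """Return one spec per stage, each reading the previous stage's output."""
    specs = []
    input_repo = source_repo
    for name, image, cmd in stages:
        specs.append({
            "pipeline": {"name": name},
            "transform": {"image": image, "cmd": cmd},
            "input": {"pfs": {"repo": input_repo, "glob": "/*"}},
        })
        # The next stage consumes this stage's output repo.
        input_repo = name
    return specs


specs = chain_specs("raw-text", STAGES)
for spec in specs:
    print(json.dumps(spec))
```

Each stage only needs a container image and a command, so the language used inside the container is invisible to the rest of the pipeline.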

Conclusion

The Fraunhofer IAIS Speech Technologies team likes Pachyderm’s data-driven approach. For them, it is a paradigm shift in how to automatically process large amounts of data.

Pachyderm’s data-first approach fits the new needs of machine learning engineers everywhere. In the end, it’s Pachyderm’s powerful data-driven pipelines and mass parallelization that help advanced research teams adapt quickly to the new realities of data-driven software development.

The data-centric approach of the platform was a game changer for us.

Fraunhofer uses Pachyderm to scale NLP models that understand intricate medical terminology or complex legal jargon.
