Reproducible Data Pipelines For Analyst Teams: Case Study

In one day, we ended up automating the entire process all the way to the reporting. The data preprocessing, the hyperparameter search, model tuning, model selection, and even some of the inference testing were all automated using Pachyderm.

The Challenge

To deliver highly effective machine learning solutions to its customers, Digital Reasoning must process large volumes of disparate, disorganized, and seemingly unrelated information that constantly changes. Jimmy Whitaker, Manager of Applied Research, and his team use this data to develop complex models that detect key patterns and information in the sprawling array of intangible connections between people, places, and events. The team must constantly balance agility at scale against the communication overhead of collaborating with clients. They need an architecture that can deliver machine learning models that are both explainable and easily reproducible.

The Solution

During an internal hackathon, Whitaker, a group of Digital Reasoning developers, and an intern set out to build the next-generation architecture for the company’s deep learning workloads. The team sought to accomplish two goals: (1) find a way to make their constantly changing data behave in the same version-controlled manner as code and (2) use the latest scalable infrastructure possible. The Digital Reasoning team selected Kubernetes to address scalable infrastructure. For its data science platform, it chose Pachyderm.

Using Pachyderm and Kubernetes, the team built scalable, repeatable, and explainable data science pipelines in just one day. Pachyderm enables the team to continuously ingest its constantly changing data end-to-end — with complete provenance and without sacrificing agility.

The Impact

While the team initially set out to build just one pipeline, by the end of the hackathon they had multiple pipelines set up for different use cases. For their audio research use case, they built an end-to-end pipeline that analyzes audio files through the transcription process, outputs the transcripts, and then applies some of the natural language processing components they were working on to those transcripts, continuing all the way through to inference testing. Because things were going so smoothly, they even expanded into building pipelines for image analysis. And it didn’t end there.
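The chained audio pipeline described above can be sketched as a series of Pachyderm pipeline specs, where each stage reads from the output repo of the stage before it. This is a minimal illustration, not Digital Reasoning's actual configuration: the repo name "audio", the stage names, the container images, and the script paths are all hypothetical. Only the general spec shape (pipeline name, a pfs input with a repo and glob, and a transform with an image and command) follows Pachyderm's pipeline spec format.

```python
import json


def pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Build a dict in the general shape of a Pachyderm pipeline spec."""
    return {
        "pipeline": {"name": name},
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        "transform": {"image": image, "cmd": cmd},
    }


def chain(stages, source_repo):
    """Wire stages so each one reads the previous stage's output repo."""
    specs = []
    upstream = source_repo
    for name, image, cmd in stages:
        specs.append(pipeline_spec(name, upstream, image, cmd))
        # A pipeline writes to an output repo with the same name as the pipeline.
        upstream = name
    return specs


# Hypothetical stages mirroring the narrative: transcription, NLP, inference testing.
STAGES = [
    ("transcribe", "example/transcribe:latest", ["python3", "/app/transcribe.py"]),
    ("nlp", "example/nlp:latest", ["python3", "/app/nlp.py"]),
    ("inference-test", "example/inference:latest", ["python3", "/app/test.py"]),
]

specs = chain(STAGES, source_repo="audio")

if __name__ == "__main__":
    for spec in specs:
        print(json.dumps(spec, indent=2))
```

Because every stage declares only its upstream repo, adding or swapping a stage means editing one entry in the list, and new data committed to the source repo flows through the whole chain.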

Whitaker’s team took the new architecture a step further, integrating Jupyter notebooks into the process so its research engineers and data scientists could easily apply changes to any point of their pipeline and watch the impact in real time as Pachyderm automatically implemented those changes. “Pachyderm allows us to look at and component-ize the entire pipeline of analytics and transformations we run. For complex systems, this is incredibly useful to understand the big picture before jumping into the code. We can then easily dive into a specific component to address the needs of a project we are working on,” says Whitaker.

Pachyderm helped Digital Reasoning’s data scientists unearth new insights and carry out rapid experimentation without sacrificing speed or functionality. Pachyderm enables the company’s data scientists to efficiently overcome obstacles, handle data divergence, and generate reproducible outputs. With billions of dollars, and even lives, at stake, Digital Reasoning is always on the lookout for new ways to use the best available architecture and platforms to build accurate models that inform its clients’ decisions.

There’s always a tension between agility, interpretability, and reproducibility, but Pachyderm makes that tension manageable.
