Automating ML Pipelines With Data Lineage: Case Study

The Challenge

Mauricio Borgen and Charles Pepe-Ranney, members of the computational biology team at AgBiome, needed a way to improve the company’s approach to genomic data science. With a rapidly growing number of new microbes coming in from numerous sources and limited infrastructure to support their research processes, Mauricio and Charles realized that a new approach was necessary. To meet the needs of the business, the team needed to start transitioning their manual processes into automated and repeatable pipelines that could scale.

Instead of innovating and using technology as a force multiplier for the business, the team spent much of its time juggling scripts and analyses. “We had a lot of artisanally crafted, one-off ad-hoc analysis. That doesn’t scale,” said Mauricio. He and Charles were therefore determined to transform their custom, heavily manual workflows into iterative, easy-to-assemble pipelines that could scale with the business.

Pachyderm helps us convert our existing data science pipelines from manually managed scripts to scalable, repeatable end-to-end workflows, enabling us to focus more on developing transformative technology to drive agriculture forward instead of wrangling infrastructure.

Charles learned about Pachyderm after hearing Daniel Whitenack, a Pachyderm employee, on the Data Skeptic podcast and watching Daniel’s many conference talks on YouTube. He saw its potential to refashion the team’s workflows and get data to bench scientists sooner. After exploring various options, the team selected Pachyderm as the most effective platform they found for delivering standardized, end-to-end data science pipelines. It could scale on any infrastructure while giving the team the flexibility to adopt new frameworks and languages easily via Docker containers.

Why AgBiome Chose Pachyderm

Efficiency

Manually managing genome analysis for an ever-expanding collection of microbial genomes, which already numbers in the tens of thousands, meant that a lot of time and resources went to wrangling infrastructure instead of the team’s core expertise, data science. The team had to track down, and often produce ad hoc, the data needed for each individual analysis. Pachyderm enables AgBiome to configure repeatable yet modular pipelines that leverage Docker® containers. This means the team can standardize aspects of a pipeline and build prefabricated workflows that run automatically as new data is added to the system.
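
As a rough illustration of what such a containerized, data-triggered pipeline can look like, the sketch below builds a pipeline definition in the general shape of a Pachyderm pipeline specification: a name, an input data repository with a glob pattern, and a Docker image plus command to run. It is a minimal sketch, not AgBiome's actual configuration. The pipeline name, repository name, image tag, and script path are all hypothetical placeholders chosen for this example.

```python
import json


def build_pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Return a dict shaped like a Pachyderm pipeline specification.

    All argument values used below are hypothetical; they are not taken
    from the case study.
    """
    return {
        "pipeline": {"name": name},
        # The glob pattern controls how input data is split into units of work.
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
        # The transform names the Docker image and the command run inside it.
        "transform": {"image": image, "cmd": cmd},
    }


spec = build_pipeline_spec(
    name="annotate-genomes",            # hypothetical pipeline name
    input_repo="raw-genomes",           # hypothetical input data repository
    image="example/annotator:1.0",      # hypothetical Docker image
    cmd=["python3", "/code/annotate.py", "/pfs/raw-genomes", "/pfs/out"],
)

# Serialize to JSON so the definition can be saved to a file and reviewed.
spec_json = json.dumps(spec, indent=2)
print(spec_json)
```

Because the analysis step is only an image and a command, swapping in a tool written in another language should mean changing those two fields rather than the pipeline structure.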

Flexibility

The computational biology team at AgBiome supports around 70 scientists, each with unique requirements and preferences. Mauricio and Charles wanted to streamline these environments without forcing everyone to conform to a single language. Because Pachyderm leverages Docker® containers and is therefore language- and framework-agnostic, data scientists have the flexibility to choose the right tool for the job without adding complexity.

Provenance

Apart from processing large amounts of data, AgBiome also needs to keep its results reproducible. Pachyderm was a natural fit given its ability to version-control data, much as Git does for code. This gives AgBiome the ability to track the state of its data over time, backtest models on historical data, share data with teammates, and revert to previous states of the data.
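
To make the Git analogy concrete, the toy example below shows the general idea of versioning data as a chain of immutable commits that can be checked out later. It is a simplified, in-memory sketch for illustration only and is not how Pachyderm is implemented. The class name, method names, and file names are hypothetical.

```python
import hashlib
import json


class VersionedStore:
    """Toy in-memory store that keeps every committed snapshot of the data."""

    def __init__(self):
        # Each entry is (commit_id, parent_id, snapshot of files).
        self.commits = []

    def commit(self, files):
        """Record a snapshot and return an ID derived from its content and parent."""
        parent = self.commits[-1][0] if self.commits else None
        payload = json.dumps({"parent": parent, "files": files}, sort_keys=True)
        commit_id = hashlib.sha256(payload.encode()).hexdigest()[:12]
        self.commits.append((commit_id, parent, dict(files)))
        return commit_id

    def checkout(self, commit_id):
        """Return a copy of the data exactly as it was at the given commit."""
        for cid, _parent, files in self.commits:
            if cid == commit_id:
                return dict(files)
        raise KeyError(commit_id)


store = VersionedStore()
v1 = store.commit({"genome_a.fasta": "ACGT"})
v2 = store.commit({"genome_a.fasta": "ACGT", "genome_b.fasta": "GGCC"})

# An earlier state stays retrievable after new data arrives.
old_state = store.checkout(v1)
new_state = store.checkout(v2)
```

Being able to retrieve the exact inputs behind an earlier result is what makes backtesting and reverting possible in this model.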

The Results

AgBiome is leading the way in the next agricultural revolution with the most innovative, unmatched use of the plant microbiome at scale. To keep up with that pace, they look to Pachyderm to automate their genomics pipelines and convert their existing pipelines from manually managed scripts to scalable, repeatable end-to-end workflows. Their teams can now focus more on developing transformative technology to drive agriculture forward instead of wrangling infrastructure.

Finding the Next-Generation of Plant Microbiomes