Advancing Science With ML At Scale: Case Study

The Challenge: Scale Existing Data Infrastructure

Within their plasma experimentation program, in which large devices called plasma injectors heat hydrogen gas to millions of degrees, General Fusion has collected a large set of complex data from thousands of sensors. While the conditions in which this information was captured may sound extreme, the challenges associated with managing this data, scaling data processing capabilities, and sharing scientific results are similar to those of other modern technology companies.

General Fusion had outgrown its existing data infrastructure. “We needed to evolve our data systems to match our increased analysis and collaboration efforts, and we wanted to leverage well-supported, off-the-shelf technologies that could move, scale and adapt as we grow our data and our company,” said Brendan Cassidy, General Fusion’s Open Innovation Manager. They began the search for an infrastructure provider who could meet their data storage, processing and sharing needs while adhering to two important criteria:

Augment (not “rip and replace”) General Fusion’s existing experimental and analysis workflows
Facilitate collaboration with external scientific partners through seamless, ad hoc sharing of large sets of experimental data

“We found a limited set of options based on our requirements and pretty quickly narrowed it down to Pachyderm because of its ability to store arbitrary amounts of unstructured data and to scale to meet computational demand.”

Using Pachyderm as the Data Foundation for Scientific Collaboration

While programmers use version control systems such as Git to manage and collaborate on a shared codebase, an additional level of complexity exists for data scientists who work with both code and data. Pachyderm enables data science teams to develop reproducible and distributed data workflows without interfering with each other’s analysis.

“The true tipping point in our decision to use Pachyderm was its version control features for managing our data,” said Jonathan. “Our researchers no longer have to copy data locally or worry about a calibration update changing the underlying data while they’re analyzing it.”

Data in Pachyderm is versioned much as code is managed in Git. It is organized into repositories where users can create commits (immutable snapshots), view diffs, and add or manipulate files as in any standard file system.
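To make the idea of commits and diffs concrete, here is a minimal sketch of a versioned data repository in plain Python. It is a toy model written for illustration only and is not Pachyderm's API: the `ToyRepo` and `Commit` names, the file names, and the calibration values are all hypothetical.

```python
import hashlib
from dataclasses import dataclass
from types import MappingProxyType
from typing import Mapping, Optional


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot: file path -> hash of that file's content."""
    id: str
    parent: Optional[str]
    files: Mapping[str, str]


class ToyRepo:
    """Hypothetical in-memory repository (illustration only, not Pachyderm)."""

    def __init__(self):
        self.commits = {}
        self.head = None

    def commit(self, files):
        # Hash each file's bytes so identical content always gets the same hash.
        hashes = {path: hashlib.sha256(data).hexdigest()
                  for path, data in sorted(files.items())}
        # The commit id depends on the parent and on the snapshot contents.
        payload = repr((self.head, sorted(hashes.items()))).encode()
        commit_id = hashlib.sha256(payload).hexdigest()[:12]
        # MappingProxyType gives a read-only view, so a snapshot cannot be edited.
        snapshot = Commit(commit_id, self.head, MappingProxyType(hashes))
        self.commits[commit_id] = snapshot
        self.head = commit_id
        return snapshot

    def diff(self, old_id, new_id):
        old, new = self.commits[old_id].files, self.commits[new_id].files
        return {
            "added": sorted(new.keys() - old.keys()),
            "removed": sorted(old.keys() - new.keys()),
            "changed": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
        }


repo = ToyRepo()
raw = repo.commit({
    "shot_001.csv": b"1,2,3",
    "calibration.json": b'{"gain": 1.0}',
})
recal = repo.commit({
    "shot_001.csv": b"1,2,3",
    "calibration.json": b'{"gain": 1.1}',
    "shot_002.csv": b"4,5,6",
})
print(repo.diff(raw.id, recal.id))
# {'added': ['shot_002.csv'], 'removed': [], 'changed': ['calibration.json']}
```

Because the first commit is never modified, an analysis pinned to `raw.id` keeps seeing the original calibration even after the update lands in a later commit.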

Pachyderm also provides complete data lineage (also known as provenance) for every piece of data throughout the cluster. Every data transformation is tracked, so any result can be reproduced and verified, an important consideration for any organization that relies on accurate analysis.
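One way to picture lineage tracking is as a graph in which each dataset version records the step that produced it and the versions it was computed from. The sketch below is a simplified illustration under that assumption, not a description of how Pachyderm stores provenance; the `Record` type, the dataset ids, and the step names are hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Record:
    """One tracked dataset version and what produced it."""
    transform: str            # step that produced it ("" for raw inputs)
    inputs: Tuple[str, ...]   # ids of the dataset versions it was computed from


def provenance(records: Dict[str, Record], target: str) -> List[str]:
    """Return every upstream dataset id that `target` depends on, nearest first."""
    seen = set()
    order = []
    queue = deque(records[target].inputs)
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # a shared ancestor is reported only once
        seen.add(current)
        order.append(current)
        queue.extend(records[current].inputs)
    return order


# Hypothetical lineage: raw sensor data and a calibration file feed a
# calibration step, whose output feeds a spectral analysis step.
records = {
    "raw@c1": Record("", ()),
    "calibration@c7": Record("", ()),
    "calibrated@c3": Record("apply_calibration:v2", ("raw@c1", "calibration@c7")),
    "spectra@c5": Record("fft:v1", ("calibrated@c3",)),
}

print(provenance(records, "spectra@c5"))
# ['calibrated@c3', 'raw@c1', 'calibration@c7']
```

With the exact input versions and step names recorded for every output, rerunning the same steps on the same inputs is what makes a result reproducible and checkable.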

Conclusion

With Pachyderm, the General Fusion team can stay focused on plasma physics instead of designing and maintaining big data systems. The combination of language-agnostic infrastructure and version-controlled data allows them to efficiently develop and iterate on their data analysis.

Pachyderm is committed to bringing a new paradigm of data infrastructure to the big data community through its open source platform and professional services. “What surprised us the most about our new infrastructure was the value the Pachyderm team brought to our deployment,” said Brendan. “Pachyderm developers have been committed to helping us transition to our new system and to adding functionality to meet our needs.” Pachyderm combines advanced technology for modern data science challenges with flexible support services for efficient deployment.
