
Actionable Medical Insights from Terabytes of Clinical Data

A large healthcare provider uses Pachyderm and machine learning to glean actionable medical insights from millions of member records and terabytes of clinical data. Pachyderm is core to their strategy of modernizing patient records. With this modernization, data science teams can create ML models that improve patient diagnosis and treatment plans, leading to better outcomes.

The provider chose Pachyderm because of its ability to scale to handle massive amounts of changing unstructured data and its capacity to provide audit trails and reproducibility automatically.

Key Benefits

  • Shrinks processing and storage requirements by 90%
  • Increases scalability and speed with automatic parallel processing
  • Ensures reproducibility with immutable data lineage
  • Abstracts away automation so AI teams only need to care about data inputs and outputs

We really understood the value of Pachyderm when I realized I just needed to understand where the data inputs were and then Pachyderm magically did all the rest.

Business Challenge:
Simplifying the Data Delivery Pipeline

One of the top for-profit managed healthcare providers, with affiliate plans covering one in eight Americans for medical care, has the mission to be the most innovative, valuable and inclusive partner in health benefits. Given their mission, it’s no surprise that they have a dedicated AI team looking to leverage cutting-edge AI to harvest long-term insights and make much more detailed health predictions from claims and electronic health record data.

Technical Challenge:
Massive Unstructured Data Sets

That data store is massive, with more than 50 terabytes of data covering the company’s tens of millions of members across the U.S. They’re mining this data to determine treatment efficacy based on past outcomes given particular patient characteristics. Ultimately, this could allow a provider to sit down with a patient and discuss, out of dozens and dozens of different possible treatments, the best options for that patient’s specific situation by matching them to a similar cohort.

“This is a hard challenge because there’s so much data, so many different characteristics and so many possible combinations,” says an engineering lead at this top US-based healthcare company. “But it’s exciting, because once you get these insights into the hands of providers, it’s revolutionary. And that’s part of the joy of data engineering: taking a problem that is really big and doing the scalability optimizations needed to solve it really fast.” Getting these potential insights into the hands of healthcare providers is where the challenge comes in. It’s one thing to have small-scale implementations working in a lab; it’s another to deliver machine learning at scale.

The power of Pachyderm was its ability to work with any data. Our data has many nuances and there are so many combinations. The joy as a data engineer was setting this up once and letting Pachyderm take care of it automatically.

Technical Challenge:
Complexity of Pipelines

When the engineering lead joined the AI team, it had a very complicated data delivery pipeline based on Apache Airflow. While it worked, it wouldn’t scale beyond a single pipeline or container instance at a time. As the team looked to solve those scaling issues, it also began to button up its production approach, ensuring that data was versioned and that it had immutable lineage.

Pachyderm delivered the parallelism and data handling required to efficiently scale the AI team’s ML processing. Importantly, while the company had millions of patient records, only a small subset were relevant at any given time, and Pachyderm’s incrementality saved significant time, money and resources by only processing the subset of data that had changed, rather than the entire patient universe.

Specifically, the AI team has a wealth of rich information about each member, but that data is all represented as a gigantic table. The AI team had to process the entire table for every single use case, regardless of whether those member records were relevant. Obviously, this slowed the pipeline and wasted tremendous processing power.

With Pachyderm, the team was able to arbitrarily partition table data to only capture events for a single member – effectively creating individual member objects that encapsulate all the events for a particular member. Pachyderm not only processed these records in parallel, it also automatically processed only those containing new information, increasing both scale and speed while reducing costs.
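As a rough illustration of this pattern, the sketch below creates such a pipeline with the python_pachyderm client (v6-style API). The repo, image, and script names are hypothetical, not the provider’s actual setup; the key detail is the glob pattern "/*", which makes each top-level member object its own datum, so Pachyderm fans datums out in parallel and, on subsequent runs, skips any datum whose data is unchanged.

```python
# Minimal sketch, not the provider's actual pipeline: repo, image, and
# command are placeholders. Assumes the python_pachyderm client (v6-style API).
import python_pachyderm

client = python_pachyderm.Client()

client.create_pipeline(
    "member-insights",
    transform=python_pachyderm.Transform(
        cmd=["python3", "/app/process_member.py"],
        image="example/member-insights:latest",
    ),
    # glob="/*" turns each top-level member object into an independent datum:
    # datums are processed in parallel, and later runs reprocess only the
    # datums whose underlying data changed.
    input=python_pachyderm.Input(
        pfs=python_pachyderm.PFSInput(repo="members", glob="/*")
    ),
)
```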

My initial question with Pachyderm was, how would it handle this table? The epiphany was that Pachyderm allowed us to completely shift how we thought about processing our data.

Technical Challenge:
Data Versioning of Large Data Sets

“Build reproducibility is one of those things that’s important like 5% of the time, but in that moment, it becomes critical,” notes the engineering lead. “Being able to take a buggy component or build and rewind to determine the cause is extremely valuable, and that’s exactly what Pachyderm gave us. I would have been completely lost without it.” The AI team had looked at tools like DVC, but they just didn’t compare to the promise of Pachyderm.
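As a rough sketch of that rewind workflow (the repo name and the python_pachyderm v6-style API are assumptions), the snippet below lists recent commits on a pipeline’s output repo and inspects a suspect one, whose provenance points back to the input commits that produced it:

```python
# Hypothetical sketch of investigating a buggy build via version history.
import python_pachyderm

client = python_pachyderm.Client()

# Walk recent commits on the pipeline's output repo, newest first.
for commit_info in client.list_commit("member-insights", number=10):
    print(commit_info.commit.id, commit_info.description)

# Inspect a suspect commit; its provenance records the input commits
# (i.e., the exact data versions) that produced it.
suspect = client.inspect_commit("member-insights/<commit-id>")
print(suspect.provenance)
```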

Conceptually, DVC worked well, but as I looked closer at the framework, it felt like it was for smaller, more research-oriented projects. I didn’t see how DVC would get us to a production environment.

The Results
Efficiency and Scale for Healthcare Insights

Since starting with Pachyderm, this top healthcare provider has seen a significant improvement in its processing efficiency. The AI team runs its pipelines on a weekly basis to accommodate updates to the source data; originally this meant processing the entire two-terabyte table each time. With Pachyderm’s incremental processing, only the member records that have changed are reprocessed, shrinking processing and storage requirements by 90%.

The AI team also really appreciates Pachyderm’s data lineage capabilities. The team’s original approach was to create a metadata file that was referenced for each insight, so they could trace provenance when needed. “But it was really brittle,” notes the engineering lead. “If we changed a path, we’d lose the lineage. With Pachyderm, it was as simple as adding the job and commit IDs to the insight. They’re not tied to any specific path, there’s no security issue, and the team can easily interpret them. I love that, it’s a huge win.”
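A minimal sketch of that tagging pattern, assuming the lineage environment variables Pachyderm injects into pipeline containers (PACH_JOB_ID and PACH_OUTPUT_COMMIT_ID); the insight payload and output path here are made up:

```python
# Hypothetical sketch: stamp each insight with Pachyderm lineage IDs so it
# can be traced to the pipeline run and data version that produced it,
# independent of any file path.
import json
import os

def write_insight(insight: dict, path: str) -> None:
    insight["lineage"] = {
        "job_id": os.environ.get("PACH_JOB_ID", "local-dev"),
        "commit_id": os.environ.get("PACH_OUTPUT_COMMIT_ID", "local-dev"),
    }
    with open(path, "w") as f:
        json.dump(insight, f)

write_insight({"member": "m-123", "recommendation": "plan-a"}, "/pfs/out/m-123.json")
```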

While the AI team is working to get its first insights into the hands of providers, it’s already actively planning how to use the Pachyderm-based system to analyze treatments for a host of additional medical conditions. As part of that, the AI team has found Pachyderm very receptive to discussing new features and integrations that will further improve their efficiency.

Other data science, data engineering and DevOps teams within the healthcare provider are also taking notice, and the engineering lead finds that Pachyderm’s design facilitates this broader adoption.

“One of my first observations with Pachyderm was what matters most in the pipeline is that the right data shows up at the right place and time,” he notes. “That level of abstraction allows our teams to do 90% of development without even deploying anything to Pachyderm at all. When we get to production, all we do is change the pipeline path and everything runs just fine. This philosophy of creating a container at the last moment really protects the data scientists from having to deal directly with the infrastructure.” 
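A rough sketch of that last-moment-container pattern (the paths, file format, and environment-variable convention are all hypothetical): the processing code only reads an input directory and writes an output directory, defaulting to local scratch folders, so the same code runs on a laptop or inside Pachyderm, where inputs mount at /pfs/<repo> and outputs go to /pfs/out.

```python
# Hypothetical sketch: identical code runs locally and in Pachyderm by
# parameterizing only the input/output paths.
import json
import os
import pathlib

# Locally these default to scratch directories; in a Pachyderm pipeline,
# set INPUT_DIR=/pfs/members and OUTPUT_DIR=/pfs/out.
INPUT_DIR = pathlib.Path(os.environ.get("INPUT_DIR", "./local_members"))
OUTPUT_DIR = pathlib.Path(os.environ.get("OUTPUT_DIR", "./local_out"))

def run() -> None:
    OUTPUT_DIR.mkdir(parents=True, exist_ok=True)
    for member_file in INPUT_DIR.glob("*.json"):
        events = json.loads(member_file.read_text())
        # Toy "insight": count the events for this member.
        insight = {"member": member_file.stem, "event_count": len(events)}
        (OUTPUT_DIR / f"{member_file.stem}.json").write_text(json.dumps(insight))

if __name__ == "__main__":
    run()
```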

In the end, the thing that mattered most was feeling like Pachyderm had thought clearly about the engineering problems and delivered a broad and elegant solution that scaled easily.

“I think I really understood the value of Pachyderm when I realized all I had to care about was where the inputs came from and where the data would go. Pachyderm magically does the rest, moving the right data where it needs to be in an efficient and scalable way,” he said. “It’s a really well thought out abstraction and automation solution that solves the ML problem beautifully for a broad range of use cases.”

Pachyderm's abstraction and automation system is really well designed. This has solved the ML problem for us beautifully for a broad range of use cases.
