Finding a Scalable Alternative to Complicated Delivery Pipelines
One of the top for-profit managed healthcare providers, with affiliate plans covering one in eight Americans for medical care, has a mission to be the most innovative, valuable and inclusive partner in health benefits. Given that mission, it’s no surprise that the company has a dedicated AI team looking to leverage cutting-edge AI to harvest long-term insights and make much more detailed health predictions from claims and electronic health record data.
That data store is massive, with more than 50 terabytes of data covering the company’s tens of millions of members across the U.S. They’re mining this data to determine treatment efficacy based on past outcomes given particular patient characteristics. Ultimately, this could allow a provider to sit down with a patient and discuss, out of dozens and dozens of different possible treatments, the best options for that patient’s specific situation by matching them to a similar cohort.
“This is a hard challenge because there’s so much data, so many different characteristics and so many possible combinations,” says an engineering lead at this top US-based healthcare company. “But it’s exciting, because once you get these insights into the hands of providers, it’s revolutionary. And that’s part of the joy of data engineering: taking a problem that is really big and doing the scalability optimizations needed to solve it really fast.”
Getting these potential insights into the hands of healthcare providers is where the challenge comes in. It’s one thing to have small-scale implementations working in a lab; it’s another to deliver machine learning at scale.
When the engineering lead joined the AI team, the team had a very complicated data delivery pipeline based on Apache Airflow. While it worked, it wouldn’t scale beyond a single pipeline or container instance at a time. As the team looked to solve those scaling issues, it also began to button up its production approach, ensuring that data was versioned and had immutable lineage.
But the more the team worked on these problems, the clearer it became that they’d need an entirely separate engineering team before they were solved.
That’s when a co-worker mentioned Pachyderm.
Pachyderm: the Data Foundation for Machine Learning
Pachyderm provides the data layer that allows machine learning teams to productionize and scale their machine learning lifecycle. With Pachyderm’s industry-leading data versioning, pipelines and lineage, teams gain data-driven automation, petabyte scalability and end-to-end reproducibility. For another Pachyderm user, RTL Nederlands, Pachyderm was the key to combining and orchestrating various subtasks into a unified way to process videos at scale. Because video processing is resource intensive, Pachyderm’s incrementality allowed that team to process only new videos as they arrived or changed, rather than reprocessing everything from scratch, which saved both time and money.
“Build reproducibility is one of those things that’s important like 5% of the time, but in that moment, it becomes critical,” notes the engineering lead. “Being able to take a buggy component or build and rewind to determine the cause is extremely valuable, and that’s exactly what Pachyderm gave us. I would have been completely lost without it.”
The AI team had looked at tools like DVC, but they just didn’t compare to the promise of Pachyderm. “Conceptually, DVC worked well, but as I looked closer at the framework, it felt like it was for smaller, more research-oriented projects,” explained the engineering lead. “I didn’t see how DVC would get us to a production environment. Pachyderm would.”
Pachyderm delivered the parallelism and data handling required to efficiently scale the AI team’s ML processing. Importantly, while the company had millions of patient records, only a small subset were relevant at any given time, and Pachyderm’s incrementality saved significant time, money and resources by only processing the subset of data that had changed, rather than the entire patient universe.
Specifically, the AI team has a wealth of rich information about each member, but that data is all represented as a single gigantic table. Before Pachyderm, the team had to process the entire table for every use case, regardless of whether the member records in it were relevant. This slowed the pipeline and wasted tremendous processing power.
“My initial question with Pachyderm was, how would it handle this table?” the engineering lead explains. “The epiphany was that Pachyderm allowed us to completely shift how we thought about processing our data.”
With Pachyderm, the team was able to arbitrarily partition table data to only capture events for a single member – effectively creating individual member objects that encapsulate all the events for a particular member. Pachyderm not only processed these records in parallel, it also automatically processed only those containing new information, increasing both scale and speed while reducing costs.
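The idea of per-member partitions and changed-only processing can be illustrated with a small, self-contained sketch. This is a conceptual illustration in plain Python, not Pachyderm’s implementation or API: the field name `member_id`, the sample events, and the content-hash comparison are all assumptions made for the example.

```python
import hashlib
import json
from collections import defaultdict


def partition_by_member(events):
    """Group flat table rows into one list of events per member."""
    partitions = defaultdict(list)
    for row in events:
        partitions[row["member_id"]].append(row)
    return dict(partitions)


def fingerprint(rows):
    """Content hash of one member's events; it changes only when the events change."""
    payload = json.dumps(rows, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def changed_members(partitions, previous_fingerprints):
    """Return the member IDs whose partition is new or differs from the last run."""
    return sorted(
        member_id
        for member_id, rows in partitions.items()
        if previous_fingerprints.get(member_id) != fingerprint(rows)
    )


# Hypothetical sample data for a first weekly run.
week1 = [
    {"member_id": "m1", "event": "claim", "code": "A10"},
    {"member_id": "m2", "event": "claim", "code": "B20"},
    {"member_id": "m1", "event": "visit", "code": "C30"},
]
parts1 = partition_by_member(week1)
seen = {member_id: fingerprint(rows) for member_id, rows in parts1.items()}

# Second run: only member m2 has a new event, so only m2 needs reprocessing.
week2 = week1 + [{"member_id": "m2", "event": "visit", "code": "D40"}]
parts2 = partition_by_member(week2)
to_process = changed_members(parts2, seen)
print(to_process)
```

Because each member’s events form an independent unit, the units can be processed in parallel, and any unit whose content is unchanged since the previous run can be skipped.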
Efficiency and Scale for Healthcare Insights
Since adopting Pachyderm, this top healthcare provider has seen a significant improvement in its processing efficiency. The AI team runs its pipelines weekly to accommodate updates to the source data; originally, this meant processing the entire two-terabyte table each time.
“With Pachyderm, we do one initial run, then only process updated or changed events on a weekly basis,” says the engineering lead. “That turns out to be less than 10% of the total – only 7.5GB of data – so we’re getting a huge 90% savings on our incremental runs for the pipeline, week in and week out. That’s really amazing.”
The AI team also really appreciates Pachyderm’s data lineage capabilities. The team’s original approach was to create a metadata file that was referenced for each insight, so they could trace provenance when needed. “But it was really brittle,” notes the engineering lead. “If we changed a path, we’d lose the lineage. With Pachyderm, it was as simple as adding the job and commit IDs to the insight. They’re not tied to any specific path, there’s no security issue, and the team can easily interpret them. I love that, it’s a huge win.”
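A minimal sketch of what path-independent lineage on an insight might look like follows. The `Insight` record, its field names, and the sample ID values are hypothetical; how the job and commit IDs are obtained from Pachyderm is outside the scope of this example, so they are simply passed in as strings.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class Insight:
    """A result plus the identifiers needed to trace it back to its source data."""
    cohort: str
    summary: str
    job_id: str      # ID of the job that produced the insight (assumed to be supplied by the caller)
    commit_id: str   # ID of the data commit the job read (assumed to be supplied by the caller)


def tag_insight(cohort, summary, job_id, commit_id):
    """Build an insight, refusing to create one without complete lineage."""
    if not job_id or not commit_id:
        raise ValueError("lineage requires both a job ID and a commit ID")
    return Insight(cohort, summary, job_id, commit_id)


# Placeholder values for illustration only.
insight = tag_insight("cohort-a", "placeholder result", "job-123", "commit-abc")
record = asdict(insight)
print(record)
```

Note that the record stores no file path, so moving or renaming data locations does not invalidate the lineage, which is the brittleness the earlier metadata-file approach suffered from.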
While the AI team is working to get its first insights into the hands of providers, it’s already actively planning how to use the Pachyderm-based system for treatments for a host of additional medical conditions. As part of that, the AI team has found Pachyderm very receptive to discussing new features and integrations that will further improve the team’s efficiency.
Other data science, data engineering and DevOps teams within the healthcare provider are also taking notice, and the engineering lead finds Pachyderm’s design facilitates this broader embrace.
“One of my first observations with Pachyderm was what matters most in the pipeline is that the right data shows up at the right place and time,” he notes. “That level of abstraction allows our teams to do 90% of development without even deploying anything to Pachyderm at all. When we get to production, all we do is change the pipeline path and everything runs just fine. This philosophy of creating a container at the last moment really protects the data scientists from having to deal directly with the infrastructure.”
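The development style described here, where code only cares about where inputs come from and where outputs go, might be sketched as follows. The function name, the JSON file layout, and the per-member event count are hypothetical stand-ins for real processing logic; the local run uses temporary directories so that nothing needs to be deployed.

```python
import json
import tempfile
from pathlib import Path


def process_member_files(input_dir, output_dir):
    """Read every *.json file in input_dir and write a per-member event count to output_dir."""
    input_dir, output_dir = Path(input_dir), Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for src in sorted(input_dir.glob("*.json")):
        events = json.loads(src.read_text())
        result = {"member_id": src.stem, "event_count": len(events)}
        dst = output_dir / src.name
        dst.write_text(json.dumps(result))
        written.append(dst.name)
    return written


# Local development run against temporary directories (hypothetical sample data).
with tempfile.TemporaryDirectory() as tmp:
    in_dir = Path(tmp) / "in"
    out_dir = Path(tmp) / "out"
    in_dir.mkdir()
    (in_dir / "m1.json").write_text(json.dumps([{"event": "claim"}, {"event": "visit"}]))
    written = process_member_files(in_dir, out_dir)
    local_result = json.loads((out_dir / "m1.json").read_text())

print(written, local_result)
```

In a Pachyderm pipeline, the same function would presumably be called with the mounted directories instead: input repositories appear under `/pfs/` and results are written to `/pfs/out`. Only the two path arguments change between development and production.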
In the end, the thing that mattered most was feeling like Pachyderm had thought clearly about the engineering problems and delivered a broad and elegant solution that scaled easily. “I think I really understood the value of Pachyderm when I realized all I had to care about was where the inputs came from and where the data would go. Pachyderm magically does the rest, moving the right data where it needs to be in an efficient and scalable way,” he said. “It’s a really well thought out abstraction and automation solution that solves the ML problem beautifully for a broad range of use cases.”