Adarga Processes Unstructured Data At Scale With Pachyderm

Key Benefits

Repeatability and Traceability in Model Development
Parallelization and Scale to Accelerate Production Readiness
ML Governance from Training to Production
10-12x Improvement in Processing Speed


Business Challenge: Unstructured Data

Adarga’s powerful AI platform is helping analysts, planners and decision-makers to rapidly identify threat and opportunity signals buried within huge volumes of unstructured data – both in-house and open-source. Users can produce detailed reports in seconds, monitor complex situations and understand intricate networks. By accelerating data processing and augmenting human intelligence, organizations can mitigate risk, act at speed and gain competitive advantage.

Why is this important? Unstructured data is exploding. This information is vast, complex and ever-changing, and a typical organization analyzes less than 1% of its unstructured data (Harvard Business Review). Maintaining a dynamic understanding of this environment is difficult, so potential threats and opportunities are missed. Adarga’s Knowledge Platform allows organizations to harness natural language processing (NLP), machine learning (ML) and network science to deal with this data complexity effectively and efficiently, unlocking its true value. The challenge for Adarga is how to develop, train, productionize and scale the necessary data models.

To address these challenges, Adarga chose Pachyderm to drive efficiency and reproducibility in its MLOps practices.


Technical Challenge: Repeatability and Traceability in Model Development

In order to continue improving the underlying NLP technology for information analysis, Adarga’s data scientists must develop and evaluate different ML models for efficiency and effectiveness. Pachyderm is central to this pre-production model training for the company.

“Historically, we weren’t very efficient,” says Olly Stephens, principal architect at Adarga. “Developing with any velocity was challenging, everything was ad hoc. Pachyderm allowed us to build a repeatable lab environment that made it much easier to get new models into production.”

Pachyderm provides clear understanding of data lineage during model experimentation, giving Adarga’s data scientists the insight needed for traceability and reproducibility. This effectively creates a controlled environment for Adarga, allowing the team to quickly assess and understand model development.

“Having a solution for data versioning is key,” notes Stephen Bull, data science manager at Adarga. “Pachyderm has helped us drive this consistency in our modelling.”
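
The data versioning and lineage idea described above can be illustrated with a small sketch. This is not Pachyderm's API; `LineageStore`, `commit` and `trace` are hypothetical names used only for illustration, and the store is a toy in-memory structure. It assumes that each artifact is identified by a hash of its content and its parents, so a model can be traced back to the exact data that produced it.

```python
import hashlib


class LineageStore:
    """Toy in-memory store (hypothetical): every artifact records its inputs."""

    def __init__(self):
        self.artifacts = {}  # artifact id -> raw bytes
        self.parents = {}    # artifact id -> list of parent artifact ids

    def commit(self, payload: bytes, parents=()) -> str:
        # The id depends on the content and on the parents, so identical
        # bytes derived from different inputs get different ids.
        parents = list(parents)
        digest = hashlib.sha256(payload + "".join(parents).encode()).hexdigest()[:12]
        self.artifacts[digest] = payload
        self.parents[digest] = parents
        return digest

    def trace(self, artifact_id: str) -> list:
        """Walk back from an artifact to every upstream input."""
        seen = []
        stack = [artifact_id]
        while stack:
            current = stack.pop()
            for parent in self.parents[current]:
                if parent not in seen:
                    seen.append(parent)
                    stack.append(parent)
        return seen


store = LineageStore()
raw = store.commit(b"raw documents v1")
clean = store.commit(b"tokenised documents", parents=[raw])
model = store.commit(b"model weights", parents=[clean])
lineage = store.trace(model)  # ids of the cleaned data, then the raw data
```

Committing the same bytes with the same parents returns the same identifier, which is what makes an experiment repeatable in this sketch.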

As part of its MLOps process, Adarga creates Pachyderm pipelines, uses Seldon to transform the data and develop models, then glues it all back together again with Pachyderm. As the company settles on production candidates, it can cement and coalesce cells in Seldon, recognizing that it has the needed traceability of all data thanks to Pachyderm.

“With Pachyderm, it becomes really slick to migrate models from our lab into production,” Bull says.

Technical Challenge: Parallelization and Scale to Accelerate Production Readiness

Beyond reproducibility, Adarga also relies on Pachyderm pipelines to scale its MLOps and accelerate production readiness.

Pachyderm offers several key advantages for data processing. First, it only processes new data as it’s added rather than rerunning an entire data set, significantly decreasing overall processing times. It also allows teams to switch and scale data sets without impacting the underlying architecture.
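
The idea of processing only newly added data can be sketched in a few lines. This is a conceptual illustration rather than Pachyderm's implementation; `IncrementalProcessor` and `fingerprint` are hypothetical names, and the sketch assumes that results can be cached by a hash of each input file's content.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Identify an input by its content, not by its file name."""
    return hashlib.sha256(data).hexdigest()


class IncrementalProcessor:
    """Hypothetical helper: runs the transform only on inputs not seen before."""

    def __init__(self, transform):
        self.transform = transform
        self.cache = {}  # content fingerprint -> cached result
        self.calls = 0   # how many times the transform actually ran

    def run(self, dataset: dict) -> dict:
        results = {}
        for name, data in dataset.items():
            key = fingerprint(data)
            if key not in self.cache:
                # New or changed content: do the work once and remember it.
                self.cache[key] = self.transform(data)
                self.calls += 1
            results[name] = self.cache[key]
        return results


processor = IncrementalProcessor(lambda data: data.decode().upper())
first = processor.run({"a.txt": b"alpha", "b.txt": b"beta"})
calls_after_first = processor.calls
# Adding one file triggers one extra transform call, not three.
second = processor.run({"a.txt": b"alpha", "b.txt": b"beta", "c.txt": b"gamma"})
calls_after_second = processor.calls
```

Under these assumptions, growing the dataset costs only the work for the new files, which is why a larger dataset need not force a restart.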

“Pachyderm provides us with more flexibility on projects where there is uncertainty,” says Bull. “Before Pachyderm, when we had to change our original dataset to something significantly larger it was highly disruptive. We basically had to start over again. With Pachyderm, that would have been a simple transition. That’s a lightbulb moment for a data scientist.”

Pachyderm also speeds development by allowing the team to take advantage of parallel processing and GPU resource sharing. “For one particular project, we were able to use Pachyderm to split up pre-processing across multiple parallel pipelines, providing a 10-12x reduction in processing time,” notes Bull. “We were done in about 20 minutes.”
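
Splitting pre-processing across parallel workers can be sketched as follows. This is an assumption-laden illustration, not Adarga's pipeline: the `shard`, `preprocess` and `parallel_preprocess` functions are hypothetical, the pre-processing step is a trivial tokenizer, and threads stand in for the separate pipeline workers only to keep the example self-contained.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items, n_shards):
    """Split items into n_shards round-robin slices."""
    return [items[i::n_shards] for i in range(n_shards)]


def preprocess(document: str) -> list:
    """Stand-in pre-processing step: lowercase and split on whitespace."""
    return document.lower().split()


def preprocess_shard(documents):
    return [preprocess(doc) for doc in documents]


def parallel_preprocess(documents, n_workers=4):
    shards = shard(documents, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        shard_results = list(pool.map(preprocess_shard, shards))
    # Undo the round-robin split so results line up with the input order.
    merged = [None] * len(documents)
    for offset, results in enumerate(shard_results):
        merged[offset::n_workers] = results
    return merged


docs = ["Threat Signals", "Open Source DATA", "Network science"]
tokens = parallel_preprocess(docs, n_workers=2)
```

For CPU-bound work in plain Python, processes or separate machines would be needed to see a real speed-up; the sketch only shows the split, map and merge structure.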

Pachyderm has also provided the company with a convenient way of spinning up multiple agents for a Weights & Biases hyperparameter optimization sweep, accelerating the team’s ability to explore different models. Pachyderm also allows Adarga teams to share cluster resources more efficiently through autoscaling, preventing wasteful idle time. Adarga additionally uses Pachyderm pipelines for queue-style processing, where many jobs can be submitted simultaneously and are picked up as resources become available.
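
Queue-style processing, where jobs are submitted together and picked up as workers become free, follows a common pattern that can be sketched with the Python standard library. This is not how Pachyderm schedules work internally; `run_queue` is a hypothetical helper, and squaring a number stands in for a real job.

```python
import queue
import threading


def run_queue(jobs, n_workers=2):
    """Submit all jobs up front; idle workers pull the next one when free."""
    pending = queue.Queue()
    for job_id, payload in jobs:
        pending.put((job_id, payload))

    results = {}
    lock = threading.Lock()

    def worker():
        while True:
            try:
                job_id, payload = pending.get_nowait()
            except queue.Empty:
                return  # nothing left to do, so this worker exits
            outcome = payload * payload  # placeholder for the real job
            with lock:
                results[job_id] = outcome

    threads = [threading.Thread(target=worker) for _ in range(n_workers)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


submitted = [(f"job-{i}", i) for i in range(10)]
finished = run_queue(submitted, n_workers=3)
```

The number of workers can change without touching the submission code, which mirrors the point that jobs simply wait until resources become available.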

Technical Challenge: ML Governance from Training to Production

Ultimately, Pachyderm has allowed Adarga to significantly narrow the gap between data science research and product development. Pachyderm facilitates this MLOps best practice by providing audit trails and traceability from production all the way back to training.

“With Pachyderm, we’re able to create more visible stages along the route to production,” says Bull. “This reduces risk and gives our product managers much more confidence in what is being built. This consistent approach is hugely valuable for a small company with limited resources.”

In fact, the team at Adarga sees implementing good MLOps practices as a story it can tell customers. “Raising the profile of the team and the work we do ultimately builds trust with our customer base,” Bull explains. “Pachyderm is a major enabler in achieving this.”

Another key feature the Adarga team likes about Pachyderm is its dashboard, which provides easy access for pipeline inspection. This is especially important for product managers and other less technical team members, who can easily drag and drop files for a specific ML process. “This dashboard is a feature that we’ve seen the Pachyderm team expanding quite quickly, and we’re quite excited for that,” Bull says.


The Future: Building an MLOps Platform for the Future

The data science team at Adarga has come to rely on Pachyderm to maintain a fast cadence for bringing new ML models to production and to let anyone in the organization log in and see that progress. That in turn has significantly improved the visibility of data science across the organization and increased confidence within product development.

“Pachyderm enables lots of different things within our organization, and it’s been so useful to be able to demonstrate those MLOps improvements to people through Pachyderm’s straightforward front end,” says Bull. “Going forward we’ll be doing all model training, experimentation and production through Pachyderm pipelines.”
