
Delivering actionable insights on spatiotemporal data

SeerAI provides a platform for spatial and temporal information analysis and visualization. For example, spatiotemporal data can be used to track the movement of animals, study the spread of diseases, analyze traffic patterns, or understand the impacts of climate change. Using SeerAI’s platform, customers gain a competitive advantage and answer previously impossible questions.

SeerAI chose Pachyderm because it could handle unstructured data from sources such as GPS sensors, remote sensing instruments, and imagery. In addition, Pachyderm was cost-effective at scale when processing large data sets that change continuously.

Key Benefits

  • Scalable across massive data sets
  • Easily manages massive and complex ML workflows
  • Automatic data versioning of large unstructured data sets
  • Handles petabyte-scale machine learning workloads

We built our platform on Pachyderm because of its scalability, cloud-native deployment, full data lineage tracking, and lightly opinionated patterns for repositories and pipelines. Pachyderm just fit really well with what we were doing.

Business Challenge:
Powering Global Data Fusion at Scale

SeerAI’s flagship offering, Geodesic, is the world’s first decentralized platform optimized for deriving insights and analytics from planetary-scale spatiotemporal data.

Working with spatiotemporal data is a challenge. Because it concerns planetwide questions, the data sets are massive in scale, often entailing petabytes of imagery. The data itself can come from many different sources, requiring the ability to load and manage it under a decentralized data model. Finally, that data is generally heterogeneous and unstructured, and thus notoriously complex and difficult to deal with.

SeerAI designed Geodesic to constantly grow in knowledge and data relationships so that it can eventually answer almost any question. “You might start with a question on corn crop yields that then transitions to climate, supply chain, biofuel production, etc. to cut across many verticals – meaning we’re dealing with a broad range of data sets, with complex inter-relationships and at a massive scale,” notes Daniel Wilson, co-founder and CTO at SeerAI. “Controlling the data ingest, ML job scheduling, model interaction, and data versioning can be extremely complex at this scale.”

You can build these really complex workflows, and in every case Pachyderm serves as this amazing glue to link all of these systems together.

Technical Challenge:
Scalability Across Massive Data Sets

Spatiotemporal data sets are very large, often entailing petabytes of image data, so scalability was a key requirement for SeerAI. The team considered other platforms, such as Apache Airflow, for its data ingest, but ultimately chose Pachyderm. “We built our platform on Pachyderm because of its scalability, cloud-native deployment, full data lineage tracking and lightly opinionated patterns for repositories and pipelines,” explains Wilson. “Pachyderm fit really well with what we were doing.”

Pachyderm is cloud-native and highly scalable, which allows SeerAI to easily create and work with multiple pipelines and repositories for its data science workflows. In addition, Pachyderm automatically handles triggering transformations, data sharing, data versioning, parallelism, and resource management, allowing data to be delivered more efficiently. Pachyderm’s automatic incremental processing saves compute by processing only the differences and skipping duplicate data. Because Pachyderm manages both the pipelines and the data, workloads can autoscale with parallel processing without writing any code.
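As a rough illustration, a Pachyderm pipeline that processes each datum in an input repo in parallel is declared with a small JSON spec; the repo, image, and script names below are hypothetical, but the fields (`pipeline`, `transform`, `input.pfs`, `glob`, `parallelism_spec`) follow Pachyderm’s pipeline specification. The glob pattern splits the input into datums, which is what lets Pachyderm process only changed data and fan work out across workers:

```json
{
  "pipeline": { "name": "tile-preprocess" },
  "transform": {
    "image": "seerai/preprocess:latest",
    "cmd": ["python3", "/app/preprocess.py", "/pfs/raw-imagery", "/pfs/out"]
  },
  "input": {
    "pfs": {
      "repo": "raw-imagery",
      "glob": "/*"
    }
  },
  "parallelism_spec": { "constant": 8 }
}
```

With `"glob": "/*"`, each top-level file or directory in `raw-imagery` becomes its own datum, so a new commit that touches one tile reprocesses only that tile rather than the whole data set.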

Technical Challenge:
Managing Massive and Complex ML Workflows

Pachyderm works with the core microservices in Geodesic for heterogeneous data search and preparation. In fact, the team built a specific component, called Blackhole, that runs on top of Pachyderm as the platform’s ETL framework, allowing Geodesic to ingest and receive raw unstructured data from a variety of sources and feed it into a messaging system.

Instead of just throwing that raw, unstructured data into a data lake, the team uses Pachyderm within Blackhole to handle processing and formatting so the data is readily queryable. Specifically, Wilson uses a combination of Spout, a Pachyderm feature for handling streaming data ingestion, along with Cron pipelines for running scheduled tasks.
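For illustration, a Cron pipeline of the kind described above is declared by giving the pipeline a cron input with a schedule; the pipeline, image, and script names here are hypothetical, while the `input.cron` fields follow Pachyderm’s spec:

```json
{
  "pipeline": { "name": "scheduled-ingest" },
  "transform": {
    "image": "seerai/ingest:latest",
    "cmd": ["python3", "/app/pull_feed.py", "/pfs/out"]
  },
  "input": {
    "cron": {
      "name": "tick",
      "spec": "@every 1h",
      "overwrite": true
    }
  }
}
```

A Spout pipeline for streaming ingestion is declared similarly, except it sets a `"spout": {}` field and takes no input; its container runs continuously, consuming the stream and writing results into the pipeline’s output repo.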

Pachyderm also allows the team to better control machine learning job management. “Rather than trying to do everything within our own framework, we find Pachyderm makes it really, really easy to combine and spawn different workflows,” he explains. “You can build these really complex workflows – maybe we’re running a job in Pachyderm or another microservice, maybe we’ve got a large amount of data to work with in Databricks – and in every case, Pachyderm serves as this amazing glue to link all of these systems together.”

Pachyderm provides the necessary machine learning job management for Geodesic to simultaneously handle the scheduling, running, and interactions between multiple ML jobs.

Technical Challenge:
Data Versioning of Large Data Sets

Finally, the team uses Pachyderm’s version control for data product management and maintenance, allowing them to understand how various changes to the input, modeling, and code influence the ML output.

One challenge for the team was understanding how to retain Pachyderm’s data lineage when data moves out of the pipeline into another microservice. “When data leaves Pachyderm to another system, it’s no longer directly versioned in Pachyderm, and that’s critical information we’re concerned about losing,” says Wilson. “Fortunately we found Pachyderm plays really well when it comes to lineage.”

To maintain its version tracking, Wilson’s team ensures that data written into Geodesic is augmented with Pachyderm-based version metadata. Pachyderm commit information included in each response lets the team identify the source location within Pachyderm, leading back to the original pipeline that created the data and all the associated code within that pipeline. SeerAI uses GitOps on its Pachyderm pipelines so that everything can be traced back to the exact code that produced an output, even several steps downstream of Pachyderm. “Even though we’re querying data outside Pachyderm, we were still able to trace it back because the ingest is fully managed through Pachyderm’s data-driven pipelines.”
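One way to sketch this augmentation step: Pachyderm injects job metadata into each pipeline job’s environment (variable names such as `PACH_JOB_ID`, `PPS_PIPELINE_NAME`, and a per-input `<input>_COMMIT` are taken from Pachyderm’s documentation and should be treated as assumptions, as should the record schema, which is purely illustrative):

```python
import json
import os


def with_lineage(record: dict, input_name: str = "raw_imagery") -> dict:
    """Attach Pachyderm lineage metadata to a record before it
    leaves the pipeline for a downstream system.

    Reads the job ID, pipeline name, and the input repo's commit ID
    from the environment Pachyderm provides to pipeline containers;
    falls back to "unknown" when running outside a Pachyderm job.
    """
    meta = {
        "pach_job_id": os.environ.get("PACH_JOB_ID", "unknown"),
        "pach_pipeline": os.environ.get("PPS_PIPELINE_NAME", "unknown"),
        "pach_input_commit": os.environ.get(f"{input_name}_COMMIT", "unknown"),
    }
    # Return a new record rather than mutating the caller's dict.
    return {**record, "lineage": meta}


if __name__ == "__main__":
    rec = with_lineage({"tile_id": "x12y34", "ndvi_mean": 0.42})
    print(json.dumps(rec, indent=2))
```

Any system that later serves this record can hand the `lineage` block back, which is enough to locate the originating commit and pipeline inside Pachyderm.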

Overall, Pachyderm simply enables us to do things that we really can’t do with alternatives.

The Future
Future-Proofing SeerAI’s MLOps Platform

Moving forward, the SeerAI team is investigating how to take advantage of Pachyderm’s ability to work with object data when it’s stored directly within Pachyderm, as opposed to another system like Amazon S3 or Google Cloud Storage. Because Pachyderm excels at versioning individual pieces of data, Wilson can write directly into a Pachyderm repository and let it manage and track the metadata associated with individual data chunks, effectively “versioning” an entire data array.

“These data sets can be huge, and every time something changes I don’t want to have to rebuild the whole data set or store separate version information for all the changes,” notes Wilson. “When I write into a Pachyderm repository I’d actually have all that information natively.”

With Pachyderm, Wilson won’t have to store the entire data set, just what’s changed. The team can maintain lineage on the data sets themselves, tracing output back to the models, model inputs and even back to the code that generated them.

“This gives us full lineage on all the data we store in these systems, all thanks to Pachyderm,” he says. “Overall, Pachyderm simply enables us to do things that we really can’t do with alternatives.”
