Data-Centric AI Pipelines & Versioning

Optimizing Datasets provides bigger benefits for most teams

Over the last decade, AI/ML researchers have focused on code and algorithms first and foremost.

The data was usually imported once and then left fixed or frozen. If there were problems with noisy data or bad labels, teams usually worked to overcome them in the code.

Because so much time was spent working on the algorithms, they're largely a solved problem for many use cases like image recognition and text translation.

Swapping one algorithm for a newer one often makes little to no difference.

Data-Centric AI flips that on its head and says we should go back and fix the data itself. Clean up the noise. Augment the data set to deal with it. Re-label so it’s more consistent.

The essential ingredients needed to put Data-Centric AI into practice in your organization are:

Creative Thinking

Data-Centric AI demands creative problem solving and thinking through the solution from start to finish.

Synthetic Data

Synthetic data is artificially created data used to train AI models. Generating it can be more scalable and less error-prone than collecting and labeling real-world data by hand.

Data Augmentation

Run data through filters or alter it slightly to create more variance in the data and more samples to work with.
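As a rough illustration of the idea, the sketch below creates extra samples from one tiny grayscale image by mirroring it and adding small random noise. The function names (`horizontal_flip`, `add_noise`, `augment`) and the list-of-lists image format are assumptions made for this example, not part of any particular tool.

```python
import random


def horizontal_flip(image):
    """Mirror each row of a 2D grayscale image stored as a list of lists."""
    return [list(reversed(row)) for row in image]


def add_noise(image, amount, rng):
    """Shift every pixel by a random integer in [-amount, amount], clamped to 0..255."""
    return [
        [min(255, max(0, px + rng.randint(-amount, amount))) for px in row]
        for row in image
    ]


def augment(image, rng, noise_amount=10):
    """Return the original image plus a flipped copy and a noisy copy of each."""
    flipped = horizontal_flip(image)
    return [
        image,
        flipped,
        add_noise(image, noise_amount, rng),
        add_noise(flipped, noise_amount, rng),
    ]


# Hypothetical 2x3 image; a seeded generator keeps the run repeatable.
image = [[0, 50, 100], [150, 200, 255]]
variants = augment(image, random.Random(0))
print(len(variants), "samples from 1 original")
```

Real projects would usually apply domain-appropriate transforms (crops, rotations, paraphrases, and so on); the point here is only that one labeled sample becomes several slightly different ones.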

Testing

Build human-in-the-loop tests at every stage for validation of labeling consistency, data integrity and data quality.
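One simple form of a labeling-consistency check is to compare the labels that several annotators gave the same item and route disagreements to a human reviewer. The sketch below assumes a hypothetical `annotations` mapping from item ID to a list of labels; the function name and the agreement threshold are illustrative choices.

```python
from collections import Counter


def find_inconsistent_labels(annotations, min_agreement=1.0):
    """Return {item_id: agreement} for items whose annotators disagree.

    Agreement is the share of annotators who chose the most common label.
    Items with agreement below min_agreement are flagged for human review.
    """
    flagged = {}
    for item_id, labels in annotations.items():
        if not labels:
            continue  # nothing to compare for unlabeled items
        top_count = Counter(labels).most_common(1)[0][1]
        agreement = top_count / len(labels)
        if agreement < min_agreement:
            flagged[item_id] = agreement
    return flagged


# Hypothetical labels from different annotators.
annotations = {
    "img_001": ["cat", "cat", "cat"],
    "img_002": ["cat", "dog", "cat"],
    "img_003": ["dog", "dog"],
}
needs_review = find_inconsistent_labels(annotations)
print(needs_review)
```

A check like this can run at each pipeline stage, with the flagged items handed to a person rather than silently resolved by majority vote.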

Tooling

Robust tooling for data engineering orchestration pipelines and data science experimentation pipelines.

The key to Data-Centric AI is treating data as the center of the process, not as an afterthought.

Do that and you’re on your way to a powerful Data Centric AI solution that delivers big results in the real world now. This chart illustrates the difference between Model Centric and Data Centric AI:

(Source: A Chat with Andrew: From Model-Centric to Data-Centric AI)

Key Features of Pachyderm

Pachyderm is cost-effective at scale and enables data engineering teams to automate complex pipelines with sophisticated data transformations.

Scalability

Delivering reliable results faster maximizes developer efficiency.

Automated diff-based data-driven pipelines.

Deduplication of data saves infrastructure costs.
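To make the ideas of diff-based processing and deduplication concrete, here is a minimal sketch built on content hashes. It is not Pachyderm's implementation or API; `content_hash`, `changed_files`, and `dedup_store` are hypothetical helpers, and files are modeled as in-memory bytes for simplicity.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Identify a file by the SHA-256 digest of its contents."""
    return hashlib.sha256(data).hexdigest()


def changed_files(previous_hashes, current_files):
    """Return (paths that are new or changed, hash map for the current run)."""
    new_hashes = {path: content_hash(data) for path, data in current_files.items()}
    to_process = sorted(
        path for path, digest in new_hashes.items()
        if previous_hashes.get(path) != digest
    )
    return to_process, new_hashes


def dedup_store(files):
    """Keep one copy of each distinct content, keyed by its hash."""
    store = {}
    for data in files.values():
        store.setdefault(content_hash(data), data)
    return store


# First run: everything is new, so everything is processed.
run1 = {"a.csv": b"1,2", "b.csv": b"3,4"}
todo1, hashes1 = changed_files({}, run1)

# Second run: only the edited file and the new file need processing.
run2 = {"a.csv": b"1,2", "b.csv": b"3,5", "c.csv": b"1,2"}
todo2, hashes2 = changed_files(hashes1, run2)

# a.csv and c.csv have identical contents, so only two blobs are stored.
store = dedup_store(run2)
print(todo1, todo2, len(store))
```

Skipping unchanged inputs is what keeps reruns fast, and storing identical content once is where the infrastructure savings come from.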

Reproducibility

Immutable data lineage ensures compliance.

Data versioning of all data types and metadata.

Familiar git-like structure of commits, branches, & repos.
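The git-like model of commits and branches can be pictured with a toy in-memory repository. This is only a sketch of the general concept under simplifying assumptions; the `Repo` class and its methods are hypothetical and do not reflect Pachyderm's actual API or storage format.

```python
import hashlib
import json


class Repo:
    """Toy data repo: commits are never modified and each points to its parent."""

    def __init__(self):
        self.commits = {}   # commit id -> {"parent": id or None, "files": {path: hash}}
        self.branches = {}  # branch name -> id of the latest commit

    def commit(self, branch, files):
        """Record a snapshot of files (path -> bytes) on a branch; return its id."""
        record = {
            "parent": self.branches.get(branch),
            "files": {p: hashlib.sha256(d).hexdigest() for p, d in files.items()},
        }
        # The id is derived from the record, so any change yields a different id.
        commit_id = hashlib.sha256(
            json.dumps(record, sort_keys=True).encode()
        ).hexdigest()
        self.commits[commit_id] = record
        self.branches[branch] = commit_id
        return commit_id

    def history(self, branch):
        """Walk parent pointers from the branch head back to the first commit."""
        lineage = []
        commit_id = self.branches.get(branch)
        while commit_id is not None:
            lineage.append(commit_id)
            commit_id = self.commits[commit_id]["parent"]
        return lineage


repo = Repo()
first = repo.commit("master", {"labels.csv": b"img_001,cat"})
second = repo.commit("master", {"labels.csv": b"img_001,dog"})
print(repo.history("master") == [second, first])
```

Because every commit records its parent and the hashes of its files, any result can be traced back to the exact data version that produced it.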

Flexibility

Leverage existing infrastructure investment.

Language agnostic - use any language to process data.

Data agnostic - unstructured, structured, batch, & streaming.
