Data-Centric AI

Optimizing Datasets Provides Bigger Benefits For Most Teams


Over the last decade, AI/ML researchers have focused on code and algorithms first and foremost.

The data was usually imported once and then left fixed or frozen. If the data was noisy or the labels were bad, teams typically tried to compensate for it in the code.

Because so much time was spent working on the algorithms, they’re largely a solved problem for many use cases like image recognition and text translation. 

Swapping one algorithm out for another often makes little to no difference.

Data-Centric AI flips that on its head and says we should go back and fix the data itself. Clean up the noise. Augment the dataset to cover the cases it misses. Re-label it so it's more consistent.

There are six essential ingredients needed to put Data-Centric AI into practice in your organization:

Creative Thinking

Data Centric AI demands creative problem solving and thinking through the solution from start to finish.

Synthetic Data

Synthetic Data is artificially created data used to train AI models.
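As a minimal sketch of what that can look like, synthetic tabular data can be generated from known distributions and labeled by construction. The two-class setup, class names, and cluster parameters below are purely illustrative assumptions:

```python
# A minimal synthetic-data sketch: two hypothetical classes ("cat"/"dog")
# drawn from Gaussian clusters, labeled by construction.
import random

def make_synthetic_samples(n, seed=0):
    """Generate n labeled 2-D points from two Gaussian clusters."""
    rng = random.Random(seed)
    samples = []
    for _ in range(n):
        label = rng.choice(["cat", "dog"])          # illustrative classes
        center = (0.0, 0.0) if label == "cat" else (3.0, 3.0)
        point = tuple(c + rng.gauss(0, 1) for c in center)
        samples.append((point, label))
    return samples

data = make_synthetic_samples(1000)
```

Because the generator is seeded, the same call reproduces the same dataset, which keeps synthetic training runs comparable across experiments.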

Data Augmentation

Data Augmentation involves running data through filters or altering it slightly to create more variance in the data and more samples to work with in your models. 
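As a concrete illustration, augmentation can be as simple as mirroring or slightly perturbing each sample. This sketch operates on plain nested lists of pixel values; the function names are illustrative, not from any particular library:

```python
# A minimal augmentation sketch on image-like nested lists of pixel values.
import random

def horizontal_flip(image):
    """Mirror each pixel row to create a new training sample."""
    return [list(reversed(row)) for row in image]

def add_noise(image, sigma=5.0, seed=0):
    """Perturb pixel values slightly to increase variance, clamped to 0-255."""
    rng = random.Random(seed)
    return [[min(255.0, max(0.0, p + rng.gauss(0, sigma))) for p in row]
            for row in image]

def augment(image):
    """Return the original plus two augmented variants."""
    return [image, horizontal_flip(image), add_noise(image)]
```

Each original image yields multiple training samples, which is the point: more variance and more data without collecting anything new.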


Human-in-the-Loop Testing

Build human-in-the-loop tests at every stage to validate things like labeling consistency, as well as data integrity and data quality.
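One simple form such a test can take is an automated agreement check between labelers, with disagreements routed back to a human reviewer. This is a minimal sketch; the two-labeler setup and function names are illustrative assumptions:

```python
# A minimal human-in-the-loop consistency check between two labelers.
def agreement_rate(labels_a, labels_b):
    """Fraction of items on which two labelers agree."""
    assert len(labels_a) == len(labels_b)
    matches = sum(a == b for a, b in zip(labels_a, labels_b))
    return matches / len(labels_a)

def flag_disagreements(items, labels_a, labels_b):
    """Collect conflicting items so a human can re-review them."""
    return [item for item, a, b in zip(items, labels_a, labels_b) if a != b]
```

A team might gate each labeling round on a minimum agreement rate and feed the flagged items into the next round of instruction clarification.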

Clarifying Instructions

Clarify your instructions to labelers and data engineers as the data goes through the iterative process of labeling, clarifying and labeling again.


Tooling

Tooling includes both data engineering orchestration pipelines and data science experimentation pipelines.

The key to Data-Centric AI is treating data as the center of the process, not as an afterthought.

Do that and you’re on your way to a powerful Data-Centric AI solution that delivers real-world results. This chart illustrates the difference between Model-Centric and Data-Centric AI:

Model Centric vs Data Centric

(Source: A Chat with Andrew: From Model-Centric to Data-Centric AI)

Pachyderm: The Data Foundation for a Data Centric Approach to AI

Pachyderm’s data-driven ML pipelines and data versioning provide a strong and flexible foundation for a data centric approach to AI.


Pachyderm’s data-driven automation makes it easy for teams to quickly iterate on data from a wide variety of sources to build and improve models:

  • Pipelines execute automatically based on data changes
  • Pipelines are code and framework agnostic
  • Data sharding and parallel processing without requiring additional code
  • Pipelines access data directly, with no SDK or API required
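To illustrate the shape of such a pipeline, a Pachyderm pipeline is defined by a small JSON spec roughly like the sketch below. The repo name, container image, and script path are hypothetical, and the exact set of supported fields should be checked against the Pachyderm documentation:

```
{
  "pipeline": { "name": "augment" },
  "input": {
    "pfs": {
      "repo": "images",
      "glob": "/*"
    }
  },
  "transform": {
    "image": "my-augment:latest",
    "cmd": ["python3", "/augment.py"]
  }
}
```

The glob pattern splits the input into datums that can be processed in parallel, and the pipeline re-runs automatically whenever new data lands in the input repo, with files exposed to the container as an ordinary filesystem path rather than through an SDK.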


Pachyderm efficiently scales to petabytes of data, making it possible to put data at the center of your ML approach while staying within your budget:

  • Deduplication as well as incremental and parallel processing reduce storage and compute costs while cutting processing time
  • Support for both unstructured and structured data
  • Support for any file type


Pachyderm provides a robust set of tools for tracking data lineage so you can easily assess the impact of data changes on model performance:

  • All data changes captured automatically via Git-like commits
  • Use branching to easily compare the effect of data changes to model performance
  • Easily visualize and track data changes across your entire DAG via our Console
  • Full lineage provided for both code and data changes across all pipeline steps including metadata, artifacts and metrics

See Pachyderm In Action

Watch a short five-minute demo that shows the product in action.

Try Pachyderm Today