Over the last decade, AI/ML researchers have focused on code and algorithms first and foremost.
The data was usually imported once and then left fixed or frozen. If there were problems with noisy data or bad labels, researchers would usually work to overcome them in the code.
Because so much time was spent working on the algorithms, they're largely a solved problem for many use cases like image recognition and text translation.
Swapping them out for a new algorithm often makes little to no difference.
Data-Centric AI flips that on its head and says we should go back and fix the data itself. Clean up the noise. Augment the data set to deal with it. Re-label so it’s more consistent.
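As a rough illustration of the "re-label so it's more consistent" step, the sketch below resolves conflicting annotations by majority vote and flags ties for human review. It is only one possible approach, not a prescribed method; the function name `relabel_by_majority` and the sample annotations are hypothetical.

```python
from collections import Counter, defaultdict


def relabel_by_majority(examples):
    """Resolve conflicting labels per item by majority vote.

    `examples` is a list of (item_id, label) pairs; an item may appear
    several times if multiple annotators labeled it.
    Returns (cleaned, flagged): a dict of item_id -> agreed label, and a
    list of item_ids whose top labels are tied and need human review.
    """
    votes = defaultdict(Counter)
    for item_id, label in examples:
        votes[item_id][label] += 1

    cleaned = {}
    flagged = []
    for item_id, counter in votes.items():
        (top_label, top_count), *rest = counter.most_common()
        if rest and rest[0][1] == top_count:
            # No clear winner: leave it for a person to decide.
            flagged.append(item_id)
        else:
            cleaned[item_id] = top_label
    return cleaned, flagged


# Hypothetical annotations from several labelers.
annotations = [
    ("img_1", "cat"), ("img_1", "cat"), ("img_1", "dog"),
    ("img_2", "dog"), ("img_2", "cat"),
    ("img_3", "bird"),
]
cleaned, flagged = relabel_by_majority(annotations)
print(cleaned)   # items with an agreed label
print(flagged)   # items that need another look
```

The point of the sketch is that the fix happens in the data set itself, before any model sees it, rather than in model code that tries to tolerate the inconsistent labels.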
There are six essential ingredients needed to put Data-Centric AI into practice in your organization:
Do that and you’re on your way to a powerful Data-Centric AI solution that delivers big results in the real world now. This chart illustrates the difference between Model-Centric and Data-Centric AI:
(Source: A Chat with Andrew: From Model-Centric to Data-Centric AI)
Pachyderm is cost-effective at scale and enables data engineering teams to automate complex pipelines with sophisticated data transformations.
Delivering reliable results faster maximizes dev efficiency.
Automated diff-based data-driven pipelines.
Deduplication of data saves infrastructure costs.
Immutable data lineage ensures compliance.
Data versioning of all data types and metadata.
Familiar git-like structure of commits, branches, & repos.
Leverage existing infrastructure investment.
Language agnostic: use any language to process data.
Data agnostic: unstructured, structured, batch, & streaming.
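To make the ideas of data versioning, deduplication, and diff-based processing in the list above more concrete, here is a toy sketch of a content-addressed store with git-like commits. It is an illustration of the general technique only and does not represent Pachyderm's implementation or API; the class `ToyRepo` and its methods are hypothetical.

```python
import hashlib


class ToyRepo:
    """A toy, in-memory versioned data store (illustrative only)."""

    def __init__(self):
        self.blobs = {}    # content hash -> bytes, each unique content stored once
        self.commits = []  # each commit is a snapshot: path -> content hash

    def commit(self, files):
        """Record a snapshot of `files` (path -> bytes) and return its commit id."""
        snapshot = {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            # Deduplication: identical content is stored only once.
            self.blobs.setdefault(digest, data)
            snapshot[path] = digest
        self.commits.append(snapshot)
        return len(self.commits) - 1

    def diff(self, old_id, new_id):
        """Return paths that are new or changed in `new_id` relative to `old_id`."""
        old, new = self.commits[old_id], self.commits[new_id]
        return sorted(p for p, h in new.items() if old.get(p) != h)


repo = ToyRepo()
c0 = repo.commit({"a.csv": b"1,2,3", "b.csv": b"4,5,6"})
c1 = repo.commit({"a.csv": b"1,2,3", "b.csv": b"4,5,7", "c.csv": b"1,2,3"})

# A diff-based pipeline would only reprocess these paths.
changed = repo.diff(c0, c1)
print(changed)
print(len(repo.blobs))  # number of distinct contents actually stored
```

In this sketch, `c.csv` has the same content as `a.csv`, so it adds no new blob, and only the paths returned by `diff` would need to be reprocessed after the second commit.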
Watch a short demo which outlines the product in action