The focus on models has driven innovative and stable machine learning tools for research and enterprise. The rise of data-centric AI shifts the focus from models and code to the quality and context of your datasets...
As enterprise applications of machine learning have matured, so have the needs of the scientists and engineers building these tools: the ingestion, transformation, and interpretation of data have evolved to suit specific audiences, data structures, and use cases.
Selecting a model is a solved problem for many scenarios. However, things are not so simple on the data side. Some data problems still need to be fixed: for example, science and analyst teams can’t access the data they need for an application, pipeline tools can’t handle streaming data, and datasets can be either over- or under-conforming.
Data doesn’t appear out of thin air, least of all usable data for enterprise machine learning. And for tasks that work with rich media, such as image detection and natural language processing, generating more data may not be feasible from an operational or budgetary perspective. So what’s a data engineer to do?
This ebook describes methods to enhance and augment your data, or to increase the amount of high-quality data that feeds your model. It also covers best practices for managing complex data labeling and describes how Pachyderm’s data versioning and lineage ensure you don’t lose a high-performing dataset to an accidental overwrite.
Understanding what your data has produced and how it got there makes your machine learning operations more efficient. In addition, it builds deeper trust across your organization in machine learning capabilities for your use cases.
Why focus on versioning and lineage for data and models? Because versioning models is only half of the equation. Knowing which dataset was used, when, and why is critical to understanding where things have broken down, and it lets you test the same dataset against different model versions.
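To make the idea of recording which dataset was used, when, and why a little more concrete, here is a minimal Python sketch. It is an illustration only and does not use Pachyderm's API: the function names (`dataset_fingerprint`, `record_run`, `runs_on_dataset`) and the in-memory log are hypothetical, and the dataset is assumed to be a small list of JSON-serializable records.

```python
import hashlib
import json
from datetime import datetime, timezone


def dataset_fingerprint(records):
    """Return a SHA-256 hex digest of a list of JSON-serializable records.

    Keys inside each record are sorted so that key order does not change
    the digest; the order of the records themselves still matters.
    """
    canonical = json.dumps(records, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def record_run(log, model_version, records, reason):
    """Append a lineage entry (hypothetical schema) linking a model version
    to the exact dataset contents, a timestamp, and a reason."""
    entry = {
        "model_version": model_version,
        "dataset_id": dataset_fingerprint(records),
        "when": datetime.now(timezone.utc).isoformat(),
        "why": reason,
    }
    log.append(entry)
    return entry


def runs_on_dataset(log, dataset_id):
    """List the model versions that were run on a given dataset fingerprint."""
    return [e["model_version"] for e in log if e["dataset_id"] == dataset_id]
```

With a log like this, two model versions trained on identical data share a `dataset_id`, so a drop in performance can be traced either to a model change or to a data change rather than guessed at.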