Data versioning is essential for effective big data and data-centric AI and analytics applications. When files in a dataset are added, removed, or modified, version control records those changes so that users can review what changed, how it changed, and who made the change. This can be as simple as adding a suffix to document file names, or as complex as keeping the full change log and access history for sensitive personnel data, or tracking the most recent stable production version of a cloud app.
In today’s organizations, versioning data is essential for several reasons. Understanding how data has changed over time provides important background for analytics and operational decision-making. Keeping a history of older versions of your data makes it possible to revert when a newer version causes data or performance loss. The business benefits of data versioning, especially for a technology business, are the following:
Produce Reliable Results: Data is dynamic and changes constantly as new information flows in. Developers of AI/ML models therefore have to ensure they use the correct version of their data to produce accurate outcomes under a given set of assumptions, whether they want to process all new data or process the data as it was 6 months ago. Without a reliable, retrievable version of the dataset a model was trained on, they cannot replicate an experiment and expect the same results as last time.
Make Better Decisions: When working with data, it is important to remember that it is not always correct or accurate. Trust but verify is the best approach – and there’s no verifying unversioned data. Updates are often made to address dataset errors or conform with formatting and processing needs, resulting in newer versions of the data. With accurate and reliable version history for business and analytics data, customers and internal teams are more confident in implementing strategic, data-driven decisions.
Meet Compliance Requirements Faster: Strict data collection and privacy regulations in industries like healthcare and finance can make compliant machine learning and analytics applications challenging. With a version control method that can give teams a history not just of when a datum was changed but how it was changed, Pachyderm enables teams to know what version of a machine learning model processed an individual file, and when. This creates a fully auditable history of your data, and the ways it was used and transformed across time.
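The reproducibility and audit benefits above both depend on recording exactly which version of a dataset a model run consumed. Here is a minimal Python sketch of that idea, assuming a dataset can be represented as a mapping from file path to bytes. The names `dataset_version` and `run_record` are hypothetical and chosen for illustration; this is not Pachyderm's API.

```python
from __future__ import annotations

import hashlib


def dataset_version(files: dict[str, bytes]) -> str:
    """Derive a version ID from file paths and contents (illustrative only)."""
    h = hashlib.sha256()
    # Sort paths so the ID does not depend on insertion order.
    for path in sorted(files):
        h.update(path.encode("utf-8"))
        h.update(b"\0")  # separator between path and content hash
        h.update(hashlib.sha256(files[path]).digest())
    return h.hexdigest()


# Hypothetical training data and a record of which version a run used.
train_v1 = {"train.csv": b"x,y\n1,2\n"}
run_record = {"model": "demo-model", "dataset_version": dataset_version(train_v1)}

# Later: check whether the data on hand matches what the run was trained on.
unchanged = {"train.csv": b"x,y\n1,2\n"}
edited = {"train.csv": b"x,y\n1,3\n"}
same = dataset_version(unchanged) == run_record["dataset_version"]
differs = dataset_version(edited) != run_record["dataset_version"]
```

Storing the version ID next to each run is one way to tell later whether an experiment can be replicated against identical data, and to answer which data a given model saw.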
Pachyderm’s data versioning is built into its data-driven pipelines. The software was built to allow users to define the size of their datum – the smallest individual unit that can be defined in the dataset, which is then used for processing the data with a machine learning model. Because of this, Pachyderm sees more than other pipeline tools: where others see a dataset, Pachyderm is able to see and direct each datum. It tracks changes to each datum over time with a global ID, and uses this version control and ID tracking to scale data processing and to trigger data pipelines only when there is something new or different to process.
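The idea of processing only what is new or different can be illustrated with a small Python sketch. It assumes each datum is a single file and uses a content hash as a stand-in for a datum ID. The function names are hypothetical, and this is not how Pachyderm itself is implemented.

```python
from __future__ import annotations

import hashlib


def datum_id(content: bytes) -> str:
    """Content hash used here as a stand-in for a datum's version ID."""
    return hashlib.sha256(content).hexdigest()


def changed_datums(previous: dict[str, str], current: dict[str, bytes]) -> list[str]:
    """Return paths whose content is new or differs from the previous commit."""
    return sorted(
        path
        for path, content in current.items()
        if previous.get(path) != datum_id(content)
    )


# Hypothetical first commit: remember the ID of every datum already processed.
commit_1 = {"a.csv": b"1,2,3", "b.csv": b"4,5,6"}
seen = {path: datum_id(content) for path, content in commit_1.items()}

# Second commit: a.csv is untouched, b.csv is modified, c.csv is new.
commit_2 = {"a.csv": b"1,2,3", "b.csv": b"4,5,7", "c.csv": b"8,9"}
to_process = changed_datums(seen, commit_2)
print(to_process)  # ['b.csv', 'c.csv']
```

Only the modified and newly added files are selected for processing. The unchanged `a.csv` is skipped, which is what lets the work scale with the size of the change instead of the size of the dataset.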
Learn more about how we approach data, and why data-driven pipelines are essential to scaling your DataOps with this blog post: Treat Data with the Rigor of Code by Building Datum-Centric Pipelines.
Keep track of your versioned data better with Pachyderm. Featuring best-in-class automated data versioning, it gives your teams more control over data management. File-based versioning offers a complete audit trail across data pipelines at any stage.