Data Versioning – Comparing DVC with Pachyderm

Machine learning (ML) has grown explosively in recent years. As the field matures, versioning is becoming increasingly important for reproducibility and rollback. ML development relies on continuous iteration: developers search for the best-performing model by changing data, code, and pipelines. By capturing versions and the corresponding data lineage, it’s possible to recreate or revert an ML model easily. There are many approaches to versioning. This post looks at two of them: DVC, from Iterative, and Pachyderm.

Overview: DVC and Pachyderm

Data Version Control (DVC) is an open-source data versioning tool written in Python. Created by Iterative, DVC works on top of Git (and Git hosts such as GitHub, GitLab, and Bitbucket) to version data, code, pipelines, and metrics. Because it’s built on Git and borrows Git’s commands and concepts, it’s easy to pick up if you’re already familiar with Git. You can share datasets and models and collaborate with your team using remote storage (Google Drive, object storage, and more).
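To make the "version data through Git" idea concrete, here is a minimal sketch of content-addressed tracking: the data file is copied into a cache under its checksum, and a small pointer file (the part you would commit to Git) records that checksum. This is a simplified illustration, not DVC's implementation. The function names, the `.pointer.json` suffix, and the cache layout are assumptions made for the example.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def file_md5(path: Path) -> str:
    """Return the MD5 hex digest of a file, read in 1 MiB blocks."""
    digest = hashlib.md5()
    with path.open("rb") as handle:
        for block in iter(lambda: handle.read(1 << 20), b""):
            digest.update(block)
    return digest.hexdigest()


def track(data_file: Path, cache_dir: Path) -> Path:
    """Copy data_file into a content-addressed cache and write a pointer file.

    The pointer file is the small artifact that would be committed to Git;
    the data itself lives in the cache. (Hypothetical layout for illustration.)
    """
    checksum = file_md5(data_file)
    cached = cache_dir / checksum[:2] / checksum[2:]
    cached.parent.mkdir(parents=True, exist_ok=True)
    if not cached.exists():  # identical content is stored only once
        cached.write_bytes(data_file.read_bytes())
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "md5": checksum}))
    return pointer


def demo() -> tuple:
    """Track two versions of one file; return both checksums and the cache size."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        cache = root / "cache"
        data = root / "train.csv"
        data.write_text("id,label\n1,cat\n")
        v1 = json.loads(track(data, cache).read_text())["md5"]
        data.write_text("id,label\n1,cat\n2,dog\n")
        v2 = json.loads(track(data, cache).read_text())["md5"]
        cached_objects = sum(1 for p in cache.rglob("*") if p.is_file())
        return v1, v2, cached_objects
```

Calling `demo()` tracks two versions of the same file. Each version gets its own checksum and its own object in the cache, while the pointer file stays a few dozen bytes regardless of how large the data is.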

Pachyderm is a data versioning and data pipelining tool, written in Go, that can be used for many different applications. At its core, Pachyderm is data-centric, with auto-triggering of pipelines, data/pipeline/code versioning, data deduplication, parallelization, and incremental data processing. Unlike many machine learning operations (MLOps) solutions, Pachyderm supports both structured and unstructured data. In Pachyderm pipelines, users can easily shard their data and elastically spin up workers to distribute the processing done by their code across multiple machines.
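The shard-and-distribute pattern can be sketched in a few lines. This is only an illustration of the idea, assuming threads on one machine as a stand-in for workers on separate machines. It does not use Pachyderm's API, and `shard`, `count_words`, and `process_in_parallel` are hypothetical names.

```python
from concurrent.futures import ThreadPoolExecutor


def shard(items: list, num_shards: int) -> list:
    """Split items into num_shards shards, assigning items round-robin."""
    return [items[i::num_shards] for i in range(num_shards)]


def count_words(lines: list) -> int:
    """Stand-in for user code: count the words in one shard."""
    return sum(len(line.split()) for line in lines)


def process_in_parallel(items: list, num_workers: int) -> int:
    """Process each shard on its own worker, then reassemble the results."""
    shards = shard(items, num_workers)
    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        partial_counts = list(pool.map(count_words, shards))
    return sum(partial_counts)
```

For example, `process_in_parallel(["a b c", "d e", "f", "g h i j"], 3)` splits the four lines across three workers and sums the partial word counts.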

Critical Comparison: DVC vs. Pachyderm

 

| Feature | DVC | Pachyderm |
| --- | --- | --- |
| Community | 987 forks and 10.5K GitHub stars. | 536 forks and 5.6K GitHub stars. |
| Deployment options | Self-managed. | Self-managed. |
| Ease of use | Easy to get started locally. Mimics Git commands for data versioning. Setting up remote storage takes infrastructure knowledge. | Built on Kubernetes and object storage; can have a learning curve for someone new to Kubernetes. |
| Storage options | Supports many remote storage types (AWS S3, Azure Blob Storage, Google Cloud Storage, Google Drive, SSH, HDFS, WebHDFS, HTTP, WebDAV, and local). | Any object storage (AWS S3, Azure Blob Storage, Google Cloud Storage) and local. |
| Programming languages supported | Language and framework agnostic. | Language and framework agnostic. Transformations or code run inside Docker containers. |
| Scalable for production use | No, changing a single byte requires storing the whole dataset again as a separate version. Gets slow as versions of data and size of datasets grow. | Yes, has native petabyte scale, data deduplication, and data-triggered processing. |
| Data, pipeline and code versioning | Yes, versions your code and the DAG config that calls it. Dependencies and environment management are left to the user. | Yes, Docker images contain all code and dependencies. DAG config automatically captures data dependencies in Pachyderm data repos. |
| Automatic, immutable data versioning | No, manually done by submitting the add command. User must enforce immutability. | Yes, automatically creates immutable versions when user defines a DAG. |
| Data pipelining | Yes | Yes |
| Visualized DAGs in GUI | No | Yes, through Console. |
| Data-driven pipelines (triggers when data is changed or added) | No, pipeline steps are not auto-triggered. | Yes, deployed pipelines and DAGs can automatically trigger when data dependencies change. |
| Automatic data lineage | No, data lineage is manually compiled through DVC’s YAML and lock files. | Yes, data lineage is automatically captured with complete versioning of data, pipeline and code. |
| Parallelization or distributed processing | No. User code is executed locally on the user’s machine. | Yes. User data can be distributed or sharded across multiple workers/machines, processed and then reassembled. Workers/machines can also be used to run user code concurrently. |
| Data source supported | Batch only. | Batch and streaming data. |
| Interfaces | Command line, Python API and Visual Studio Code extension. | CLI, GUI, gRPC, Python, Golang, and JavaScript client. |
| Enterprise support | No, but users can get support by purchasing Iterative Studio, which includes DVC. | Yes, SLA-backed enterprise support available. |
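The storage trade-off behind the "Scalable for production use" row can be illustrated with a small sketch that compares keeping every version as a whole object against splitting versions into chunks and deduplicating the chunks. It assumes fixed-size chunks and a tiny chunk size purely for readability. It is not a description of how either tool lays out data internally.

```python
import hashlib

CHUNK_SIZE = 4  # bytes; deliberately tiny so the example is easy to follow


def whole_file_store(versions: list) -> int:
    """Bytes stored when every distinct version is kept as one whole object."""
    objects = {hashlib.sha256(v).hexdigest(): len(v) for v in versions}
    return sum(objects.values())


def chunked_store(versions: list) -> int:
    """Bytes stored when versions are split into chunks and chunks are deduplicated."""
    chunks = {}
    for version in versions:
        for start in range(0, len(version), CHUNK_SIZE):
            piece = version[start:start + CHUNK_SIZE]
            chunks[hashlib.sha256(piece).hexdigest()] = len(piece)
    return sum(chunks.values())


v1 = b"AAAABBBBCCCCDDDD"
v2 = b"AAAABBBBCCCCDDDE"  # one byte changed in the last chunk

print(whole_file_store([v1, v2]))  # 32: both 16-byte versions stored in full
print(chunked_store([v1, v2]))     # 20: four shared chunks plus one new chunk
```

With whole-object storage, a one-byte change doubles what is stored. With chunk-level deduplication, only the chunk that changed is added.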

When to use DVC

Because DVC is a wrapper around Git, it builds on a tool that data scientists and data engineers already understand. This makes it a good choice for small teams working on prototypes and non-data-intensive workloads. As with Git, users download datasets to their laptops and push changes to a remote repo for sharing. For the many ‘weekend warriors’ and individuals working on pet data science projects, this is a perfectly acceptable solution.

However, with production datasets quickly exceeding 20 GB, synchronizing data with a remote repo soon becomes untenable from a memory, bandwidth, and local storage perspective. In addition, without user discipline, the local and remote caches can easily get out of sync. DVC does allow for the management of external datasets, but the docs recommend against it. In summary, DVC does not scale to large data projects, which makes it better suited to experimentation and individual use.
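What "out of sync" means in practice can be shown by treating each cache as a set of object checksums and comparing the two. This is a minimal sketch with hypothetical names. It does not read real cache directories or call DVC.

```python
def cache_drift(local: set, remote: set) -> dict:
    """Compare two sets of object checksums and report what each side lacks."""
    return {
        "missing_on_remote": sorted(local - remote),  # created locally, never pushed
        "missing_locally": sorted(remote - local),    # pushed by someone else, never pulled
    }


drift = cache_drift(local={"a1", "b2"}, remote={"b2", "c3"})
print(drift)  # {'missing_on_remote': ['a1'], 'missing_locally': ['c3']}
```

Any entry under `missing_on_remote` is a version that exists only on one person's laptop, which is exactly the situation that requires user discipline to avoid.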

When to use Pachyderm

Pachyderm works best for data science, data engineering, or machine learning operations projects that require enterprise functionality and production scale (large datasets). MLOps projects often require data engineers to be able to reproduce every version of data, metadata, parameters, models and code throughout the ML lifecycle. Enterprises deploying ML models in a production setting need automatic versioning of data, code, and pipelines as well as auto-triggering of pipelines. They need to iterate faster by processing data incrementally and distributing computations across multiple workers, while facilitating collaboration between groups. Pachyderm supports all of these requirements and is flexible enough to work with multiple programming languages and frameworks as well as structured and unstructured data.
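Auto-triggering and incremental processing fit together naturally: each new commit of data triggers the pipeline, and the pipeline reruns user code only for inputs whose content changed. The sketch below illustrates that idea under simplifying assumptions (one in-memory dict per commit, one output per input file). The `IncrementalPipeline` class is hypothetical and is not Pachyderm's API.

```python
import hashlib


class IncrementalPipeline:
    """Reruns a transform only for inputs whose content changed since the last commit."""

    def __init__(self, transform):
        self.transform = transform
        self.seen = {}     # file name -> content hash from the last run
        self.outputs = {}  # file name -> transform output

    def on_commit(self, files: dict) -> list:
        """Handle a new data commit; return the names that were (re)processed."""
        processed = []
        for name, content in files.items():
            digest = hashlib.sha256(content).hexdigest()
            if self.seen.get(name) != digest:  # new or changed input
                self.outputs[name] = self.transform(content)
                self.seen[name] = digest
                processed.append(name)
        # Drop state for files that no longer exist in this commit.
        for name in list(self.seen):
            if name not in files:
                del self.seen[name]
                del self.outputs[name]
        return processed


pipeline = IncrementalPipeline(lambda content: content.upper())
first = pipeline.on_commit({"a.txt": b"x", "b.txt": b"y"})
second = pipeline.on_commit({"a.txt": b"x", "b.txt": b"z", "c.txt": b"w"})
print(first)   # ['a.txt', 'b.txt']
print(second)  # ['b.txt', 'c.txt']
```

On the second commit, `a.txt` is unchanged and is skipped, so the work done is proportional to what changed rather than to the size of the whole dataset.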

Some of the most popular use cases for Pachyderm include natural language processing, video and image processing, genomics analysis, data preparation, and fraud or risk assessment.

Summary

Both DVC and Pachyderm are excellent data versioning solutions with different approaches and intentions. Individual users will find DVC to be a great fit for small projects. However, some users will quickly outgrow DVC as their use case expands in complexity, such as graduating to larger datasets or requiring collaboration with larger teams. This is where Pachyderm shines, because it’s purpose-built for large-scale production environments with native versioning, role-based access controls and a shared, cloud-based environment for the entire organization.

Pachyderm has many customers who have successfully migrated from DVC by leveraging our support team, best practices and quick-start guide. Interested in learning more? Take the next step and schedule a demo tailored to your environment.