Data engineers have many tools to choose from for creating and automating data pipelines. Pachyderm and Airflow are both popular solutions in this category because they eliminate manual bottlenecks and accelerate time to data insight. As a result, MLOps practitioners often find themselves comparing the two. Let's examine the critical differences between these solutions and identify which use cases favor one over the other.
Data Pipelines: Airflow and Pachyderm
Apache Airflow is an open-source, batch-oriented data pipeline solution written in Python. It originated at Airbnb to help the company manage complex workflows. Users define workflows as directed acyclic graphs (DAGs) in Python scripts; Airflow then schedules and executes each workflow's tasks based on a time interval or an event. Fundamentally a data pipeline tool, Airflow excels at scheduling and executing a series of tasks and their associated dependencies.
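Conceptually, an Airflow DAG is just a set of tasks executed in dependency order. Here is a minimal sketch of that idea in plain Python — a toy scheduler with hypothetical task names, not Airflow's actual API:

```python
# Toy sketch of DAG scheduling (plain Python, not Airflow code):
# a task runs only after all of its upstream dependencies have run.
from graphlib import TopologicalSorter

# Hypothetical workflow: each task maps to its upstream dependencies.
dag = {
    "extract": [],
    "transform": ["extract"],
    "load": ["transform"],
    "report": ["load"],
}

def run_dag(dag):
    """Execute tasks in an order that respects every dependency."""
    executed = []
    for task in TopologicalSorter(dag).static_order():
        executed.append(task)  # a real scheduler would invoke the task here
    return executed

execution_order = run_dag(dag)
```

Airflow adds scheduling, retries, and distributed execution on top of this core ordering idea.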
Pachyderm is a data pipelining and MLOps solution written in Go. At its core, Pachyderm is data-centric, with automatic triggering of pipelines, data and pipeline versioning, data deduplication, parallelization, and incremental data processing. Unlike many machine learning operations (MLOps) solutions, Pachyderm supports both structured and unstructured data. Users can shard their data and elastically spin up workers to distribute the processing of an individual task or transformation across multiple machines.
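The sharding idea can be sketched in a few lines of Python — a toy illustration using thread workers, not Pachyderm's actual distribution mechanism:

```python
# Toy sketch (not Pachyderm's implementation): shard a dataset, run one
# transformation across several workers in parallel, then reassemble.
from concurrent.futures import ThreadPoolExecutor

def transform(shard):
    # Stand-in for the per-shard transformation each worker runs.
    return [x * 2 for x in shard]

def run_sharded(data, num_shards=4):
    # Round-robin split into shards, one per worker.
    shards = [data[i::num_shards] for i in range(num_shards)]
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        results = list(pool.map(transform, shards))
    # Reassemble shard outputs back into the original order.
    out = [None] * len(data)
    for i, shard_out in enumerate(results):
        out[i::num_shards] = shard_out
    return out

doubled = run_sharded(list(range(10)))
```

In Pachyderm, this splitting and reassembly is handled by the platform rather than by user code.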
Critical Comparison: Airflow vs. Pachyderm
GitHub popularity
- Airflow: 11.2K forks and 27.5K stars.
- Pachyderm: 536 forks and 5.6K stars.

Deployment
- Airflow: self-managed, or hosted through Astronomer, Google, and AWS.
- Pachyderm: self-managed in the cloud or on-premises.

Distributed execution
- Airflow: Celery and Kubernetes executors.
- Pachyderm: Kubernetes-native.

Language support
- Airflow: Python only; there is no easy way to plug in other languages.
- Pachyderm: language and framework agnostic; transformation code runs inside Docker containers.

GUI
- Airflow: yes, through DAG views.
- Pachyderm: yes, through Console.

Data-driven pipelines (triggered when data is changed or added)
- Airflow: DAGs can be chained together, but changing or adding data will not automatically trigger a pipeline run.
- Pachyderm: can automatically trigger a pipeline run when data is added or changed.

Data versioning (used to reproduce a particular outcome)
- Airflow: no versioning of data; DAGs are versioned with a source code management system such as GitHub.
- Pachyderm: versions data and pipelines and provides data lineage natively; data versions are stored in any cloud or on-premises object store.

Data storage deduplication (reduces processing time, storage, and costs)
- Airflow: N/A; no data management.
- Pachyderm: natively dedupes data before processing.

Data lineage
- Airflow: does not capture any data lineage, metadata, or data versioning information.
- Pachyderm: captures lineage with complete versioning of data, pipelines, and transformation code.

Incremental data processing (reduces processing time and costs)
- Airflow: none; all data is processed in every run.
- Pachyderm: identifies what data was changed and processes only the diff.

Parallelization and distributed processing
- Airflow: each task is managed by one worker/machine, though users can scale the number of workers to run multiple tasks concurrently.
- Pachyderm: each task or transformation can be distributed or sharded across multiple workers/machines and then reassembled; workers/machines can also run tasks concurrently.

Data source support
- Airflow: batch only; not recommended for streaming data sources.
- Pachyderm: batch and streaming data sources supported.

Interfaces
- Airflow: CLI, GUI, and REST API.
- Pachyderm: CLI (pachctl), Console GUI, and client APIs.
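Two of the rows above — deduplication and incremental processing — boil down to content addressing: hash each piece of data, store each unique piece once, and reprocess only hashes you have not seen before. A toy sketch of the idea (not Pachyderm's internals):

```python
# Toy sketch of content-addressed deduplication and incremental
# processing (illustrative only, not Pachyderm's implementation).
import hashlib

store = {}          # content-addressed store: hash -> bytes
processed = set()   # hashes already processed in earlier commits

def put(chunk: bytes) -> str:
    """Store a chunk keyed by its content hash; identical chunks dedupe."""
    h = hashlib.sha256(chunk).hexdigest()
    store.setdefault(h, chunk)
    return h

def process_commit(chunks):
    """Process only the chunks not seen in a previous commit (the diff)."""
    newly_processed = []
    for chunk in chunks:
        h = put(chunk)
        if h not in processed:
            processed.add(h)
            newly_processed.append(chunk)
    return newly_processed

first = process_commit([b"a", b"b"])         # both chunks are new
second = process_commit([b"a", b"b", b"c"])  # only the diff is processed
```

Because unchanged chunks hash to the same value, the second commit stores and processes only the new data — the source of the time and cost savings described above.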
When to use Airflow
Airflow works best with workflows that are mostly static and slowly changing (on the order of days or weeks, not hours or minutes). It is a good fit when there is already a strong handle on data management for ML and tools are in place for versioning, lineage, and so on; data versioning simply isn't important for some projects. Airflow is not built to move large quantities of data from one task or transformation to the next. According to the project's readme, the best practice is to "delegate high-volume, data-intensive tasks to an external service that specializes in that type of work". Also, "Airflow is not a streaming solution, but it is often used to process real-time data, pulling data off streams in batches."
The most common use cases for Airflow are ETL pipelines that extract batch data from multiple sources and run a transformation or Spark job. Other uses include the automatic generation of reports, running backups, or ingesting web logs into a database.
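The ETL pattern described above can be sketched in a few lines — hypothetical sources and a hypothetical filter, independent of Airflow's operator API:

```python
# Minimal sketch of a batch ETL pipeline (illustrative stand-in
# sources; not tied to Airflow's or Pachyderm's APIs).
def extract():
    # Pull batches from two stand-in sources.
    source_a = [{"user": "ann", "clicks": 3}]
    source_b = [{"user": "bob", "clicks": 5}]
    return source_a + source_b

def transform(rows):
    # Example transformation: keep only users with more than 3 clicks.
    return [r for r in rows if r["clicks"] > 3]

def load(rows, db):
    # Write the transformed rows into a stand-in warehouse.
    for r in rows:
        db[r["user"]] = r["clicks"]
    return db

warehouse = load(transform(extract()), {})
```

In Airflow, each of these three functions would typically become its own task in a DAG, so failures can be retried per stage.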
When to use Pachyderm
Pachyderm can also be used as a general data pipelining solution, but it works best in data science, data engineering, and machine learning operations, which require a higher level of functionality and scale. MLOps projects often require data engineers to reproduce every version of data, metadata, parameters, models, and code throughout the ML lifecycle. Engineers often add data to large, terabyte-sized datasets, which demands a data-driven solution that automatically triggers DAGs, processes data incrementally, and distributes a task across multiple workers. Pachyderm supports all of these requirements and is flexible enough to work with multiple programming languages and frameworks, as well as structured and unstructured data.
Both Apache Airflow and Pachyderm are excellent data pipelining solutions. If your use cases are limited to moving batches of data through a series of processing steps, stick with Airflow. If your organization is expanding its data pipelining efforts to support machine learning, Pachyderm may be a better fit. And if you already have a large investment in Airflow DAGs, you don't have to rewrite them in Pachyderm, because Pachyderm complements Airflow: you can use Pachyderm to kick off Airflow DAGs and gain benefits such as automatic data processing, reproducibility, and incremental data processing.
Pachyderm has many customers who have successfully migrated from Airflow by leveraging our support team, best practices, and quick-start guide. Interested in learning more? Take the next step and schedule a demo tailored to your environment.