Pachyderm has been acquired by Hewlett Packard Enterprise.

Video and Image ETL at Scale

Video and image ETL involves large unstructured data sets, which can create bottlenecks for teams as they move pipelines into production and scale them.


Breast Cancer Detection

In the example below we show how to create a scalable pipeline for breast cancer detection. 

There are different ways to scale inference pipelines with deep learning models. We implement two methods here with Pachyderm: data parallelism and task parallelism.

  • In data parallelism, we split the data (in our case, breast exams) so that each piece is processed independently in a separate processing job.
  • In task parallelism, we separate the CPU-based preprocessing from the GPU-based tasks, so that GPUs are used only where they are needed, which lowers cloud costs when scaling.
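
To make the two ideas concrete, here is a minimal Python sketch. It is not Pachyderm's API and not the code from the GitHub example. The function names (`preprocess`, `infer`, `run_pipeline`), the exam IDs, and the placeholder score are all hypothetical, and thread pools simply stand in for separately scaled workers.

```python
from concurrent.futures import ThreadPoolExecutor


def preprocess(exam_id: str) -> dict:
    """Hypothetical CPU-bound stage (stands in for cropping, normalizing, etc.)."""
    return {"exam": exam_id, "pixels": [len(exam_id)] * 4}


def infer(prepared: dict) -> dict:
    """Hypothetical GPU-bound stage (a placeholder score, not a real model)."""
    score = sum(prepared["pixels"]) / 100.0
    return {"exam": prepared["exam"], "score": score}


def run_pipeline(exam_ids, cpu_workers=4, gpu_workers=1):
    # Data parallelism: every exam is an independent unit of work,
    # so the exams can be spread across the workers of a pool.
    # Task parallelism: the CPU stage and the GPU stage get separate
    # pools, so each one can be sized (and paid for) independently.
    with ThreadPoolExecutor(max_workers=cpu_workers) as cpu_pool:
        prepared = list(cpu_pool.map(preprocess, exam_ids))
    with ThreadPoolExecutor(max_workers=gpu_workers) as gpu_pool:
        return list(gpu_pool.map(infer, prepared))


if __name__ == "__main__":
    for result in run_pipeline(["exam-001", "exam-002", "exam-003"]):
        print(result)
```

In this sketch, adding more exams only requires more CPU workers for preprocessing, while the scarcer GPU workers stay reserved for inference.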

View the code on GitHub

Key Features of Pachyderm

Pachyderm is cost-effective at scale and enables data engineering teams to automate complex pipelines with sophisticated data transformations.


Delivering reliable results faster maximizes developer efficiency.

Automated, diff-based, data-driven pipelines.

Deduplication of data saves infrastructure costs.


Immutable data lineage ensures compliance.

Data versioning of all data types and metadata. 

Familiar git-like structure of commits, branches, & repos.


Leverage existing infrastructure investment.

Language agnostic - use any language to process data.

Data agnostic - unstructured, structured, batch, & streaming.


Transform your data pipeline

Learn how companies around the world are using Pachyderm to automate complex pipelines at scale.
