Pachyderm Overview

Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated data transformations.

Data-driven pipelines are triggered automatically when data changes are detected.

Immutable data lineage with data versioning of any data type. 

Autoscaling and parallel processing built on Kubernetes for resource orchestration.

Uses standard object stores for data storage with automatic deduplication. 

Runs across all major cloud providers and on-premises installations.

Get Started with Pachyderm

Choose the edition that is right for your team and organization.
For more details, see our pricing page.

Community

For small teams who prefer to build and support their own software.

Free

Complete data-driven pipeline solution with data versioning and data lineage.

Enterprise

For organizations and teams that require advanced features and unlimited potential.

Contact Us

Community Edition with all limitations removed.
Plus SSO, RBAC, and IDM integration.

Cost-Effective Scalability

Deliver reliable results while optimizing resource utilization and maximizing developer efficiency.

Run complex data pipelines and sophisticated data transformations with autoscaling and parallelism.

Deduplication of data and code saves infrastructure costs.

Automatic Data-Driven Pipelines

Automatically trigger pipelines based on data changes.

Orchestrate batch or real-time data pipelines from any data source.

Diff-based automation, like a CI/CD system but for data.

Data and Process Deduplication

Versioned data is automatically deduplicated.

Intelligently process only modified data and dependencies. 

Track every change to your data and pipelines automatically as you work. 
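
The two ideas above, storing identical content once and reprocessing only what changed, can be sketched in a few lines of Python. This is an illustrative toy, not Pachyderm's implementation: the names `ContentStore`, `commit`, and `changed_paths` are hypothetical, and the sketch assumes content is deduplicated by its SHA-256 digest.

```python
import hashlib


class ContentStore:
    """Toy content-addressed store: identical bytes are kept only once."""

    def __init__(self):
        self.blobs = {}  # digest -> bytes

    def put(self, data: bytes) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self.blobs.setdefault(digest, data)  # no-op if the content already exists
        return digest


def commit(store: ContentStore, files: dict) -> dict:
    """Return a snapshot mapping each path to the digest of its content."""
    return {path: store.put(data) for path, data in files.items()}


def changed_paths(old: dict, new: dict) -> list:
    """Paths that are new, or whose digest differs from the old snapshot."""
    return sorted(p for p, d in new.items() if old.get(p) != d)


store = ContentStore()
v1 = commit(store, {"a.csv": b"1,2,3", "b.csv": b"4,5,6"})
v2 = commit(store, {"a.csv": b"1,2,3", "b.csv": b"4,5,7", "c.csv": b"1,2,3"})

# Only b.csv (modified) and c.csv (new path) need processing; a.csv is skipped.
to_process = changed_paths(v1, v2)
print(to_process, len(store.blobs))
```

In this sketch, five file versions across two snapshots occupy only three stored blobs, because `a.csv` is unchanged and `c.csv` duplicates its content.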

Autoscaled Parallelized Processing

Autoscale jobs up and down based on resource demand.

Automatically parallelized processing of large data sets.

Full process visibility and monitoring using Kubernetes-native tools.
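
As a rough illustration of parallelized processing, the sketch below splits a data set into independent chunks and hands them to a pool of workers. It is a single-machine analogy using Python's standard library, not how Pachyderm schedules work on Kubernetes; `process_chunk` and the sample chunks are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def process_chunk(chunk: str) -> int:
    """Hypothetical per-chunk transformation: count the words in one chunk."""
    return len(chunk.split())


# Each chunk is independent, so the chunks can be processed concurrently.
chunks = ["alpha beta", "gamma", "delta epsilon zeta"]

with ThreadPoolExecutor(max_workers=3) as pool:
    # pool.map returns results in the same order as the input chunks.
    counts = list(pool.map(process_chunk, chunks))

total = sum(counts)
print(counts, total)
```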

Reproducibility

Ensure reproducibility and compliance via immutable data lineage and data versioning for any type of data.

Increase team efficiency and collaboration via a Git-like structure of commits, branches, and data repositories.

Immutable Data Lineage

All data and pipeline code are versioned, providing an immutable record of all activities and assets.

Track any result all the way back to its raw input. 

Full versioning for metadata including all analysis, parameters, artifacts, models, and intermediate results. 

Data Version Control

Automatic and intelligent versioning of even the largest data sets of unstructured and structured data. 

Git-like structure enables effective team collaboration. 

Diff between two commits of data to debug data, code, or model failures more efficiently.
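
To make the idea of diffing two data commits concrete, here is a minimal sketch that compares two snapshots and reports which paths were added, removed, or modified. It assumes each commit can be represented as a mapping from file path to a content digest; the function `diff_commits` and the sample digests are hypothetical and do not reflect Pachyderm's API.

```python
def diff_commits(old: dict, new: dict) -> dict:
    """Compare two snapshots that map file paths to content digests."""
    added = sorted(new.keys() - old.keys())
    removed = sorted(old.keys() - new.keys())
    modified = sorted(p for p in old.keys() & new.keys() if old[p] != new[p])
    return {"added": added, "removed": removed, "modified": modified}


# Hypothetical snapshots; the digests are shortened placeholders.
commit_a = {"train.csv": "9f2c", "labels.csv": "41aa", "notes.txt": "77b0"}
commit_b = {"train.csv": "c310", "labels.csv": "41aa", "eval.csv": "5d9e"}

changes = diff_commits(commit_a, commit_b)
print(changes)
```

A diff like this narrows debugging to the files that actually changed between a good run and a failing one.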

Flexibility 

Leverage your infrastructure investments and run on your existing cloud or on-premises infrastructure.

Run against any type, size, or scale of data in both batch and real-time pipelines.

Support effective team collaboration through a Git-like structure of commits.

Code and Data Agnostic

Container-native pipelines empower developer autonomy.

Use any languages or libraries that are best for the job.

Seamlessly ingest from streaming, real-time, or batch data sources.  

Infrastructure Agnostic

Runs in all major cloud providers and on-premises data centers.

Integrates with existing tools – CI/CD, logging, auth, and data APIs.

Integrates with standard data processing and machine learning tools. 

Composability 

Easily share data sets or pipelines across teams or use cases.

Make any process data-driven by subscribing to data repo changes.

Microservices-like approach increases reuse and collaboration.

Console 

Console is a complete web UI for visualizing running pipelines and exploring your data.

Map out the overall structure and flow of all pipelines.

View repositories, commit histories, and preview data directly in your browser.

Follow job statuses, pipeline processes, and execution history.

Notebook

JupyterLab mount extension that selectively maps the contents of data repositories right into your Jupyter environment.

Ideal for data scientists to explore and analyze data.

Run and test pipeline code against versioned data.

Create reliable, shareable development environments.

Enterprise Administration

Robust tools for deploying and administering Pachyderm at scale across different teams in your organization.

Centralized licensing and administration of all clusters.

Authentication against any OIDC provider.

Role-based access control (RBAC) support for governance and data privacy.

See Pachyderm In Action

Watch a short 5-minute demo that shows the product in action.

Try Pachyderm Today

Request a Demo