
Data Pipelines for AWS

Harness the power and elasticity of Amazon Web Services to automate data transformations with data versioning and lineage. Pachyderm participates in the AWS ISV Accelerate program and runs seamlessly on AWS services such as Elastic Kubernetes Service (EKS), Simple Storage Service (S3), Relational Database Service (RDS), Elastic Block Store (EBS), AWS Fargate, and more.

Automate Complex Pipelines with Sophisticated Data Transformations

[Diagram: original data sets (V2) in an object store feed a Kubernetes pipeline that is listening for changes]

Transformation Code (V1)

  import os

  import cv2
  from matplotlib import pyplot as plt

  # edges.py reads an image and writes an edge-detected copy to /pfs/out
  def make_edges(image):
      img = cv2.imread(image)
      tail = os.path.split(image)[1]
      edges = cv2.Canny(img, 100, 200)
      plt.imsave(os.path.join("/pfs/out", os.path.splitext(tail)[0] + '.png'),
                 edges, cmap='gray')

  # walk the images directory and call make_edges on every file found
  for dirpath, dirs, files in os.walk("/pfs/images"):
      for file in files:
          make_edges(os.path.join(dirpath, file))

[Diagram: transformed data sets (V2) written back to an object store; the pipeline keeps listening for changes]

Automatic Detection

Data-driven pipelines trigger automatically when they detect changes in the input data, as in the sketch below.
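
For example, here is a minimal sketch of wiring the edges.py transformation above to an input repo so that every new commit triggers a run. It assumes the python_pachyderm client and the pachyderm/opencv image from Pachyderm's OpenCV tutorial; the exact API can differ across client versions.

  import python_pachyderm

  # Connect to the cluster (defaults to localhost:30650; pass host/port for EKS).
  client = python_pachyderm.Client()

  # A versioned repo that will hold the input images.
  client.create_repo("images")

  # A pipeline that runs edges.py against every commit to "images".
  # Pachyderm watches the repo and re-runs the transform whenever data changes.
  client.create_pipeline(
      "edges",
      transform=python_pachyderm.Transform(
          cmd=["python3", "/edges.py"],
          image="pachyderm/opencv",  # assumed image containing edges.py
      ),
      input=python_pachyderm.Input(
          pfs=python_pachyderm.PFSInput(glob="/*", repo="images")
      ),
  )

Committing a new image to the images repo is then all it takes to produce a fresh, versioned set of edge-detected outputs.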

Version Control

Automatic immutable data lineage and data versioning of all data types.

Autoscaling

Autoscaling and parallel processing built on Kubernetes for resource orchestration.

Automatic Deduplication

Uses standard object stores for data storage with automatic deduplication.

Cloud & On-prem

Runs on all major cloud providers as well as on-premises installations.

Pachyderm Editions

Pachyderm is available in two editions, Enterprise and Community. Choose the edition that is right for your use case. Read more.

Enterprise

For organizations that require advanced features and unlimited potential.

  • Unlimited Data-Driven Pipelines
  • Unlimited Parallel Processing
  • Role Based Access Controls (RBAC)
  • Pluggable Authentication
  • Enterprise Support
Contact Sales · 30-day free trial

Community

For small teams that prefer to build and support their own software.

Complete data-driven pipeline solution with data versioning and data lineage.

Free Download

Key Features of Pachyderm

Pachyderm is cost-effective at scale and enables data engineering teams to automate complex pipelines with sophisticated data transformations.

Scalability

Deliver reliable results faster and maximize developer efficiency.

Automated diff-based data-driven pipelines.

Deduplication of data saves infrastructure costs.

Reproducibility

Immutable data lineage ensures compliance.

Data versioning of all data types and metadata. 

Familiar git-like structure of commits, branches, & repos (see the sketch below).
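
As a rough sketch (again assuming the python_pachyderm client; the file name is hypothetical), committing data looks much like committing code, and every version stays addressable:

  import python_pachyderm

  client = python_pachyderm.Client()

  # Each closed commit on the "master" branch is immutable, like a git commit.
  with client.commit("images", "master") as commit:
      with open("liberty.png", "rb") as f:
          client.put_file_bytes(commit, "/liberty.png", f.read())

  # Walk the commit history, git-log style.
  for info in client.list_commit("images"):
      print(info.commit.id)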

Flexibility

Leverage existing infrastructure investment.

Language agnostic - use any language to process data.

Data agnostic - unstructured, structured, batch, & streaming.

Recommended Reading

Read the Blog

Autoscaling Pipelines on AWS


Read the Docs

Getting started: Deploy to AWS.

Learn more on our documentation site, which covers getting started with Pachyderm on AWS.

Read the Docs

Deploy Pachyderm on AWS

This covers how to deploy a Pachyderm cluster on Amazon Elastic Kubernetes Service (EKS).

See Pachyderm In Action

Watch a short demo that shows the product in action.

Data Pipeline

Transform your data pipeline

Learn how companies around the world are using Pachyderm to automate complex pipelines at scale.

Request a Demo