The Pachyderm Pipeline System (PPS) is a containerized processing engine for versioned data. To learn how the data itself is version controlled, read about the Pachyderm File System (PFS).
Data and Code were meant to be unified. Containerizing them together unlocks Reproducibility and Collaboration for your team.
Running your code in a container and accessing the data through Pachyderm’s version control system (PFS) guarantees that the analysis is Reproducible. And because it’s just a container, you can use any language or libraries you want.
Reproducibility is a prerequisite for true Collaboration. By enabling Reproducibility with containers, Pachyderm allows each team member to develop data analysis locally and then push the same code, unchanged, into a production cluster.
Data Access Model
In each container that runs your analysis, the data is exposed through the Pachyderm File System (PFS) interface as a local file system. To access your data from within the container, you read files from the input directory and write results to the output directory.
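As a rough illustration of this access model, the sketch below reads every file under an input directory and writes one result file per input to an output directory. The function name `process` and the line-counting analysis are hypothetical, and the directory paths are passed in as parameters because the actual mount points are not specified here.

```python
from pathlib import Path


def process(input_dir: str, output_dir: str) -> int:
    """Count the lines in each input file and write the count to a
    file with the same relative path under the output directory.

    Both directories are plain local paths; in a container they would be
    whatever locations the file system interface exposes (an assumption).
    Returns the number of files processed.
    """
    in_path = Path(input_dir)
    out_path = Path(output_dir)
    out_path.mkdir(parents=True, exist_ok=True)

    processed = 0
    for src in sorted(p for p in in_path.rglob("*") if p.is_file()):
        line_count = len(src.read_text().splitlines())
        dest = out_path / src.relative_to(in_path)
        dest.parent.mkdir(parents=True, exist_ok=True)
        dest.write_text(f"{line_count}\n")
        processed += 1
    return processed
```

The analysis code only ever touches ordinary files, so the same function can be run on a laptop against a local folder and inside a container against the mounted data.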
Pachyderm parallelizes your computation by showing only a subset of your data to each container. A single node sees either a slice of each file (a map job) or a single whole file (a reduce job). The data itself lives in an object store of your choice (usually S3 or GCS), and PPS assigns different pieces of the data to different containers for processing.
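The difference between the two views can be sketched with an in-memory model. This is only an illustration: files are represented as lists of lines, and the round-robin assignment used here is an assumption, not a description of how PPS actually divides data. The names `map_partition` and `reduce_partition` are hypothetical.

```python
from typing import Dict, List

Files = Dict[str, List[str]]


def map_partition(files: Files, shards: int) -> List[Files]:
    """Map-style view: every shard sees a slice of each file."""
    views: List[Files] = [{} for _ in range(shards)]
    for name, lines in files.items():
        for i in range(shards):
            # Shard i takes every `shards`-th line, starting at line i.
            views[i][name] = lines[i::shards]
    return views


def reduce_partition(files: Files, shards: int) -> List[Files]:
    """Reduce-style view: each file goes, whole, to exactly one shard."""
    views: List[Files] = [{} for _ in range(shards)]
    for index, name in enumerate(sorted(files)):
        views[index % shards][name] = files[name]
    return views
```

In the map-style view every shard holds an entry for every file but only part of its contents; in the reduce-style view a shard holds fewer files but each one is complete.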
Jobs & Pipelines
Each Job takes a Pachyderm Data Repository (PDR) as input and the results of the job are stored in an output PDR. In this way, every atomic processing step is versioned, which means the intermediate state of the data is always stored.
Pipelines keep your processing in sync with your data. A Pipeline is simply a Job that subscribes to a Pachyderm Data Repository. When new input data is available in the repository, the Pipeline will automatically process the new data and update its output PDR.
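The subscription behavior can be modeled in a few lines. The sketch below is a toy in-memory simulation, not Pachyderm's API: the `Repo` and `Pipeline` classes, their methods, and the representation of a commit as a dictionary of file contents are all hypothetical.

```python
from typing import Callable, Dict, List

Commit = Dict[str, str]  # file path -> file contents


class Repo:
    """A toy data repository that notifies subscribers on each commit."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.commits: List[Commit] = []
        self._subscribers: List[Callable[[Commit], None]] = []

    def subscribe(self, callback: Callable[[Commit], None]) -> None:
        self._subscribers.append(callback)

    def commit(self, files: Commit) -> None:
        self.commits.append(dict(files))
        for callback in self._subscribers:
            callback(files)


class Pipeline:
    """A job that subscribes to an input repo and writes to its own output repo."""

    def __init__(self, name: str, input_repo: Repo,
                 transform: Callable[[str], str]) -> None:
        self.output = Repo(name)
        self.transform = transform
        input_repo.subscribe(self._on_commit)

    def _on_commit(self, files: Commit) -> None:
        # Process only the newly committed files and commit the results.
        results = {path: self.transform(data) for path, data in files.items()}
        self.output.commit(results)
```

For example, after `raw = Repo("raw")` and `upper = Pipeline("upper", raw, str.upper)`, calling `raw.commit({"a.txt": "hello"})` leaves a matching commit in `upper.output` without any explicit call to run the pipeline.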
Composing pipelines together creates a Directed Acyclic Graph (DAG) of processing steps.
Since Pipelines are only triggered by new data, only the relevant parts of your DAG will run when new data is present.
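To make the "only the relevant parts" idea concrete, the sketch below computes which pipelines would be triggered when one repository receives new data. It assumes a simplified model in which each pipeline writes to an output repository of the same name; the function `triggered_pipelines` and the example repository names are hypothetical.

```python
from collections import deque
from typing import Dict, List


def triggered_pipelines(dag: Dict[str, List[str]], changed_repo: str) -> List[str]:
    """Return the pipelines downstream of `changed_repo`, in breadth-first order.

    `dag` maps each pipeline name to the repos it reads from. Each pipeline
    is assumed to write to an output repo with the same name as the pipeline.
    """
    # Invert the mapping: repo -> pipelines that consume it.
    consumers: Dict[str, List[str]] = {}
    for pipeline, inputs in dag.items():
        for repo in inputs:
            consumers.setdefault(repo, []).append(pipeline)

    order: List[str] = []
    seen = set()
    queue = deque([changed_repo])
    while queue:
        repo = queue.popleft()
        for pipeline in consumers.get(repo, []):
            if pipeline not in seen:
                seen.add(pipeline)
                order.append(pipeline)
                queue.append(pipeline)  # its output repo may feed others
    return order
```

With `dag = {"clean": ["logs"], "stats": ["clean"], "users_clean": ["users"], "report": ["stats", "users_clean"]}`, new data in `users` reaches only `users_clean` and `report`; the `clean` and `stats` steps are left alone.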
And since each processing step utilizes Pachyderm Data Repositories, you have complete confidence in your data’s Provenance. If you have a surprising result, you can debug or validate it by understanding its historical processing steps.
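Tracing a result back through its processing steps amounts to walking the DAG upstream. The sketch below assumes each output commit records the commits it was computed from; the `provenance` function and the `repo@number` commit labels are hypothetical notation used only for this example.

```python
from collections import deque
from typing import Dict, List


def provenance(parents: Dict[str, List[str]], commit: str) -> List[str]:
    """Return every commit upstream of `commit`, nearest ancestors first.

    `parents` maps a commit label to the commits it was computed from.
    Commits with no entry are treated as raw input data.
    """
    lineage: List[str] = []
    seen = set()
    queue = deque(parents.get(commit, []))
    while queue:
        current = queue.popleft()
        if current in seen:
            continue
        seen.add(current)
        lineage.append(current)
        queue.extend(parents.get(current, []))
    return lineage
```

Given a surprising value in a report commit, the returned list names each intermediate and raw commit that contributed to it, which is the set of stored states you would inspect while debugging.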
The number of nodes can be scaled per Pipeline. That means you can spend your firepower on the parts of your DAG that require it.