
22 Essential Pachyderm Commands

Data pipelines help simplify data transformation by abstracting the processing steps into a reproducible, containerized, scalable format. Pachyderm ensures that the pipelines you build are powerful, automated, and able to run complex data transformation jobs while remaining cost-effective.

Pachyderm pipelines trigger when new data enters their input repos, and Pachyderm processes that data incrementally. Since it only processes modified or newly added data, Pachyderm is an effective solution for reducing costs and optimizing performance at scale.

This guide gives you an overview of twenty-two pachctl commands that are essential for building a reproducible, containerized data pipeline with Pachyderm. To bring all the commands together in a true-to-life scenario, we’ll also run through a build of a simple word counter.

Key pachctl Commands to Know

The twenty-two commands in this guide are grouped according to the Pachyderm object they’re associated with, like pipelines, repos, and files.

Pipelines

Pipelines are Pachyderm’s method for abstracting reproducible and containerized tasks. Each pipeline listens to an input repository and writes data to an output repository once its job is done.

pachctl create pipeline

This command helps you create a pipeline in your Pachyderm instance from a pipeline specification:

pachctl create pipeline -f <url or location of spec file>
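A pipeline specification is a JSON (or YAML) file that names the pipeline, declares its input repo and glob pattern, and defines the containerized transform to run. Here’s a minimal sketch; the pipeline name, image, and command are chosen purely for illustration:

# Write a minimal, illustrative pipeline spec to disk
cat > my-pipeline.json <<'EOF'
{
  "pipeline": { "name": "my-pipeline" },
  "description": "Copies input files to the output repo unchanged.",
  "input": { "pfs": { "repo": "garden", "glob": "/*" } },
  "transform": {
    "image": "alpine:3.18",
    "cmd": ["sh", "-c", "cp -r /pfs/garden/* /pfs/out/"]
  }
}
EOF

# Create the pipeline from the spec file
pachctl create pipeline -f my-pipeline.json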

pachctl update pipeline

If you want to make a change in a pipeline, you need to update its specification file and run the following command:

pachctl update pipeline -f <url or location of spec file>
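For example, after editing the spec for the scraper pipeline used later in this guide, you could apply the change like this. The --reprocess flag, which also reprocesses data the pipeline has already handled, is optional:

# Apply changes from the edited spec file
pachctl update pipeline -f pipelines/scraper.json

# Optionally force the pipeline to reprocess data it has already processed
pachctl update pipeline -f pipelines/scraper.json --reprocess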

pachctl list pipeline

You can use this command to view a list of all pipelines in your Pachyderm instance.
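Run it on its own to see every pipeline:

pachctl list pipeline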

pachctl inspect pipeline

This command gets you metadata and the current state of a pipeline. You need to append it with a pipeline name to view its details:

pachctl inspect pipeline <pipeline name>

You can view data such as name, description, create time, available workers, input and output repo and branch, job details, and more.

pachctl stop pipeline

This command allows you to stop an active pipeline; once stopped, the pipeline will no longer be triggered by data changes in its input repo. As with the previous command, append the name of the pipeline you want to stop:

pachctl stop pipeline <pipeline name>

Stopped pipelines can be run again with the pachctl start pipeline command.
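For example, to pause and later restart the scraper pipeline used in the walkthrough later in this guide:

# Stop the pipeline; new commits to its input repo won't trigger jobs
pachctl stop pipeline scraper

# Start it again to resume processing
pachctl start pipeline scraper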

Repos

Repositories are a representation of a file system built on top of object storage, and pipelines use them to handle the input and output of data and results. Here are a few common commands associated with repos.

pachctl list repo

This command helps you list all user repos in a Pachyderm instance. It also lets you see information such as the repo creation date and size.
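Run it as-is, without any arguments:

pachctl list repo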

pachctl create repo

You can use this command to create a new repo. You’ll need to append a name for the repo when running this. You can also add an optional description of the repo:

pachctl create repo garden -d "Input repo for the main pipeline"

pachctl delete repo

You can delete an existing repo using this command:

pachctl delete repo garden

You can use the --all flag to delete all repos at once:

pachctl delete repo --all

Files

A file is the lowest-level data object in Pachyderm. It’s used to store data and can be any file type. You need to access and modify files to provide input to and connectivity between your pipelines.

pachctl put file

This command copies a file from your local working directory or from a URL to a location in a Pachyderm repo:

pachctl put file <repo>@<branch> -f <local file location>
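For example, here are a few ways to add data to the urls repo used later in this guide. The -r recursive flag and URL sources are additional put file capabilities; the URL in the last line is hypothetical:

# Copy a single local file into the root of the repo
pachctl put file urls@master -f data/Wikipedia

# Upload a local directory recursively
pachctl put file urls@master -r -f data/

# Fetch a file from a URL and store it in the repo (hypothetical URL)
pachctl put file urls@master -f https://example.com/list-of-urls.txt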

pachctl list file

This command lists all the files in a directory. Here’s how to use it to list the files in the root of a repo called garden:

pachctl list file garden@master

You can also drill down into nested directories:

pachctl list file garden@master:/outer_folder/inner_folder

pachctl get file

You can use this command to view the contents of a file:

pachctl get file <repo>@<branch>:/<path to file>
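For instance, you could print a file from the garden repo created earlier, or save it locally with the -o flag (the file name here is hypothetical):

# Print the file's contents to stdout
pachctl get file garden@master:/flowers.txt

# Write the file to a local path instead of stdout
pachctl get file garden@master:/flowers.txt -o flowers.txt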

pachctl inspect file

The inspect command provides you with metadata about a file, such as its path, the datum ID that the file was produced from, type (whether it’s a directory or a file), and its size:

pachctl inspect file <repo>@<branch>:/<path to file>

It gives you an output similar to this:

Path: <path to file>
Datum: <identifier>
Type: file
Size: 0KiB

pachctl delete file

This command is used to delete files from repos:

pachctl delete file <repo>@<branch>:/<path to file>

Commits

Commits in Pachyderm are a snapshot of the current state of a repo. You can add, remove, or modify multiple files in a repo under a single commit to group those operations together.

pachctl list commit

This command lists the commits made to a repo:

pachctl list commit <repo>@<branch>

pachctl inspect commit

This command provides you with metadata related to a commit:

pachctl inspect commit <repo>@<branch or commit id>

You’ll receive details like start and finish times, size, parent commits (if any), and the branch that the commit was made on.

Jobs

A job refers to the simplest unit of operation (computation or transformation) in Pachyderm. Each job runs a containerized workload on finished commits and commits the output of the workload to the pipeline’s output repo. You’ll often need to view your Pachyderm jobs to see their progress and status.

pachctl list job

You can use this command to list all jobs that have been scheduled on your Pachyderm instance:

pachctl list job

Here’s what the output will look like:

ID                               SUBJOBS PROGRESS CREATED     MODIFIED
3c444b790f7e47bf9f07c53beeaf55ae 1       ▇▇▇▇▇▇▇▇ 2 hours ago 2 hours ago 
160e8b18927e45c2852d764a385c6175 1       ▇▇▇▇▇▇▇▇ 2 hours ago 2 hours ago 
9abac1b23227462eada1a5d90f9689ca 1       ▇▇▇▇▇▇▇▇ 2 hours ago 2 hours ago

pachctl inspect job

You can use this command to view the details of a job:

pachctl inspect job <pipeline name>@<job ID>

This provides you with details like start time, duration, state, data transferred, download time, upload time, and process time.

General

There are a few commands that can be used to make changes to multiple types of Pachyderm resources.

pachctl start

The start command can be used to start a new commit, restart a stopped pipeline, or begin a new transaction:

# Start a new commit
pachctl start commit <repo>@<branch>

# Restart a stopped pipeline
pachctl start pipeline <pipeline name>

# Start a new transaction
pachctl start transaction

Transactions can run multiple Pachyderm commands at once. Instead of triggering pipelines every time you make changes to their input repos, you can batch the changes together using a transaction and have the pipelines trigger once for the collective change.
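Here’s a minimal sketch of how a transaction might be used; the extra repo name is hypothetical, and the key point is that the queued operations are applied together when the transaction is finished:

# Open a transaction; supported commands that follow are queued, not applied
pachctl start transaction

# Queue a few metadata operations
pachctl create repo logs            # hypothetical extra input repo
pachctl start commit urls@master

# Apply all queued operations at once
pachctl finish transaction

# Add data to the commit opened above, then close it;
# downstream pipelines see a single new commit on the branch
pachctl put file urls@master -f data/Wikipedia
pachctl finish commit urls@master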

pachctl resume

The resume command lets you set a stopped transaction as active. The commands you run after resuming a transaction are then added to that transaction and batched for execution when the transaction is marked as finished:

pachctl resume transaction <transaction name>

pachctl list datum

You can use the list command to list a pipeline’s datums, the smallest units of computation in Pachyderm.

You can define how your input is divided into datums, and datums can be distributed across multiple worker nodes to improve the efficiency of your pipelines. The list command is also powerful because it lets you list the datums of a pipeline that hasn’t been created yet, as long as it has a specification file defined:

pachctl list datum -f <pipeline-spec.json>

If the pipeline exists, you can list its datums using the following command:

pachctl list datum <pipeline>@<job>
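For example, using one of the job IDs from the pachctl list job output shown earlier (which pipeline that job belongs to is an assumption here):

pachctl list datum scraper@3c444b790f7e47bf9f07c53beeaf55ae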

pachctl glob file

Pachyderm supports glob patterns to define and query data. You can use glob patterns to filter files based on pattern-matching.

Here’s how you can use the glob file command to list all files that begin with ‘b’ in the master branch of a repo called garden:

pachctl glob file "garden@master:b*"

pachctl draw

The Pachyderm command-line interface (CLI) lets you visualize your pipelines to learn more about how they’re connected and dependent on one another. Here’s how you can use the draw command to print a directed acyclic graph (DAG) of your pipelines’ relations in the terminal:

pachctl draw pipeline

You can also supply a commit ID to the command to print what your pipelines look like after the given commit was made:

pachctl draw pipeline -c <commit_ID>

Building a Reproducible, Containerized Data Pipeline Using pachctl

To understand how to use pachctl commands in a true-to-life scenario, let’s build a word counter using Pachyderm.

Prerequisites

  • A running Pachyderm workspace (it’s best to set it up locally to follow along)
  • The pachctl CLI tool

You can set these up by following this tutorial.

Once you’re ready, clone this GitHub repo using the following command:

git clone https://github.com/krharsh17/pachctl-commands-test.git

Change your working directory to the newly created folder before moving ahead:

cd pachctl-commands-test

Understanding the Example

The word count example is taken from the Pachyderm examples and is based on the MapReduce algorithm. Three pipelines work together, each feeding into the next, to count the number of occurrences of each word on a given set of webpages. The three pipelines are scraper, map, and reduce.

Scraper

This pipeline takes a list of webpages to scrape as input, then fetches their content and stores it in a repo to be used as input by the map pipeline. The list of webpages lives in an input repo in the form of files, each containing the target URLs to fetch.

Here’s an example: the Wikipedia file pushed to the urls repo later in this guide contains a list of three webpage URLs.
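The exact contents aren’t shown here, but based on the scraper output later in this article (Color.html, Odor.html, and Taste.html), the file likely looks something like this, with one URL per line:

cat data/Wikipedia
https://en.wikipedia.org/wiki/Color
https://en.wikipedia.org/wiki/Odor
https://en.wikipedia.org/wiki/Taste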

Map

This pipeline uses a Go script to count the occurrences of each word on all given pages. For each page, the script counts how often each word appears and stores each count in a separate file for further processing.

For instance, if the content scraped from github.com contained the word license four times, this pipeline would create a file named license in its output repo and store 4 in it.

Reduce

This pipeline takes in all the word-related files created by the map pipeline as input. For each word, it calculates the sum of the counts stored in the files named after that word.

Let’s say there are two license files from the previous pipeline, containing 4 and 6. This pipeline will add them up and store the total, 10, in a new license file in its output repo.

Here’s a look at how these pipelines and their repos fit together:

Architecture diagram of the example

Setting Up The Input Repo

Now let’s see how to set everything up using pachctl. Run the following command to set up the input repo for the scraper pipeline:

pachctl create repo urls

This creates a repo called urls in your Pachyderm instance. Once you deploy the scraper pipeline, it will look into the urls repo for input. You don’t need to create any other input/output repos, as the pipelines will automatically create them when needed.

You can view the list of active repos in your Pachyderm instance by running the following command:

pachctl list repo

You’ll see an output similar to this:

NAME CREATED        SIZE (MASTER) DESCRIPTION 
urls 46 seconds ago ≤ 0B

Next, add a file to the urls repo to trigger the pipelines. The data directory contains a file that you can push directly to the repo. This file contains a list of three webpage URLs from which the pipeline will scrape content for word counting.

Run the following command to push the file to the repo:

cd data
pachctl put file urls@master -f Wikipedia

This command pushes the file data/Wikipedia to the urls repo. You can check the contents of the repo by running:

pachctl list file urls@master

You’ll receive an output similar to the one below. This indicates that the file has been added successfully:

NAME       TYPE SIZE 
/Wikipedia file 107B

Setting Up The Pipelines

Next, run the following commands to create the three pipelines:

cd .. # If you're still in the `data` directory
pachctl create pipeline -f pipelines/scraper.json
pachctl create pipeline -f pipelines/map.json
pachctl create pipeline -f pipelines/reduce.json

As soon as you create the three pipelines, the scraper pipeline triggers, since its input repo already contains data ready for processing. Once it finishes, it stores its output in the repo that serves as the input for the next pipeline, map. Similarly, each pipeline processes the data as it moves along through the repos.

Finally, you’ll see a list of files containing the results in the output repo of the reduce pipeline.

You can use the list pipeline command to view the freshly deployed pipelines:

pachctl list pipeline

Here’s what the output would look like:

NAME    VERSION INPUT        CREATED        STATE / LAST JOB DESCRIPTION                                                                                 
reduce  1       map:/        9 seconds ago  running / -      A pipeline that aggregates the total counts for each word.                                  
map     1       scraper:/*/* 14 seconds ago running / -      A pipeline that tokenizes scraped pages and appends counts of words to corresponding files. 
scraper 1       urls:/*      19 seconds ago running / -      A pipeline that pulls content from a specified internet source.

Viewing the Results

You can view all active repos by running:

pachctl list repo

Notice that three new repos other than urls were created automatically, as mentioned before:

NAME    CREATED        SIZE (MASTER) DESCRIPTION                       
reduce  34 seconds ago ≤ 0B          Output repo for pipeline reduce.  
map     39 seconds ago ≤ 0B          Output repo for pipeline map.     
scraper 44 seconds ago ≤ 0B          Output repo for pipeline scraper. 
urls    3 minutes ago  ≤ 107B

You can validate that each job was triggered and run successfully by listing all jobs using the following command:

pachctl list job

Here’s what the output will look like:

ID                               SUBJOBS PROGRESS CREATED        MODIFIED
3c444b790f7e47bf9f07c53beeaf55ae 1       ▇▇▇▇▇▇▇▇ 46 seconds ago 46 seconds ago 
160e8b18927e45c2852d764a385c6175 1       ▇▇▇▇▇▇▇▇ 52 seconds ago 52 seconds ago 
9abac1b23227462eada1a5d90f9689ca 1       ▇▇▇▇▇▇▇▇ 57 seconds ago 57 seconds ago

To understand each step’s output better, you can use the pachctl list file and pachctl get file commands.

Scraper Results

View the output of the scraping pipeline by running:

pachctl list file scraper@master

You’ll notice one directory:

NAME        TYPE SIZE     
/Wikipedia/ dir  928.1KiB

Run the following command to list the contents of that directory:

pachctl list file scraper@master:/Wikipedia

You’ll see the following files in the directory:

NAME                  TYPE SIZE     
/Wikipedia/Color.html file 257KiB   
/Wikipedia/Odor.html  file 285.3KiB 
/Wikipedia/Taste.html file 385.8KiB

These were created by the scraper to store the contents of each webpage. The map pipeline was triggered as soon as these files were created.

Mapper Results

View the output of the map pipeline by running:

pachctl list file map@master

You’ll notice a list of files:

NAME                                                               TYPE SIZE     
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/ dir  8.731KiB 
/29211c44a492ff8531747f288fc9541349396439190b05f987fd5a389266dfe9/ dir  7.294KiB 
/dedf39f2ace633be0c3dafbf63d3bd69fddee80d468fcc936a162a47547abf9b/ dir  7.158KiB

These are the directories for each webpage and contain the output files created by the map for that webpage. You can list the files in any of the directories by running a similar command:

pachctl list file map@master:/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838

You’ll be presented with a list of output files:

NAME                                                                           TYPE SIZE
.
.
.
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/ability      file 3B   
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/able         file 2B   
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/abnormally   file 2B   
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/abook        file 3B   
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/about        file 3B   
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/aboutsite    file 2B   
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/above        file 2B   
/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/abovebelow   file 2B
.
.
.

Each file is named after a word and contains the number of times that word occurred on the corresponding webpage. You can check this by running:

pachctl get file map@master:/00625d0e43f1a69e9d18363a0b9d5c9c3a81f295504572c78bd83a8f1072d838/able
5

The output above means that the word able appeared five times on the webpage corresponding to this directory. As soon as these files were created, the reduce pipeline was triggered.

Reduce Results

You can view the output of the reduce pipeline by running:

pachctl list file reduce@master

Similar to the map’s results, you’ll receive a long list of files:

NAME            TYPE SIZE 
.
.
/ability        file B   
/able           file B   
/abnormal       file B   
/abnormally     file B   
/abook          file B   
/about          file B   
/aboutsite      file B   
/above          file B   
/abovebelow     file B   
.
.

There’s one file for each word that was identified and counted by the mapper across all webpages. You can check a word’s total count by running:

pachctl get file reduce@master:/able
15

You can also view a DAG of your pipelines to visualize how they’re connected by running the following command:

pachctl draw pipeline

Here’s what the output will look like:

       +-----------+
       |   urls    |
       +-----------+
             |            
             |            
             |            
             |            
             |            
       +-----------+
       |  scraper  |
       +-----------+
             |            
             |            
             |            
             |            
             |            
       +-----------+
       |    map    |
       +-----------+
             |            
             |            
             |            
             |            
             |            
       +-----------+
       |  reduce   |
       +-----------+

Conclusion

As you can see, the pachctl CLI enables you to easily interact with a Pachyderm instance and create data pipelines. Hopefully the word-counter use case in this article gave you a chance to see the various commands in action, so you can feel comfortable working with them on your own.

Pachyderm offers a quick and seamless way of building reproducible pipelines to streamline data transformation. Once you’ve defined your transformation as a job, you can create as many pipelines as you want based on it. Autoscaling and parallel processing are simple, and you can run these pipelines across all major cloud providers and on-prem setups easily.

Article written by Kumar Harsh