Pachyderm 1.12 GA Release Announcement

2020 was a crazy year, to say the least. That’s why we thought it best not to tempt fate by trying to squeeze in a major release during the last few weeks of December. And so, with a renewed sense of optimism, and 2020 now officially in the history books (hooray!), we’re excited to announce the release of Pachyderm 1.12.

Here’s a look at what’s new:

New Pipeline Input Type: Groups

Groups give you a brand new way to combine data from multiple sources on Pachyderm. Similar to a database group-by, Groups in Pachyderm are a special type of pipeline input that enables you to aggregate files from one or more Pachyderm repositories via a particular naming pattern. For example:

Groups within a single repo

$ pachctl list file repo@master
/labs/patientID1-labID1.txt
/labs/patientID2-labID1.txt
/labs/patientID1-labID2.txt
/labs/patientID3-labID1.txt

By configuring our pipeline input type to group and defining our capture group to match on patientID, we can aggregate all of patientID1's lab results into a single datum and create separate datums for patientID2 and patientID3. Neat!
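To make the grouping behavior concrete, here is a minimal sketch in plain Python. It is not Pachyderm code and does not call any Pachyderm API; the regular expression and the `group_paths` helper are assumptions made up for illustration, standing in for the capture group on patientID described above.

```python
import re
from collections import defaultdict

# File paths copied from the listing above.
paths = [
    "/labs/patientID1-labID1.txt",
    "/labs/patientID2-labID1.txt",
    "/labs/patientID1-labID2.txt",
    "/labs/patientID3-labID1.txt",
]

# Hypothetical pattern: the parenthesised part plays the role of the capture group.
PATTERN = re.compile(r"^/labs/(patientID\d+)-labID\d+\.txt$")


def group_paths(paths):
    """Return {captured value: [paths]}, one entry per would-be datum."""
    datums = defaultdict(list)
    for path in paths:
        match = PATTERN.match(path)
        if match:  # paths that do not match the pattern are skipped
            datums[match.group(1)].append(path)
    return dict(datums)


datums = group_paths(paths)
for patient, files in datums.items():
    print(patient, files)
```

In this sketch patientID1 ends up with two files in one bucket, while patientID2 and patientID3 each get a bucket of their own, which mirrors the three datums described above.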

Where Pachyderm Groups really start to shine is when you want to combine data from multiple repos and return the result as a single datum to be processed independently. For example, imagine we have a retail department store chain with multiple stores and different repos storing purchase information, return information and store identity information:

Groups using multiple repos

Repo 1: Purchases
NAME                       TYPE SIZE
/ORDERW078929_STOREID2.txt file 64B
/ORDERW080231_STOREID5.txt file 65B
...
Repo 2: Returns
NAME                       TYPE SIZE
/ORDERW080231_STOREID5.txt file 65B
/ORDERW080520_STOREID1.txt file 65B
...
Repo 3: Stores
NAME                       TYPE SIZE
/STOREID1.txt              file 85B
/STOREID2.txt              file 85B
/STOREID3.txt              file 84B

Let’s say we wanted to get a list of all transactions (purchases or returns) grouped by storeID. Thanks to the new Pachyderm Group input type, this is easy. We simply specify our matching criteria and let Pachyderm handle the rest.
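As a rough illustration of what "grouped by storeID" means across repos, the sketch below buckets the purchase and return files listed earlier by the store ID in their names. It is plain Python rather than a pipeline spec; the `STORE` pattern and the `group_by_store` helper are hypothetical, and the entries elided with "..." in the listings are simply left out.

```python
import re
from collections import defaultdict

# File names taken from the listings above; the "..." entries are omitted.
repos = {
    "purchases": ["/ORDERW078929_STOREID2.txt", "/ORDERW080231_STOREID5.txt"],
    "returns": ["/ORDERW080231_STOREID5.txt", "/ORDERW080520_STOREID1.txt"],
}

# Hypothetical matching criterion: capture the digits after "_STOREID".
STORE = re.compile(r"_STOREID(\d+)\.txt$")


def group_by_store(repos):
    """Return {store ID: [(repo, file name), ...]}, one entry per would-be datum."""
    datums = defaultdict(list)
    for repo, files in repos.items():
        for name in files:
            match = STORE.search(name)
            if match:
                datums[match.group(1)].append((repo, name))
    return dict(datums)


by_store = group_by_store(repos)
for store_id in sorted(by_store):
    print(store_id, by_store[store_id])
```

Store 5 has both a purchase and a return, so both files land in the same bucket, while stores 1 and 2 each get a bucket with a single file.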

If you're familiar with Pachyderm joins, you might be wondering how groups differ. In a nutshell, a join returns one datum per match, whereas a group returns all matches together as a single datum.
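The difference is easiest to see side by side. The following sketch models both behaviors in plain Python under simplifying assumptions: the file names are invented, each file is represented as a (store ID, name) pair, and `join_datums` and `group_datums` are hypothetical helpers, not Pachyderm functions.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (store ID, file name) pairs for two input repos.
purchases = [("5", "purchase_a.txt"), ("5", "purchase_b.txt"), ("2", "purchase_c.txt")]
returns = [("5", "return_a.txt")]


def index(files):
    """Map each key to the list of file names carrying that key."""
    by_key = defaultdict(list)
    for key, name in files:
        by_key[key].append(name)
    return by_key


def join_datums(left, right):
    """Join-like behavior: one datum per matching (left, right) pair."""
    l, r = index(left), index(right)
    datums = []
    for key in sorted(l.keys() & r.keys()):
        datums.extend(product(l[key], r[key]))
    return datums


def group_datums(*inputs):
    """Group-like behavior: one datum per key, holding every file with that key."""
    merged = defaultdict(list)
    for files in inputs:
        for key, name in files:
            merged[key].append(name)
    return dict(merged)


joined = join_datums(purchases, returns)
grouped = group_datums(purchases, returns)
print(len(joined), "join datums:", joined)
print(len(grouped), "group datums:", grouped)
```

With this toy data the join yields two datums for store 5 (each purchase paired with the return), while the group yields a single datum for store 5 containing all three files, plus one for store 2.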

Try groups out for yourself with this great example.

Automated Deferred Processing with Triggers

Another exciting addition to Pachyderm 1.12 is Triggers, which give users greater control over how and when data is processed.

Take our retail example from earlier. Throughout the day we have lots of transactions happening, and for each one the company has to pay the bank anywhere from 1-3% of the total plus a flat fee (aka interchange rates). A Pachyderm Trigger could help reduce those costs by automatically deferring transaction processing until a predefined condition is met: for example, every night at 10pm, or in batches of 20.

Now, instead of paying a processing fee for each transaction, we only pay it once per batch. Cha ching!
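The "batches of 20" idea can be sketched as a simple counter. The class below is a hypothetical model written in plain Python to show the deferral logic; it is not how Pachyderm implements Triggers, and the name `CommitCountTrigger` and its fields are assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class CommitCountTrigger:
    """Hypothetical model of a 'process every N commits' trigger."""

    threshold: int    # number of commits to accumulate before processing
    pending: int = 0  # commits received since the last time the trigger fired
    fired: int = 0    # how many times processing has been kicked off

    def on_commit(self) -> bool:
        """Record one new commit; return True if processing should run now."""
        self.pending += 1
        if self.pending >= self.threshold:
            self.pending = 0
            self.fired += 1
            return True
        return False


trigger = CommitCountTrigger(threshold=20)
results = [trigger.on_commit() for _ in range(45)]
print(trigger.fired, trigger.pending)  # prints: 2 5
```

After 45 incoming transactions the model has run processing twice (at the 20th and 40th) and is holding five transactions for the next batch. A time-based condition such as "every night at 10pm" would swap the counter for a schedule check.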

Give Pachyderm Triggers a try.

Pachyderm 1.12 Enterprise Additions:

Pachyderm Enterprise includes everything mentioned so far as well as a few other goodies:

Other Noteworthy Items:

Improved Spouts: We re-architected spouts to improve stability and security while also making it easier to integrate with external data sources.

New “outer joins”: Where inner joins in Pachyderm return only matched results, outer joins return a result whether or not there is a match.

pachctl list datum: Now includes a dry run option for testing glob patterns.

pachctl update pipeline: Now supports transactions.
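For readers who want to see the inner versus outer join distinction mentioned above in concrete terms, here is a small plain-Python sketch. It treats both inputs as "outer" for simplicity, uses invented file names, and the helpers `inner_join` and `outer_join` are hypothetical; it illustrates the general idea rather than Pachyderm's exact datum layout.

```python
from collections import defaultdict
from itertools import product

# Hypothetical (store ID, file name) pairs for two inputs.
purchases = [("1", "purchase_a.txt"), ("2", "purchase_b.txt")]
returns = [("2", "return_a.txt"), ("3", "return_b.txt")]


def index(files):
    """Map each key to the list of file names carrying that key."""
    by_key = defaultdict(list)
    for key, name in files:
        by_key[key].append(name)
    return by_key


def inner_join(left, right):
    """Only keys present on both sides produce a result."""
    l, r = index(left), index(right)
    return [pair for key in sorted(l.keys() & r.keys())
            for pair in product(l[key], r[key])]


def outer_join(left, right):
    """Every key produces a result; a missing side is represented by None."""
    l, r = index(left), index(right)
    datums = []
    for key in sorted(l.keys() | r.keys()):
        lefts = l.get(key) or [None]
        rights = r.get(key) or [None]
        datums.extend(product(lefts, rights))
    return datums


inner = inner_join(purchases, returns)
outer = outer_join(purchases, returns)
print("inner:", inner)
print("outer:", outer)
```

Here only store 2 has files on both sides, so the inner join yields a single result, while the outer join also keeps the unmatched purchase for store 1 and the unmatched return for store 3.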

Interested in learning more about Pachyderm? Schedule some time with one of our experts.

About the Author

Nick Harvey

Nick Harvey is the Head of Marketing at Pachyderm and a father of two. He's spent the last decade working on open source, machine learning, and all things Kubernetes.