2020 was a crazy year, to say the least. That’s why we thought it best not to tempt fate by trying to squeeze in a major release during the last few weeks of December. And so, with a renewed sense of optimism, and 2020 now officially in the history books (hooray!), we’re excited to announce the release of Pachyderm 1.12.
Here’s a look at what’s new:
New Pipeline Input Type: Groups
Groups give you a brand new way to combine data from multiple sources on Pachyderm. Similar to a database group-by, Groups in Pachyderm are a special type of pipeline input that enables you to aggregate files from one or more Pachyderm repositories via a particular naming pattern. For example:
Groups within a single repo
$ pachctl list file repo@master
/labs/patientID1-labID1.txt
/labs/patientID2-labID1.txt
/labs/patientID1-labID2.txt
/labs/patientID3-labID1.txt
By configuring our pipeline input type to group and defining our capture group to match on patientID, we can aggregate all of patientID1's lab results into a single datum and create separate datums for patientID2 and patientID3. Neat!
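To make the grouping behavior concrete, here is a minimal Python sketch that mimics it outside of Pachyderm. It is not Pachyderm's implementation or API: the regular expression and the `group_files` helper are hypothetical, and the file paths are the ones from the listing above.

```python
import re
from collections import defaultdict

# File paths from the listing above.
files = [
    "/labs/patientID1-labID1.txt",
    "/labs/patientID2-labID1.txt",
    "/labs/patientID1-labID2.txt",
    "/labs/patientID3-labID1.txt",
]

# Hypothetical pattern: the first capture group is the patient ID.
PATTERN = re.compile(r"^/labs/(patientID\d+)-labID\d+\.txt$")


def group_files(paths, pattern):
    """Return {key: [paths]}, where key is the pattern's first capture group."""
    datums = defaultdict(list)
    for path in paths:
        match = pattern.match(path)
        if match:  # paths that do not match the pattern are skipped
            datums[match.group(1)].append(path)
    return dict(datums)


datums = group_files(files, PATTERN)
for key, members in datums.items():
    print(key, members)
```

Each dictionary entry stands in for one datum: patientID1 ends up with two files, while patientID2 and patientID3 get one file each.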
Where Pachyderm Groups really start to shine is when you want to combine data from multiple repos and return the result as a single datum to be processed independently. For example, imagine we have a retail department store chain with multiple stores and different repos storing purchase information, return information and store identity information:
Groups using multiple repos
Repo 1: Purchases
NAME                        TYPE SIZE
/ORDERW078929_STOREID2.txt  file 64B
/ORDERW080231_STOREID5.txt  file 65B
...

Repo 2: Returns
NAME                        TYPE SIZE
/ORDERW080231_STOREID5.txt  file 65B
/ORDERW080520_STOREID1.txt  file 65B
...

Repo 3: Stores
NAME                        TYPE SIZE
/STOREID1.txt               file 85B
/STOREID2.txt               file 85B
/STOREID3.txt               file 84B
Let’s say we wanted to get a list of all transactions (purchases or returns) grouped by storeID. Thanks to the new Pachyderm Group input type, this is easy. We simply specify our matching criteria and let Pachyderm handle the rest.
If you're familiar with Pachyderm joins, you might be wondering how groups are different. In a nutshell, joins return a single datum per match. Groups, on the other hand, return all matches as a single datum.
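The difference is easier to see side by side. The sketch below is a plain-Python illustration of the two behaviors as described here, not Pachyderm code. The file names are hypothetical (chosen so one store has more than one matching file), and the helpers `by_store`, `join_datums`, and `group_datums` are made up for this example. It also assumes that a group still produces a datum for a key that appears in only one repo.

```python
import re
from collections import defaultdict
from itertools import product

# Hypothetical file listings, not taken from the repos shown earlier.
purchases = [
    "/ORDER_A_STOREID1.txt",
    "/ORDER_B_STOREID1.txt",
    "/ORDER_C_STOREID2.txt",
]
returns = [
    "/ORDER_A_STOREID1.txt",
    "/ORDER_D_STOREID3.txt",
]

STORE = re.compile(r"_(STOREID\d+)\.txt$")


def by_store(paths):
    """Index paths by the store ID captured from the file name."""
    index = defaultdict(list)
    for path in paths:
        match = STORE.search(path)
        if match:
            index[match.group(1)].append(path)
    return index


def join_datums(left, right):
    """Join-style: one datum per matching (left, right) pair."""
    l, r = by_store(left), by_store(right)
    datums = []
    for key in sorted(l.keys() & r.keys()):
        datums.extend(product(l[key], r[key]))
    return datums


def group_datums(left, right):
    """Group-style: one datum per key, holding every file with that key."""
    l, r = by_store(left), by_store(right)
    return {
        key: [("purchases", p) for p in l.get(key, [])]
        + [("returns", p) for p in r.get(key, [])]
        for key in sorted(l.keys() | r.keys())
    }


joined = join_datums(purchases, returns)
grouped = group_datums(purchases, returns)
print(len(joined), "join datums")
print(len(grouped), "group datums")
```

With this data, the join yields two datums for STOREID1 (one per purchase and return pair), while the group yields a single STOREID1 datum containing all three files.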
Automated Deferred Processing with Triggers
Another exciting addition to Pachyderm 1.12 is Triggers, which give users greater control over how and when to process data.
Take our retail example from earlier. Throughout the day we have lots of transactions happening; for each transaction, the company has to pay the bank anywhere from 1-3% of the total plus a flat fee (aka interchange rates). A Pachyderm Trigger could help reduce those costs by automatically deferring processing until a predefined condition is met, for example every night at 10pm or once 20 transactions have accumulated.
Now, instead of paying a processing fee for each transaction, we only pay it once per batch. Cha ching!
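As a rough illustration of the "batches of 20" idea, the following Python sketch buffers incoming items and only hands them to a processing function once 20 have accumulated. The `CountTrigger` class is a hypothetical stand-in written for this post; it does not reflect how Pachyderm Triggers are configured or implemented.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CountTrigger:
    """Hypothetical stand-in: buffer items, process every `batch_size` items."""

    batch_size: int
    process: Callable[[List[str]], None]
    pending: List[str] = field(default_factory=list)

    def add(self, item: str) -> bool:
        """Buffer one item; return True if this call fired the trigger."""
        self.pending.append(item)
        if len(self.pending) >= self.batch_size:
            batch, self.pending = self.pending, []
            self.process(batch)
            return True
        return False


processed_batches: List[List[str]] = []
trigger = CountTrigger(batch_size=20, process=processed_batches.append)

for n in range(45):
    trigger.add(f"transaction-{n}")

print(len(processed_batches), len(trigger.pending))  # prints "2 5"
```

After 45 transactions, processing has run twice (at 20 and at 40), and five transactions are still waiting for the next batch. A time-based condition such as "every night at 10pm" would follow the same shape, with a clock check in place of the count check.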
Pachyderm 1.12 Enterprise Additions:
Pachyderm Enterprise includes everything mentioned so far as well as a few other goodies:
- Group support for OIDC
- Auth-enabled Extract/Restore
Other Noteworthy Items:
- Improved Spouts: We re-architected spouts to improve stability and security while also making it easier to integrate with external data sources.
- New “outer joins”: Where inner joins in Pachyderm will only return matched results, outer joins will return a result regardless of whether there’s a match or not.
- pachctl list datum: Now includes a dry run option for testing glob patterns.
- pachctl update pipeline: Now supports transactions.
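For readers less familiar with the inner versus outer join distinction mentioned in the list above, here is a small Python sketch of the general idea. It uses the purchase and return file names from the retail example, keyed by store ID. The `inner_join` and `outer_join` helpers are hypothetical, and the sketch shows a full outer join for simplicity; how Pachyderm decides which inputs are treated as outer is not covered here.

```python
# Files keyed by store ID, using names from the retail example.
purchases = {
    "STOREID2": "/ORDERW078929_STOREID2.txt",
    "STOREID5": "/ORDERW080231_STOREID5.txt",
}
returns = {
    "STOREID5": "/ORDERW080231_STOREID5.txt",
    "STOREID1": "/ORDERW080520_STOREID1.txt",
}


def inner_join(left, right):
    """Keep only the keys that appear on both sides."""
    return [(k, left[k], right[k]) for k in sorted(left.keys() & right.keys())]


def outer_join(left, right):
    """Keep every key; the missing side is filled with None."""
    return [
        (k, left.get(k), right.get(k))
        for k in sorted(left.keys() | right.keys())
    ]


inner = inner_join(purchases, returns)
outer = outer_join(purchases, returns)
print(len(inner), len(outer))
```

Only STOREID5 appears in both inputs, so the inner join returns one result, while the outer join returns three, with `None` standing in for the missing side.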
Interested in learning more about Pachyderm? Schedule some time with one of our experts.