One of the major features included in Pachyderm v1.8 (and being backported to 1.7.11) is improved support for large files of structured data. It is aimed specifically at users who want to use Pachyderm as their versioned data lake and dump large swaths of CSV and SQL data into Pachyderm repos to track how those files change over time. Pachyderm v1.8 can now ingest structured data as a single file and automatically chunk it up so it can be processed as a distributed workload across the cluster. This was one of the biggest requests from our community members trying to do more ETL and aggregation workloads in Pachyderm.
Let's Roll Up Our Sleeves
To ingest SQL data (into the data repo on the master branch) and have Pachyderm take care of all the splitting, you just need to run:
pachctl put-file data master users --split sql -f users.pgdump
When you use pachctl put-file --split sql …, your pg dump file is split into three parts: the header, the rows, and the footer. The header contains all the SQL statements in the pg dump that set up the schema and tables. The rows are split into individual files (or, if you specify --target-file-datums or --target-file-bytes, into files holding multiple rows each). The footer contains the remaining SQL statements for setting up the tables.
The header and footer are stored on the directory containing the rows. This way, if you request a get-file on the directory, you’ll get just the header and footer. If you request an individual file, you’ll see the header plus the row(s) plus the footer. If you request all the files with a glob pattern, e.g. /directoryname/*, you’ll receive the header plus all the rows plus the footer, recreating the full pg dump. In this way, you can construct full or partial pg dump files that can be processed independently.
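To make the assembly rule concrete, here is a small shell sketch that mimics it with ordinary files. It does not call Pachyderm and says nothing about how Pachyderm stores data internally. The file names (header, row0, row1, footer) and the SQL inside them are made up for illustration.

```shell
# Illustrative model only: plain files standing in for the header, rows, and footer.
# All names and contents below are hypothetical, not Pachyderm output.
workdir="$(mktemp -d)"

# "Header": statements that set up the table (two lines).
printf 'CREATE TABLE users (id int, name text);\nCOPY users (id, name) FROM stdin;\n' > "$workdir/header"

# "Rows": one file per row, as with the default split.
printf '1\talice\n' > "$workdir/row0"
printf '2\tbob\n' > "$workdir/row1"

# "Footer": whatever follows the rows (here, the end-of-data marker).
printf '\\.\n' > "$workdir/footer"

# Requesting a single row file: header + that row + footer.
partial="$(cat "$workdir/header" "$workdir/row0" "$workdir/footer")"

# Requesting every row file (the glob case): header + all rows + footer.
full="$(cat "$workdir/header" "$workdir"/row* "$workdir/footer")"

printf '%s\n' "$partial"
```

The point of the sketch is that both results are self-contained: each starts with the schema setup and ends with the footer, so either one could be handed to a worker on its own.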
Of course, SQL data is just one example. For CSV data, the behavior is the same, but the steps are slightly different because you need to define the header manually. We’ll be making this smarter in a future release, but for now you can ingest a CSV file in two steps.
First, add the data. In this case we’re creating one file per line of our CSV. Just as with SQL, you can easily change that to chunks of rows using --target-file-datums or --target-file-bytes.
cat users.csv | tail -n +2 | pachctl put-file bar master users --split line
Now we’ll add the header itself:
cat users.csv | head -n 1 | pachctl put-header bar master users
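If you want to check what each half of those pipelines sends before involving pachctl, you can run the head and tail parts by themselves. The sketch below does that against a tiny sample CSV. The file contents are invented for the example, and nothing here talks to Pachyderm.

```shell
# Hypothetical three-line CSV used only for this demonstration.
csv="$(mktemp)"
printf 'id,name\n1,alice\n2,bob\n' > "$csv"

# Same filter as the put-header step: just the first line.
header="$(head -n 1 "$csv")"

# Same filter as the put-file step: everything from line 2 onward.
rows="$(tail -n +2 "$csv")"

printf 'header: %s\n' "$header"
printf 'rows:\n%s\n' "$rows"
```

Seeing the two pieces separately makes it clear why the header has to be added in its own step: the row data that gets split never contains it.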
If you want to learn more details about working with structured data and headers/footers, check out our documentation.