Webinar recap: Pachyderm 101 Installation, Configuration and Core Concepts
One of the best things about the Pachyderm community is its curiosity and creativity. Every day, users bring us interesting problems they’re trying to solve with Pachyderm in Slack and on GitHub, and this webinar was no different, so we’ve rounded up the Q&A from our webinar audience in this blog post. Enjoy!
Recently, we held a Pachyderm webinar titled “Pachyderm 101: Installation and Core Concepts” with Brody Osterbuhr, one of our Customer Support Engineers. Brody spends his time onboarding and supporting Pachyderm’s customers in making the most of their data versioning and pipeline DAGs. The focus of this webinar was getting started: provisioning Kubernetes, installing Pachyderm, and the configuration needed to build your first pipeline.
This webinar contains a great overview of some core Pachyderm concepts that power our incremental processing, diff-based version control, and flexible DAG management. Give it a watch to get your first cluster up and running, set up your first pipeline, and run some version-controlled processing jobs with Pachyderm.
Does data always need to be pushed into repositories, or can Pachyderm pull? Also, how can data be output out of the PFS automatically? (i.e. how does Pachyderm integrate with other systems?)
Data does not have to be stored in Pachyderm. Your pipeline code can grab data from an external source, process it, and output it wherever you want it to go: you can use Pachyderm’s S3 gateway, or skip it entirely and write to your own storage. Pachyderm isn’t very picky about that, but you will miss out on incrementality, deduplication, and parallel processing if you aren’t using a Pachyderm repo. Depending on the use case, though, it can sometimes be possible to architect around this. To output data from PFS, a cron pipeline can move data from an output repository to wherever you need it. As for integration with other systems, we have integrations and a team that can help you get your data where it needs to be; reach out to us on Slack and we can help you find a solution.
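As a rough sketch, a daily cron pipeline that copies results out of an output repo and uses the pipeline spec’s `egress` field to push them to external object storage might look like this (the repo, image, and bucket names here are hypothetical):

```json
{
  "pipeline": { "name": "export-results" },
  "input": {
    "cross": [
      { "cron": { "name": "tick", "spec": "@daily" } },
      { "pfs": { "repo": "results", "glob": "/" } }
    ]
  },
  "transform": {
    "cmd": ["sh", "-c", "cp -r /pfs/results/* /pfs/out/"]
  },
  "egress": { "URL": "s3://my-external-bucket/exports" }
}
```

The cron input triggers the pipeline on a schedule, and `egress` tells Pachyderm to copy the pipeline’s output commit to the external bucket after each job.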
I’ve noticed that pipelines with large outputs spend a long time in the “Finishing” state. Can you explain what Pachyderm is doing in that state, and why it can take so long?
This has historically taken longer than you might expect, and it’s a performance improvement our engineering team is focused on. What’s happening during that step is compaction: Pachyderm is ensuring your data is chunked as compactly and efficiently as possible so that downstream reads are fast. Our performance testing shows that, with this compaction step, your DAG sees an overall performance benefit because of the time saved reading data downstream.
When putting local files into Pachyderm File System, are the files copied to PFS storage or just their metadata?
If you run pachctl put file, you are uploading the file itself, not just its metadata, to object storage (typically S3), along with pointers to the locations of the original and output files. Learn more about this command.
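For example, a typical session might look like this (the repo and file names are made up for illustration):

```
$ pachctl create repo images
$ pachctl put file images@master:/photo.png -f ./photo.png
$ pachctl list file images@master
```

The first command creates a versioned repo, the second uploads the local file into a new commit on the master branch, and the third lists the files now stored in that branch.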
Is Red Hat OpenShift a commonly used Kubernetes distribution for Pachyderm?
Can Pachyderm deal with data streams? If I want to read events off a queue, would this result in a single file for the whole queue, or a file for each message?
It’s really about how you craft your pipeline. Depending on the use case, you might use Pachyderm Spouts to treat each item in the queue as a commit, or treat the queue as a whole as its own file.
This is going to depend on how often new files are arriving, and the size of the files:
If you are getting new files every few seconds and they’re 10kb each, it would make sense to batch these into commits rather than making a commit for each one.
If you get new video files every couple of hours, making each of those its own commit might make more sense.
Part of what’s great about Pachyderm is that you can solve a lot of different problems by changing your architecture approach based on your use case, and this is a great situation to bring to our Slack community. We can help you solve the problem in Slack, or maybe set up an office hours session to dive deeper into the best way to approach your use case.
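The batching trade-off above can be sketched in plain Python. This is only an illustration of the decision logic: the thresholds are made-up numbers, and in a real spout the flush step would be a commit of the buffered messages to a Pachyderm repo rather than an in-memory list.

```python
import time

class CommitBatcher:
    """Buffer small queue messages and flush them as one batch (one commit)."""

    def __init__(self, max_bytes=1_000_000, max_age_s=60.0):
        self.max_bytes = max_bytes   # flush once the buffer reaches this size...
        self.max_age_s = max_age_s   # ...or this age, whichever comes first
        self.buffer = []
        self.size = 0
        self.started = None
        self.flushed = []            # each entry stands in for one commit

    def add(self, message: bytes):
        if self.started is None:
            self.started = time.monotonic()
        self.buffer.append(message)
        self.size += len(message)
        if self.size >= self.max_bytes or time.monotonic() - self.started >= self.max_age_s:
            self.flush()

    def flush(self):
        if self.buffer:
            # Stand-in for: start commit, put buffered files, finish commit.
            self.flushed.append(list(self.buffer))
            self.buffer, self.size, self.started = [], 0, None

# Four 10-byte messages with a 30-byte threshold: the first three are
# batched into one "commit", and the final flush commits the leftover one.
batcher = CommitBatcher(max_bytes=30)
for msg in [b"0123456789", b"0123456789", b"0123456789", b"0123456789"]:
    batcher.add(msg)
batcher.flush()
```

For the opposite case (large, infrequent files such as video), you would skip the buffering entirely and make each file its own commit.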
Does Pachyderm interoperate with different “sources” and “connectors” for other IT systems, like MongoDB or catalog software?
Our integrations team is always working on new connectors, but right now you can connect nearly any data source to a Pachyderm repo, depending on your pipeline type: grab data from your MongoDB database once a day with a cron pipeline, or use a spout pipeline to build a data-driven pipeline that processes files through your DAG as soon as new data is available.
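The spout variant might be sketched like this; the image name and consumer script are hypothetical, and the consumer code inside the container would read from the queue (or MongoDB change stream) and write files to the pipeline’s output:

```json
{
  "pipeline": { "name": "mongo-ingest" },
  "spout": {},
  "transform": {
    "image": "my-registry/mongo-reader:latest",
    "cmd": ["python3", "/app/consume.py"]
  }
}
```

Unlike a cron pipeline, a spout has no input repo; it runs continuously and commits new data as it arrives.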
Can we use Python functions to return a list of files instead of glob patterns?
Not currently, though we do have a feature request for customizing this query. Right now, the glob pattern is very flexible; take a look at how you can use it to select exactly the datums you want to process.
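As a rough illustration of how a glob pattern splits a repo into datums, here is a Python sketch using the standard library’s `fnmatch` (the file paths are made up, and `fnmatch` is only an approximation: its `*` also crosses `/`, whereas Pachyderm’s matcher has its own semantics):

```python
from fnmatch import fnmatch

def datums(paths, glob):
    """Sketch: each path matching the glob becomes its own datum.
    With glob "/", the whole repo is processed as a single datum."""
    if glob == "/":
        return [sorted(paths)]                       # one datum, everything together
    return [[p] for p in sorted(paths) if fnmatch(p, glob)]

files = ["/a/1.csv", "/a/2.csv", "/b/1.csv"]
print(datums(files, "/"))      # one datum containing all three files
print(datums(files, "/a/*"))   # two datums, one per file under /a
```

The key idea is that the glob controls the unit of parallel, incremental work: a wider match means fewer, larger datums; a narrower match means many small datums that can be processed independently.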
Is there a way to have local machines involved in a pipeline, e.g., in training when you may want to use your own GPU resources?
Yes, this is something our customers do frequently. In a pipeline spec, each step tells the cluster what to run, how to run it, and what resources to use.
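For example, a training pipeline can request a GPU via the spec’s `resource_limits` field; Kubernetes then schedules the worker on a node that exposes one. The repo and image names below are hypothetical:

```json
{
  "pipeline": { "name": "train-model" },
  "input": { "pfs": { "repo": "training-data", "glob": "/" } },
  "transform": {
    "image": "my-registry/train:latest",
    "cmd": ["python3", "/app/train.py"]
  },
  "resource_limits": {
    "gpu": { "type": "nvidia.com/gpu", "number": 1 }
  }
}
```

To use your own machine’s GPU, that machine needs to be joined to the Kubernetes cluster as a node with its GPU advertised to the scheduler.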
For using and installing Pachyderm Console, are there any prerequisites to installing and using it?
As of Pachyderm 2.3, Console is a core feature of Pachyderm and is enabled by default in the Helm values file. In my production testing cluster, I use an external DNS name, and you can add user auth and other security on top. Console is very powerful and useful, and we’re looking to improve it all the time.
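In Helm values terms, enabling Console is a one-line toggle; the fragment below is illustrative and follows the Pachyderm chart’s key naming:

```yaml
# Illustrative Helm values fragment for the Pachyderm chart
console:
  enabled: true
```

Auth, TLS, and DNS settings layer on top of this in the same values file.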
Have more questions of your own? Our engineering team is active in the Pachyderm Slack community, helping users set up and troubleshoot their pipelines. We’re also active on GitHub. And make sure you don’t miss our next webinar by signing up for our newsletter.