Bringing versioned data from the cloud to your fingertips
A data-centric mindset is crucial in AI. Your models can’t represent changes in the real world if your data is frozen in time. This is why Pachyderm has long been the go-to platform for data engineers: it lets you put data development at the center of your workflow.
But a common complaint about Pachyderm is that its production-ready rigor makes experimentation difficult, especially for data scientists. Today, we’re announcing something to ease that difficulty: the new JupyterLab Pachyderm Mount Extension, which extends the Pachyderm platform.
What is JupyterLab?
Jupyter notebooks are one of the most commonly used tools among data scientists. They are the connection point between experimentation and understanding, allowing scientists to interact with data directly to learn and apply their knowledge. Notebooks are where we connect our code to our data, and they provide one of the most convenient ways to do so.
Not only can we write code to transform our data, but we can also document our thinking alongside our functions in simple, streamlined, computational documents. Data scientists can communicate their understanding of the data: what conclusions can be drawn from an experiment, what should be tried next, or where things went wrong. Because notebooks function as an active research journal, experiments can be shared, reproduced (if correctly managed), and iterated on.
JupyterLab is a next-generation interface to Jupyter notebooks. It provides a more developer-friendly interface for configuring and extending our data science workflows. It also allows for extensions to expand the capabilities of Jupyter notebooks. And this is why we chose to build a Pachyderm-specific extension.
Versioned Data + Notebooks with the JupyterLab Pachyderm Mount Extension
The aim of Pachyderm’s JupyterLab mount extension is simple: make it feel like versioned data in Pachyderm is on your computer.
Pachyderm’s scalable data versioning system is one of the most powerful features of its platform. But interacting with that data can be cumbersome. We thought, “Rather than requiring users to run multiple terminal commands, why don’t we just make it a few clicks to get to the data you really want? Let’s put versioned datasets right at people’s fingertips.” So that’s what we did. We implemented a JupyterLab extension that selectively maps the contents of data repositories right into your Jupyter environment.
After connecting to your cluster via the login interface, you are presented with a list of the repos in your cluster. Any named branch in a repo can be “mounted” into your file system via the Jupyter environment by clicking the mount button next to the repo.
Under the hood, the extension connects to your versioned data stored in Pachyderm and simulates a mounted drive on your file system at `/pfs`. You don’t have to worry about how much data is stored in your repo: files are loaded lazily, only when you access them, so you don’t wait for an entire repo to download before you start working.
To make it easier to interact with the mounted data, the extension also provides a file browser for the `/pfs` location. This lets you explore, search, and open the data you’ve mounted in the same way you would the files on your computer.
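Because the mount behaves like a local directory, ordinary file APIs should work on it from a notebook cell. The following is a minimal sketch, not part of the extension itself: it assumes each mounted repo appears as a directory under `/pfs`, and the helper names `list_mounted_files` and `read_head` are hypothetical.

```python
from pathlib import Path

# Assumption: the extension exposes each mounted repo as a directory under
# this root. The exact per-repo layout is not specified in this post.
MOUNT_ROOT = Path("/pfs")


def list_mounted_files(repo: str, mount_root: Path = MOUNT_ROOT) -> list[str]:
    """Return the sorted relative paths of all files under a mounted repo."""
    repo_dir = mount_root / repo
    if not repo_dir.is_dir():
        # Nothing is mounted under that name (or the mount is not available).
        return []
    return sorted(
        p.relative_to(repo_dir).as_posix()
        for p in repo_dir.rglob("*")
        if p.is_file()
    )


def read_head(repo: str, relative_path: str, n_bytes: int = 1024,
              mount_root: Path = MOUNT_ROOT) -> bytes:
    """Read only the first n_bytes of a mounted file."""
    with open(mount_root / repo / relative_path, "rb") as f:
        return f.read(n_bytes)
```

In a notebook, `list_mounted_files("images")` would list the files of a hypothetical mounted repo named `images`, and `read_head` lets you peek at a file without reading it in full.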
The mount extension can be tremendously useful for data science experiments. For example, say you are a data scientist on a team trying out new techniques with data stored in Pachyderm. Without disturbing the data engineers’ workflows, you can create a branch for a data repository (`V1`) and have all the data scientists on your team mount this branch and run their notebooks against this version. Once the next version of the dataset (`V2`) is ready, any data scientist can switch to it (knowing that they can always get back to the original version) to see how effective their technique is on newer data.
Not only can they test their code on the newer version, they also don’t have to change any paths in that code, because the new version is accessible at the same location as the previous one (`/pfs`).
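To illustrate why a stable mount location matters, here is a small sketch of notebook code that never names a branch. The repo name `sales`, its CSV contents, and the `count_rows` helper are all hypothetical, and the sketch assumes the repo appears as a directory under `/pfs`.

```python
import csv
from pathlib import Path

# Assumption: a hypothetical repo named "sales" is mounted under /pfs and
# holds CSV files. The path does not mention a branch, so it stays the same
# whichever branch is currently mounted.
DATA_DIR = Path("/pfs/sales")


def count_rows(data_dir: Path = DATA_DIR) -> int:
    """Count data rows (excluding headers) across all CSV files in data_dir."""
    total = 0
    for csv_path in sorted(data_dir.glob("*.csv")):
        with open(csv_path, newline="") as f:
            reader = csv.reader(f)
            next(reader, None)  # skip the header row, if the file has one
            total += sum(1 for _ in reader)
    return total
```

After switching the mounted branch from `V1` to `V2` in the extension, re-running `count_rows()` would read the newer data with no edits to the cell.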
More to come…
Data accessibility is crucial for data scientists, but it can’t come at the expense of reliability. At Pachyderm, we want to empower users to solve problems with data in the most efficient and effective way possible, while also providing a data management platform that gives everyone confidence that they aren’t losing anything.
With the JupyterLab Pachyderm Mount Extension, we’re providing a tool that simplifies the way users interact with data stored in Pachyderm repositories. This not only improves development speed and the quality of experimentation, but also facilitates collaboration across an organization. With the extension, we can now provide simple and reliable data development environments that all point to a single source of truth.