At Pachyderm, we just released our hosted service for public beta. This post is a reflection on that journey.
Kevin Delgado - Software Engineer
Background and Motivation
Pachyderm is a platform for managing data. It provides intuitive semantics for versioning data (similar to what git provides for source code) and a straightforward way to interact with this data via composable “data pipelines” where each pipeline is simply some code that has been containerized.
Pachyderm runs on top of Kubernetes. Kubernetes is the gold standard for sanely managing tons of containerized applications, whether they are in the cloud or on your own hardware.
Until now, to use Pachyderm’s suite of commands and APIs, users needed their own Kubernetes clusters. This proved to be a large barrier to entry, especially for those without ops/infrastructure experience. While Kubernetes might be a prereq for running Pachyderm, using Pachyderm should ideally not involve directly interacting with Kubernetes at all.
For years we have been aiming to build a hosted Pachyderm service, Pachyderm Hub – or simply Hub – which would give any data scientist, regardless of their Kubernetes expertise, the opportunity to try out Pachyderm and see if the paradigm we offer for managing data makes sense for their use case.
From January to April of this year a couple of disconnected efforts attempted to get Hub off the ground. One engineer endeavored to create a proof of concept UI around the user story of launching clusters from the browser, known as Pachapult. It spiked out a user’s holistic experience of logging into Hub and launching a cluster linked to their profile. In doing so, it launched simple, self-hosted Kubernetes clusters that were barely powerful enough to make it through the basic Pachyderm tutorial.
A separate effort, known as Pachuchet or Chet, grew out of our frustration with not having a push-button way of launching production-grade Pachyderm clusters for load-testing changes to our core storage layer. Unlike Pachapult, which focused on user experience and paid little regard to the strength of the cluster it launched, Chet had no front-end and focused on generating a usable Pachyderm cluster (one capable of running a simple yet robust load test) at the push of a button.
By April, it was still unclear whether Chet would evolve into the backend for Hub or remain a simple service primarily used for load testing, and likewise whether Pachapult would evolve into the front-end for Hub or remain a simple demo app. Either way, we felt an API spec would be a necessary step. Collaborating on an API spec brought out discussions that clarified a lot of this uncertainty, and the efforts began to converge. The engineer working on Pachapult continued to focus on the UI, leaving the cluster-launching side of things to Chet, which was starting to evolve into a full-fledged API. Many challenges remained ahead, but overall there seemed to be a much clearer and more unified vision of where things were going.
Going forward, from engineering team buy-in to initial launch, we saw an overarching theme emerge in our decision-making process. As with most MVPs, we often picked the faster option over the cheaper, more efficient, or more elegant option.
Managed Kubernetes vs Self-Hosted
Our hosted Pachyderm clusters run on Kubernetes clusters. One of the very first decisions we grappled with was whether to manage these Kubernetes clusters ourselves on cheap compute nodes or to use a managed Kubernetes service such as GKE or EKS. Pachapult, the early attempt at a UI for hosted Pachyderm clusters, launched its own Kubernetes cluster on a single DigitalOcean droplet. Note that this was not DigitalOcean’s managed Kubernetes service, but our own Kubernetes running on DigitalOcean nodes.
The main advantage of self-hosting is cost. We would have more control over using expensive compute resources as efficiently as possible (and avoid the extra managed Kubernetes cost). There was even some talk initially about self-hosting something like k3s, a lighter-weight version of Kubernetes, because a cluster running only Pachyderm requires limited Kubernetes functionality. Ultimately, we felt the development and maintenance time saved by managed Kubernetes outweighed the cost savings at this stage and chose to go in that direction.
Multi-cluster vs Multi-tenancy
In a similar vein, we needed to decide whether to run multiple Pachyderm clusters in one Kubernetes cluster or to spin up a separate instance of managed Kubernetes for each Pachyderm cluster. Obviously, one Kubernetes cluster per Pachyderm cluster is hugely inefficient and far more expensive than multi-tenancy (that is, many Pachyderm clusters on one Kubernetes cluster, cordoned off via namespaces).
We prioritized the security of users’ data. With separate clusters, we can rest easy knowing that there is no way for things to go awry such that information leaks between clusters. This alone was enough to tilt the scales, and thus we pushed forward, resolute in our decision (at least initially) to offer separate managed Kubernetes instances as a service, or dare we say it, Kubernetes as a service as a service.
GKE vs EKS, push vs pull semantics, client libraries vs shelling out to kubectl and pachctl
With these two big decisions out of the way, a few remaining notable trade-offs were made on the route to our initial launch. We could have chosen either GKE or EKS for our managed Kubernetes service, but the initial cluster launcher (Chet) used GKE and we had no compelling reason to switch.
Because spinning up cloud resources is long-running, we had to decide whether updates about the state of launching, deleting, or changing clusters should be “pushed to” or “pulled by” the front-end. Pachapult had originally been built with push semantics, while Chet was using pull semantics. We chose pull semantics, largely because we were keeping Chet as the Hub cluster-launching solution.
Lastly, one of the more painful decisions we made was to shell out to kubectl and pachctl for many of the calls that spin up, tear down, and manage a running Pachyderm instance. We eventually hope to use native client libraries, but some missing features in both Kubernetes and Pachyderm made it much easier to just call kubectl/pachctl using Go’s os/exec package. (At the time of writing this blog post, a pull request is up for getting rid of all uses of pachctl and transitioning to the Pachyderm Go client instead.)
This post could easily be three times as long if we dove into the details of even more trade-offs we made, but these give a slice of what was top of mind when building out a service that launches Kubernetes-as-a-service instances on demand.
Where We Currently Stand
Having glossed over tons of details along the way, we can now look at the culmination of these trade-offs and see where we currently stand.
When you push the “Create Cluster” button in the Hub UI, it hits the CREATE endpoint in our gin API server. Like any good CRUD app, this creates a record in a database (in our case, Google’s Cloud SQL for Postgres). Meanwhile, we have a separate background process we call our “control-plane,” which is where most of the heavy lifting happens. The control-plane iterates through every living cluster, checking its current state (e.g. has the underlying GKE “leaf” cluster been created yet, and if the leaf cluster is running, has Pachyderm been deployed to it yet), and moving it along a state diagram towards an eventual state where GKE is running, Pachyderm is running on top of it, Pachyderm auth has been locked down such that only the right user can access it, Istio resources have been deployed such that we publicly expose pachd (the Pachyderm daemon that you connect to when you run pachctl), and the connections to pachd and the dashboard are secured via TLS encryption.
As the front-end is constantly calling the GET endpoint on a given cluster, each response returns the mostly unchanging metadata about the cluster along with some more dynamic fields, such as the status of the cluster and the endpoints that give the end user access to pachd (for pachctl access) or the dash (for GUI access). Upon receiving this info from the backend, the front-end exposes it to the end user.
Deleting a cluster is very similar but in reverse. The DELETE endpoint marks the cluster for deletion in the database, and the control-plane then picks up on the fact that this cluster wants to be deleted and proceeds to sweep all the cloud resources associated with the cluster.
Each direction (creation and deletion) currently takes somewhere between five and ten minutes. The main bottleneck is the time it takes Google to provision or tear down a GKE cluster. One optimization we have in this initial release is a pool of pre-warmed clusters that are kept in a fully ready state. When a user requests a new cluster, we simply assign them one from the pool, and they get access to it almost instantly.
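A pre-warmed pool boils down to a queue of ready clusters plus a target size. The sketch below is an assumption about how such a pool could be shaped, not a description of Hub's code; WarmPool, Assign, and Deficit are hypothetical names, and cluster IDs are plain strings.

```go
package main

import (
	"fmt"
	"sync"
)

// WarmPool holds IDs of clusters that are fully provisioned but not yet
// owned by anyone.
type WarmPool struct {
	mu    sync.Mutex
	ready []string
	owner map[string]string // cluster ID -> user
}

func NewWarmPool() *WarmPool {
	return &WarmPool{owner: map[string]string{}}
}

// Add registers a freshly provisioned cluster as available.
func (p *WarmPool) Add(id string) {
	p.mu.Lock()
	defer p.mu.Unlock()
	p.ready = append(p.ready, id)
}

// Assign hands the oldest ready cluster to user. ok is false when the
// pool is empty and the caller has to fall back to slow provisioning.
func (p *WarmPool) Assign(user string) (id string, ok bool) {
	p.mu.Lock()
	defer p.mu.Unlock()
	if len(p.ready) == 0 {
		return "", false
	}
	id, p.ready = p.ready[0], p.ready[1:]
	p.owner[id] = user
	return id, true
}

// Deficit reports how many clusters must be provisioned in the
// background to bring the pool back up to target.
func (p *WarmPool) Deficit(target int) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	if d := target - len(p.ready); d > 0 {
		return d
	}
	return 0
}

func main() {
	pool := NewWarmPool()
	pool.Add("cluster-1")
	pool.Add("cluster-2")
	id, ok := pool.Assign("alice")
	fmt.Println(id, ok, pool.Deficit(2)) // prints "cluster-1 true 1"
}
```

The trade-off is cost: every cluster sitting in the pool is paid for while idle, so the target size is a bet on how many sign-ups arrive within one provisioning window.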
All this is subject to change. There is a high probability that when you are reading this, none of the above is true, but at least for now, this is where we are.
It’s funny to look back on all the features that we assumed would be part of our initial launch but inevitably fell outside of its scope. Most egregiously, the ability to update clusters (add/remove nodes, modify the instance type of nodes, scale bucket storage) fell by the wayside as we stalled in debates over how much control users should have over the underlying Kubernetes cluster their Pachyderm instance runs on. Was it worth giving users the ability to increase, decrease, or update resources, or should we instead try to intelligently and automatically match resources to the data and workloads on the cluster? We ultimately decided not to allow any modifications to the cluster as part of our initial launch, and are already hearing feedback from private beta users that this is a desired feature.
There was also an expectation that clusters could be shared and collaborated on by multiple users. This was nixed when the added complexity of making clusters accessible to more than one user pushed the project timeline far beyond where we wanted to be for an initial launch.
Finally, running multiple versions of Pachyderm (and potentially even on multiple clouds) was initially in the scope of the first load-testing (Chet) feature set. This is in active development, as it is still a necessary feature for easily running benchmark tests, but it did not make it into the initial Hub offering.
With the speed at which Kubernetes seems to be taking over modern infrastructure, we are curious about what the current landscape looks like for services hosted on Kubernetes, offered as a service to end users, while abstracting away the Kubernetes layer. We are fairly certain we are not the first to try this and we definitely won’t be the last, but so far we have found little writing on the matter.
We are also curious to hear whether others, like us, ate the cost of spinning up a separate managed Kubernetes cluster for every instance of the hosted service for their MVP, and whether they were able to successfully transition to a multi-tenant, multiple-end-services-running-on-a-single-k8s-cluster model.
Was this the right decision, or are we dooming ourselves to a prohibitively costly MVP that would have been better off getting multi-tenancy right from the beginning, at the cost of getting a product out much more slowly? Is the simplicity of spinning up a Pachyderm cluster at the push of a button enough to expand the user base to those who viewed managing a Kubernetes cluster as a non-starter for using Pachyderm? Is the speed at which the project was pushed out the door condemning current and future engineers on the project to a painful life of relentless on-call firefighting?
Go ahead and give it a try at hub.pachyderm.com and help us begin to answer these questions.