By now it’s clear we’ve blown past the peak of the machine learning hype cycle and we’re unfortunately wallowing in the trough of disillusionment. Self-driving cars have been five years away for the last decade. IBM’s Watson looks like a burned-out child TV star going off the rails on social media.
But failure isn’t the point of this post.
There are always failures in any fast-moving, dynamic system. Critics will tell you that means the whole thing is a failure, and that’s where they always get it wrong. Early digital cameras were slow and clunky. Critics said they’d never match “real” film. Now Kodak has been through bankruptcy and film is a niche hobby.
Early failures are the very things that lead to success later. Evolution is built on the back of mistakes and early iterations. Watson was an impressive achievement for its time. The fact that it looks like a failure now only tells you how fast the times change.
But there’s no doubt we’re stuck in the mud now. After some incredible early breakthroughs, like conquering ImageNet, Google’s neural machine translation, and AlphaGo, we’re grinding gears.
Why is enterprise ML stalling, and how do we get it unstuck?
The short answer is simple. ML is stuck because it’s been endlessly fragmented, and sub-fragmented into small pieces that at best make for impressive demos with a vast gap between themselves and the real world. This works for discrete problems like chess and Go. Machine learning has soundly conquered those. A more complicated game like Dota can be tackled with ML too, until an update comes out and the model needs to be retrained. Dota is still a far cry from the real world, and already it moves too fast.
Look under the covers at the infrastructure powering these models and you’ll find fragmentation.
Most companies have several competing systems that range from proof-of-concept to production infrastructure. Of course, flowing between them, and through them, is the data itself.
The data is taken for granted. It’s seen as moving from one tool to the next instead of a part of the whole stack, and that’s the core of the issue. Everything in ML is downstream of data.
It’s super common to find multiple copies of the same dataset from different points in time strewn around a company’s infrastructure, or the same dataset fragmented into many smaller subsets. I’ve even talked to data teams who refer to themselves as “locksmiths” because they need to bypass their company’s internal security to get usable data.
To fix ML, fix the data.
That’s the work we’re doing at Pachyderm.
As Andrew Ng says, “Data is food for AI,” and as everyone knows, you are what you eat. We believe in data primacy, that is: the data is the most important thing and everything else should bend to it.
Other systems get this backward.
If we think of our data infrastructure as an island, then features and models are the villages and towns built on it: they change constantly. Our data should be like the terrain itself. The terrain doesn’t change; even as houses and farms are built, the island is still an island. Most systems do the opposite, relegating data to a secondary consideration, subservient to the other parts of the system. That’s a holdover from hand-coded logic systems.
If you code a web login, all the logic only touches the data when it needs a login and password. But with ML, data moves to the center. The system learns its own logic from the data. That means we can’t treat data as an afterthought. Data drives everything else in machine learning and it should drive the pipelines themselves.
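To make the contrast concrete, here’s a toy sketch in Python. The feature, threshold, and model choice are all hypothetical; the point is only that in the second system the decision rule comes out of the data rather than out of someone’s head.

```python
# Toy contrast: hand-coded logic vs. logic learned from data.
# The threshold, feature, and model choice here are hypothetical.
from sklearn.linear_model import LogisticRegression

# Hand-coded system: a person wrote the rule; it only *reads* the data.
def is_fraudulent_handcoded(amount: float) -> bool:
    return amount > 10_000  # a human picked this number

# ML system: the rule itself is derived from the data.
X = [[120.0], [90.0], [15_000.0], [22_000.0]]  # transaction amounts
y = [0, 0, 1, 1]                               # labels observed in the data
model = LogisticRegression().fit(X, y)

def is_fraudulent_learned(amount: float) -> bool:
    return bool(model.predict([[amount]])[0])

print(is_fraudulent_handcoded(5_000), is_fraudulent_learned(5_000))
```

Change the data and the learned rule changes with it, which is exactly why the data, not the code, has to sit at the center of the system.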
6 Core Properties of Data-Centric Storage Systems
Miss one of these properties and you’ll eventually fragment your data just when you need it most:
Property 1: Limitless Storage
Data storage must be effectively limitless in scale. If it isn’t, then once you hit a storage limit, moving data to other platforms fragments your system into large and small datasets.
No data store is truly limitless, but today’s modern object stores are effectively endless for all practical purposes.
For this reason, many orgs’ first data infrastructure is an S3 bucket. Object stores fail on some of the other properties, but in terms of how limitless they are, they’re hard to beat. This is why modern data storage layers like Pachyderm leverage object stores as their backend and layer other features on top.
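As a concrete example, many teams’ first setup is nothing more elaborate than this, a sketch assuming boto3 credentials are configured in the environment and using hypothetical bucket and key names:

```python
# Minimal "first data infrastructure": a single S3 bucket.
# Bucket and key names are hypothetical; credentials come from the environment.
import boto3

s3 = boto3.client("s3")

# Write a dataset into the (effectively limitless) object store...
s3.upload_file("train.csv", "my-ml-datasets", "raw/train.csv")

# ...and read it back later, e.g. on a training machine.
s3.download_file("my-ml-datasets", "raw/train.csv", "/tmp/train.csv")
```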
How storage size limits cause data fragmentation is easy to understand. You hit the limits so you start storing the data somewhere else. Tables that point to object store buckets, and persistent volumes floating around in your Kubernetes cluster are all signs that your team has encountered a storage limit and worked around it. Pretty soon training a model requires assembling the puzzle pieces of your fragmented storage layers. Retraining becomes impossible.
Property 2: Automatically Versioned
Data changes constantly. When you say that a model was trained on a dataset you need to also be able to say which version of that dataset. Your storage system must efficiently store changes to data, lest you wind up with temporal fragmentation.
Actually, you need more than that: most models aren’t the result of training on a single dataset, they’re the result of training on several. What you really want is a complete snapshot of all your data at the time of training.
Most data storage layers have something called versioning. These range from simple tags attached to datasets, to object versions, to snapshots of persistent volumes, to full snapshots of all the data that went into a model.
At Pachyderm we’re versioning maximalists. Our snapshots capture everything and allow you to reference it as a global ID. We do this because maximal versioning means minimal fragmenting. In looser versioning systems, like the ones built into object stores, you wind up needing a constellation of versions to capture a full version, which is the beginning of fragmentation.
For fully robust versioning, the other necessity is immutability. It should never be possible for data to change once it’s recorded as a version. The record is useless if it doesn’t capture the data exactly as it was when the model was trained.
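Here’s a minimal, generic sketch of that idea: one immutable, content-addressed snapshot covering every dataset, referenced by a single global ID. It illustrates the concept only; it is not Pachyderm’s actual implementation.

```python
# Sketch: a global, immutable snapshot ID derived from the content of every dataset.
import hashlib
import json

def dataset_version(files: dict[str, bytes]) -> str:
    """Content-address a single dataset (filename -> bytes)."""
    h = hashlib.sha256()
    for name in sorted(files):
        h.update(name.encode())
        h.update(files[name])
    return h.hexdigest()[:12]

def global_snapshot(datasets: dict[str, dict[str, bytes]]) -> str:
    """One global ID derived from the versions of *all* datasets at once."""
    versions = {name: dataset_version(files) for name, files in datasets.items()}
    payload = json.dumps(versions, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()[:12]

images = {"cat.png": b"...", "dog.png": b"..."}
labels = {"labels.csv": b"cat,0\ndog,1\n"}
print(global_snapshot({"images": images, "labels": labels}))
```

Because the ID is derived from the content, any change to any dataset produces a new ID, and the old snapshot stays exactly as it was.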
The most important use case for versioning, of course, is lineage.
Property 3: Lineage, not Metadata
As data is transformed into new datasets and models, that relationship is itself data. The system must track it easily and automatically. It shouldn’t be something your team has to think about; it should just happen.
Lineage answers the question, “where did this come from?”
Some systems treat lineage as metadata applied to datasets, recording where they came from. The most simplistic version of this is a separate database where data scientists write metadata by hand. This approach has some critical drawbacks.
First, it makes lineage information opt-in. If you forget to write to the metadata service, your lineage becomes invisible. Second, even if you do remember to write to the service, it creates fragmentation: a second data store is now an important part of your source of truth.
Pachyderm solves this by tracking data lineage with global IDs. All datasets are tied to a global ID, which defines a version for every dataset stored within the system. These versions of different datasets are tied together by a global ID because they are each other’s lineage. This makes it straightforward to say exactly which data was used to produce a model, and just as easy to go the other way and ask, “What models were produced from this data?”
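A rough sketch of what that buys you, with hypothetical names: every output records the global ID of the inputs it was built from, so the question can be answered in both directions.

```python
# Sketch: lineage as a first-class record, not opt-in metadata.
# (output, global ID of the snapshot it was trained on, input dataset versions)
lineage = [
    ("model-v1", "a1b2c3", ["images@a1b2c3", "labels@a1b2c3"]),
    ("model-v2", "d4e5f6", ["images@d4e5f6", "labels@d4e5f6"]),
]

def trained_on(model: str) -> list[str]:
    """'Where did this come from?'"""
    return next(inputs for out, _, inputs in lineage if out == model)

def produced_from(dataset_version: str) -> list[str]:
    """'What models were produced from this data?'"""
    return [out for out, _, inputs in lineage if dataset_version in inputs]

print(trained_on("model-v2"))          # ['images@d4e5f6', 'labels@d4e5f6']
print(produced_from("images@a1b2c3"))  # ['model-v1']
```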
Property 4: Data Accessibility
The next piece of the puzzle for a data storage system is how it exposes the data to user code. Data is only useful if you can access it. If it isn’t comfortable to access, users will resort to uploading it into other systems to make it accessible. Now you have fragmented storage layers and all the fun that comes with them.
Data engineering teams can prevent this fragmentation by exposing common data interfaces. This can be accomplished by exposing data through a filesystem interface, so it reads like files on your local disk, and through the S3 API. Together these give users a large amount of flexibility with a small surface area, and most training frameworks are designed to read from at least one of these interfaces.
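For illustration, here’s what those two interfaces look like from training code. The mount path, bucket, and endpoint are hypothetical; the point is that the same data is readable both ways without copying it into another system.

```python
# Sketch: the same data exposed as local files and through the S3 API.
import boto3

# Interface 1: the data appears as ordinary files (e.g. a mounted data repo).
with open("/data/images/cat.png", "rb") as f:
    image_bytes = f.read()

# Interface 2: the same data read through the S3 API
# (endpoint_url points at a hypothetical S3-compatible gateway).
s3 = boto3.client("s3", endpoint_url="http://localhost:30600")
obj = s3.get_object(Bucket="images", Key="cat.png")
image_bytes_via_s3 = obj["Body"].read()
```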
Property 5: Process in Parallel, Merge Results
Data processing systems need to be able to run in parallel to use more than one machine’s worth of compute resources. Different systems have their own parallelization models, so it’s best for the storage system to be as agnostic as possible and offer generic ways to parallelize the work.
Pachyderm accomplishes this with our datum model, a deceptively powerful way to express how data can be parallelized using the globbing syntax familiar from shells. For example, the glob pattern /* means that each file at the top level of the filesystem can be processed independently in parallel. Pachyderm’s versioning features also let it add advanced capabilities to its parallelism model. For example, the system can tell when a datum has already been processed and skip reprocessing it. For most workloads this leads to massive compute savings and speed gains.
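Here’s a generic sketch of the idea, not Pachyderm’s implementation: a glob pattern splits the input into independent datums, new datums are processed in parallel, and datums whose content has already been processed are skipped.

```python
# Sketch: glob-based datums, parallel processing, and skip-if-already-processed.
import glob
import hashlib
import json
import os
from multiprocessing import Pool

CACHE_FILE = "processed_datums.json"  # hypothetical record of previous runs

def digest_of(path: str) -> str:
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

def process_datum(path: str) -> str:
    # Stand-in for real per-file work (feature extraction, transcoding, ...).
    return f"processed {path}"

if __name__ == "__main__":
    cache = json.load(open(CACHE_FILE)) if os.path.exists(CACHE_FILE) else {}

    # The glob pattern /* : every top-level file is an independent datum.
    datums = {path: digest_of(path) for path in glob.glob("/data/*")}

    # Only datums whose content hasn't been processed before need work.
    todo = [path for path, digest in datums.items() if digest not in cache]

    with Pool() as pool:  # process the new datums in parallel
        for path, result in zip(todo, pool.map(process_datum, todo)):
            cache[datums[path]] = result

    with open(CACHE_FILE, "w") as f:
        json.dump(cache, f)
```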
Property 6: Let the Data do the Driving
If the data doesn’t drive the processing, another system does, and that system can easily get out of sync with the data itself. The most successful systems are set up so that a single write kicks off the entire processing pipeline, or schedules it for the future if immediate processing isn’t desired.
When an external system drives the processing, your data becomes reliant on it. Models can easily get out of date, or worse, they can get created using the wrong dataset.
Once your team understands that they only need to write data to the system, and everything else happens automatically, they gain the confidence and ability to take control of their projects without fear of messing something up. Just as software teams can ship a new version with a simple git push, you should be able to ship a new model with a single write to your data.
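A minimal sketch of that flow, with hypothetical names: committing data is the only action a user takes, and everything downstream hangs off that single write.

```python
# Sketch: a single write to the data kicks off the downstream pipeline.
subscribers = []  # downstream steps registered against the data

def on_new_commit(fn):
    subscribers.append(fn)
    return fn

def commit(dataset: str, files: dict[str, bytes]) -> None:
    """One write: persist the data, then trigger every downstream step."""
    # ... persist `files` to versioned storage here ...
    for step in subscribers:
        step(dataset, files)

@on_new_commit
def retrain(dataset: str, files: dict[str, bytes]) -> None:
    print(f"retraining on new version of {dataset!r} ({len(files)} files)")

# Shipping a new model is now a single write, analogous to `git push`.
commit("images", {"cat.png": b"...", "dog.png": b"..."})
```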
A data storage system with these properties can be the foundation of an entire ML stack, rather than being, at best, just one of many storage layers.
The road to ML fragmentation hell is paved with good intentions.
Once you’re there you’ll find it’s not only impossible to get anything done, it’s impossible to know what you’ve already done. Reconstructing your model’s path through fragmented data infrastructure becomes a feat of deduction. And sometimes even deduction isn’t enough, such as when one dataset has been silently overwritten with another. In short, fragmentation is where ML productivity goes to die.