The Fragmentation of Machine Learning

By now it’s clear we’ve blown past the peak of the machine learning hype cycle and we’re unfortunately wallowing in the trough of disillusionment. Self-driving cars have been five years away for the last decade. IBM’s Watson looks like a burned-out child TV star going crazy on social media.

But failure isn’t the point of this post.

There are always failures in any fast-moving, dynamic system. Critics will tell you that means the whole thing is a failure, and that’s where they always get it wrong. Early digital cameras were slow and clunky. Critics said they’d never match “real” film. Now Kodak has been through bankruptcy and almost nobody shoots film anymore.

Early failures are the very things that lead to success later. Evolution is built on the back of mistakes and early iterations. Watson was an impressive achievement for its time. The fact that it looks like a failure now only tells you how fast the times change.

But there’s no doubt we’re stuck now. After some incredible early breakthroughs like conquering ImageNet, Google’s neural machine translation and AlphaGo, we’re grinding gears in the mud.

The point of this post is to look at why ML in the enterprise is stalling and to talk about how we get it unstuck.

The short answer is simple. ML is stuck because it’s been endlessly fragmented and sub-fragmented into small pieces that at best make for impressive demos, with a vast gap between them and the real world. This works for discrete problems like chess and Go. ML has soundly conquered those. A more complicated game like Dota can work too, until an update comes out and the model needs to be retrained. Dota is still a far cry from the real world and already it moves too fast.

Look under the covers at the infrastructure powering these models and you’ll see the source of the fragmentation. Most companies have several competing systems that range from proof-of-concept to production infrastructure. Of course flowing between them, and through them, is the data itself.

The data is normally the most balkanized part of the whole stack, and that’s the core of the issue, because everything in ML is downstream of data.

It’s super common to find multiple copies of the same dataset from different points in time strewn around a company’s infrastructure, or the same dataset fragmented into many smaller subsets. I’ve even talked to data teams who refer to themselves as “locksmiths” because they need to bypass their company’s internal security to get usable data.

To fix ML, fix the data.

That’s the work we’re doing at Pachyderm.

As Andrew Ng says, “Data is food for AI,” and as everyone knows, you are what you eat. We believe in data primacy: the data is the most important thing and everything else should bend to it.

Other systems get this backwards.

If we think of our data infrastructure like the Balkan Peninsula, which splintered religiously, ethnically and politically through repeated colonization and decolonization, then our data should be like the terrain itself. The terrain doesn’t change. Everything else does, but the peninsula is still a peninsula.

Most systems do the opposite: the data is a secondary consideration, subservient to the other parts of the system. That’s a holdover from hand-coded logic systems. If you code a web login, all the logic is designed by the developer and it only touches the data when it needs a login and password. But with ML, data moves to the center. The system learns its own logic from the data. That means we can’t treat data as an afterthought. Data drives everything else in machine learning and it should drive the pipelines themselves.

For data to be primary, the storage system needs six core properties. Miss one of these properties and you’ll eventually fragment your data just when you need it most:

Limitless: Data storage must be effectively limitless in scale. If it isn’t, you wind up hitting its limit and fragmenting datasets large and small.

Versioned: Data changes constantly and if your storage system can’t efficiently store changes you wind up with temporal fragmentation.

Lineal: As data is transformed into new models and data, this relationship is itself data and the system must track it easily and automatically. It should not be something your team has to think about; it should just happen.

Accessible: If the system doesn’t expose the data through common data interfaces you wind up copying the data to access it in different ways.

Parallel: If the system doesn’t process the data in parallel and merge results you wind up with another runtime storage layer.

Driving: If the data doesn’t drive the processing, another system does, and that system can easily get out of sync with the data itself.

No data store is truly limitless, but today’s object stores are endless for all practical purposes.

For this reason many orgs’ first data infrastructure is just an S3 bucket. Object stores fail on some of the other properties, but in terms of how limitless they are, they’re hard to beat. This is why modern data storage layers like Pachyderm leverage object stores as their backend and layer other features on top.

How storage size limits cause data fragmentation is easy to understand. You hit the limits so you start storing the data somewhere else. Tables that point to object store buckets, and persistent volumes floating around in your Kubernetes cluster are all signs that your team has encountered a storage limit and worked around it. Pretty soon training a model requires assembling the puzzle pieces of your fragmented storage layers. Retraining becomes impossible.

The one constant with data is change. When you say that a model was trained on a dataset you need to also be able to say which version of that dataset.

Actually, you need more than that. Most models aren’t the result of training on a single dataset; they’re the result of training on several. What you really want is a complete snapshot of all your data at the time of training.

Most data storage layers have something called versioning. These range from simple tags attached to datasets, to object versions and snapshotted persistent volumes, to full snapshots of all the data that went into a model.

At Pachyderm we’re versioning maximalists, so our snapshots capture everything and allow you to reference it as a global ID. We do this because maximal versioning means minimal fragmenting. With looser versioning systems, like the ones built into object stores, you wind up needing a constellation of versions to capture a full version, which is the beginning of fragmentation.

The other property stores must provide for robust versioning is immutability. It should never be possible for data to change once it’s recorded as a version. Otherwise the record is useless, because it no longer captures the data as it was when the model was trained.
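To make the idea concrete, here is a minimal sketch of snapshot-style versioning with immutable records. It is a toy in-memory model, not Pachyderm’s implementation: the SnapshotStore class and its commit and read methods are names invented for this illustration. Each commit records the state of every dataset under one ID, and a recorded snapshot can’t be modified afterwards.

```python
import hashlib
import json
from types import MappingProxyType


class SnapshotStore:
    """Toy in-memory store (hypothetical): each commit snapshots ALL datasets."""

    def __init__(self):
        self._blobs = {}      # content hash -> bytes
        self._snapshots = {}  # snapshot ID -> read-only {path: content hash}
        self._head = {}       # mutable working view of the latest state

    def commit(self, changes):
        """Apply {path: bytes} changes and return the ID of the new snapshot."""
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self._blobs.setdefault(digest, data)  # identical content stored once
            self._head[path] = digest
        manifest = dict(self._head)  # copy, so later commits can't touch it
        # Derive the ID from the full manifest: one ID names the whole state.
        snapshot_id = hashlib.sha256(
            json.dumps(manifest, sort_keys=True).encode()
        ).hexdigest()[:12]
        # MappingProxyType is a read-only view: writes raise TypeError.
        self._snapshots[snapshot_id] = MappingProxyType(manifest)
        return snapshot_id

    def read(self, snapshot_id, path):
        """Return the bytes of `path` exactly as they were in that snapshot."""
        return self._blobs[self._snapshots[snapshot_id][path]]


store = SnapshotStore()
v1 = store.commit({"labels.csv": b"cat,dog", "corpus.txt": b"hello"})
v2 = store.commit({"labels.csv": b"cat,dog,bird"})
# v1 still returns the old labels; v2 carries corpus.txt forward unchanged.
```

A single ID is enough to pin every input a model saw, which is the difference from per-object versions where you would need to record one version per file.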

The most important use case for versioning is lineage.

Lineage answers the question “where did this come from?”

Some systems think of lineage as metadata that gets applied to a dataset, recording where it came from. The simplest version of this is a separate database where data scientists can write metadata. This has a few major problems.

First, it makes lineage information opt-in. If you forget to write to the metadata service, your lineage becomes invisible. Second, even if you do remember to write to the service, it creates fragmentation, as this data store is now an important part of your source of truth.

Pachyderm solves this by thinking of lineage in terms of global IDs. All datasets are tied to a global ID which defines a version for every dataset stored within the system. These versions of different datasets are tied together by a global ID because they’re each other’s lineage. This makes it trivial to know the exact data that was used to produce a model, and it also lets you go the other way and ask which models were produced from a given piece of data.
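Here’s a rough sketch of what lineage keyed by a global ID can look like. Again, this is an illustration under assumptions rather than Pachyderm’s actual data model: LineageLog, record, provenance and descendants are hypothetical names. The point is that the system doing the processing calls record on every run, so nobody has to remember to write metadata, and both directions of the lineage question become simple lookups.

```python
from collections import defaultdict


class LineageLog:
    """Toy lineage index (hypothetical): one global ID ties a run's data together."""

    def __init__(self):
        self._inputs = defaultdict(set)   # global ID -> input dataset names
        self._outputs = defaultdict(set)  # global ID -> output dataset/model names

    def record(self, global_id, inputs, outputs):
        """Meant to be called by the pipeline runner on every run, not by people."""
        self._inputs[global_id].update(inputs)
        self._outputs[global_id].update(outputs)

    def provenance(self, output):
        """Where did this come from? Returns {global ID: sorted input names}."""
        return {
            gid: sorted(self._inputs[gid])
            for gid, outs in self._outputs.items()
            if output in outs
        }

    def descendants(self, dataset):
        """What was produced from this data? Returns sorted output names."""
        return sorted(
            out
            for gid, ins in self._inputs.items()
            if dataset in ins
            for out in self._outputs[gid]
        )


log = LineageLog()
log.record("g1", inputs=["images", "labels"], outputs=["model"])
log.record("g2", inputs=["labels"], outputs=["report"])
# log.provenance("model") answers the backward question,
# log.descendants("labels") answers the forward one.
```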

The next piece of the puzzle for a data storage system is how it exposes the data to user code. Data is only useful if you can access it. If it’s not comfortable to access, you wind up loading it into other systems to make it accessible. Now you have fragmented storage layers and all the fun that comes with them.

We’ve found that you can prevent this fragmentation by exposing a few common data interfaces. The big ones are the filesystem, where you access data like files on your local disk, and S3, where you access files using the S3 API. This gives you a large amount of flexibility with a small surface area. Most training frameworks are designed to read from at least one of these interfaces. An S3 API also gives you an easy way to use a simpler HTTP interface, since S3 works over HTTP.

Data processing systems need to be able to run in parallel to use more than one machine’s worth of compute resources. Different systems have different parallelization models, so it’s best for the storage system to be as agnostic as possible and offer generic ways to parallelize over the data.

Pachyderm accomplishes this with our datum model, a deceptively powerful way to express how data can be parallelized with the globbing syntax used in shells. For example, the glob pattern /* means that each file at the top level of the filesystem can be processed independently in parallel. Pachyderm’s versioning features allow it to add advanced features to its parallelism model. For example, the system can tell when a datum has already been processed and skip the redundant work. For most workloads this leads to massive compute savings and big speedups.
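The sketch below shows the general shape of glob-driven parallelism with skipping of already-processed work. It is a simplified stand-in, not Pachyderm’s datum implementation: split_into_datums, fingerprint and process_new are hypothetical helpers, files are a plain dict of path to bytes, and “already processed” is decided by hashing a datum’s contents. It only handles simple patterns such as /, /* and /*/*.

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor
from fnmatch import fnmatchcase


def split_into_datums(files, pattern):
    """Group {path: bytes} into datums, one per distinct prefix matching the glob."""
    parts = [p for p in pattern.split("/") if p]
    groups = {}
    for path, data in files.items():
        segs = [s for s in path.split("/") if s]
        if len(segs) < len(parts):
            continue  # path is shallower than the pattern
        # Match one path component at a time so "*" never crosses a "/".
        if all(fnmatchcase(s, p) for s, p in zip(segs, parts)):
            key = "/" + "/".join(segs[: len(parts)])
            groups.setdefault(key, {})[path] = data
    return groups


def fingerprint(datum_files):
    """Hash a datum's paths and contents so unchanged datums can be recognized."""
    h = hashlib.sha256()
    for path in sorted(datum_files):
        h.update(path.encode() + b"\0" + datum_files[path] + b"\0")
    return h.hexdigest()


def process_new(files, pattern, work, seen):
    """Run work() in parallel on datums whose fingerprint is not in `seen`."""
    pending = {}
    for key, datum_files in split_into_datums(files, pattern).items():
        fp = fingerprint(datum_files)
        if fp not in seen:
            pending[key] = (fp, datum_files)
    with ThreadPoolExecutor() as pool:
        outputs = list(pool.map(work, [d for _, d in pending.values()]))
    seen.update(fp for fp, _ in pending.values())
    return dict(zip(pending, outputs))


files = {"/a.txt": b"1", "/b.txt": b"2", "/dir/c.txt": b"3", "/dir/d.txt": b"4"}
seen = set()
first = process_new(files, "/*", len, seen)   # three datums: /a.txt, /b.txt, /dir
files["/b.txt"] = b"changed"
second = process_new(files, "/*", len, seen)  # only /b.txt is reprocessed
```

Threads are used here only to keep the example self-contained. The same split would work for handing datums to separate machines, because each datum is independent of the others.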

Finally, the storage system itself should drive processing, rather than an external system. When an external system drives the processing your data becomes reliant on it. Models can easily get out of date, or worse, they can get created using the wrong dataset. The most successful systems are set up so that a single write kicks off the entire processing pipeline, or schedules it for the future if immediate processing isn’t desired.

When a write to the system is all people need to understand, everything else happens automatically, which makes it very hard to mess up. Similar to how software teams can ship a new version with a simple git push, you should be able to ship a new model with a single write to your data.
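As a rough illustration of data driving the processing, the toy below wires repos together so that a commit is the only trigger. DataRepo, feed and commit are made-up names for this sketch, and the “training” step is a stand-in. A real system would also version the commits and run the work asynchronously.

```python
class DataRepo:
    """Toy repo (hypothetical): committing data is the only way work gets started."""

    def __init__(self, name):
        self.name = name
        self.commits = []     # every value ever committed, oldest first
        self._pipelines = []  # (transform, output_repo) pairs fed by this repo

    def feed(self, transform, output_repo):
        """Subscribe a pipeline: each new commit is transformed into output_repo."""
        self._pipelines.append((transform, output_repo))

    def commit(self, data):
        """Record the data, then push it through every downstream pipeline."""
        self.commits.append(data)
        for transform, output_repo in self._pipelines:
            # Committing the result triggers the next stage in turn.
            output_repo.commit(transform(data))


raw, clean, model = DataRepo("raw"), DataRepo("clean"), DataRepo("model")
raw.feed(lambda rows: [r.strip().lower() for r in rows if r.strip()], clean)
clean.feed(lambda rows: {"vocab": sorted(set(rows))}, model)  # stand-in "training"

raw.commit([" Cat", "dog ", "", "cat"])
# One write to `raw` produced a new entry in both `clean` and `model`.
```

Because nothing outside the repos schedules the work, there is no separate orchestrator state that can drift away from the data.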

A data storage system with these properties can be the foundation of an entire ML stack, rather than, at best, just one of many storage layers.

The road to ML fragmentation hell is paved with good intentions.

Once you’re there you’ll find it’s not only impossible to get anything done, it’s impossible to know what you’ve already done. Reconstructing your model’s path through your balkanized data infrastructure becomes a feat of deduction. And sometimes even deduction isn’t enough, such as when one dataset has been silently overwritten with another. In short, fragmentation is where ML productivity goes to die.

Smashing fragmentation is the path back to unity.