Treat Data with the Rigor of Code by Building Datum-Centric Pipelines

“Data” is a curious word. It’s the plural form of “datum,” but people mostly use it as singular. The phrase “the data are conclusive” sounds a bit weird, even though it’s technically correct. “Data” sees this type of use because the singular form is unwieldy and rarely used. Many native English speakers don’t even know the word “datum.” To further complicate matters, to talk about a “datum,” you have to know what defines a single indivisible unit of data, which means you have to understand what you’re using the data for in the first place. 

People get by just fine without using “datum” in everyday speech, and everyone generally understands what they mean. 

However, the data-centric AI movement put data at the center of AI/ML by bringing the rigor of code to the data. To have this conversation, we need to get a lot more precise than when talking about data in the everyday world. Code is rigorous because we describe it precisely. We don’t have to settle for asking, “What does the code do?” We can ask, “What does this function do?” or even “What does this line do?” If we want rigor for data, we must first describe it rigorously.

Rigor demands precision.

The concept of a datum is the foundation for a deeper understanding of data.

So, what defines a datum? A datum is a single indivisible unit of data. The tricky thing is that indivisibility isn’t an intrinsic property of the data itself, but of what you’re doing with it. From a pure data perspective, the smallest indivisible unit is a bit. But even in low-level assembly, this is too small to be a useful datum – typically, a 64-bit integer would be our datum there.

When it comes to AI, the definition of “datum” depends on how you train your model. 

If you’re training an image classifier, then a single image is your datum. That’s the minimum amount of data you can run the training process on. If you’re training on videos, your datum is a video clip. No matter what you’re doing, some partitioning of the data will constitute a datum. Sometimes you need all of the data to do something useful; this is frequently the case if your data represents a genome. In that case, the whole thing is one big datum.

Once you understand what constitutes a datum in your workflow, what can you do with it? The first thing it gives you is a way to understand how your data hooks up to computation. You can ask questions like:

  • How many datums do I have?
  • How long does a datum take to process?
  • How many new datums do I get every day?

This definition gives you a rigorous way to understand the relationship between your data and what you want to do with it. The magic of thinking about data this way is that the description is precise enough for the underlying system to execute and, depending on how sophisticated that system is, to optimize automatically.
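To make the idea of partitioning concrete, here is a minimal Python sketch. It assumes a dataset is just a list of file paths and that a datum is the group of files sharing a key. The function name `partition_into_datums`, the `key` parameter, and the file names are hypothetical, chosen for illustration; they are not part of any particular tool.

```python
from collections import defaultdict
from pathlib import PurePosixPath


def partition_into_datums(paths, key=lambda p: p):
    """Group file paths into datums; files that share a key form one datum."""
    datums = defaultdict(list)
    for path in paths:
        datums[key(path)].append(path)
    return dict(datums)


# Hypothetical dataset: two standalone images and one clip stored as frames.
files = [
    "images/dog1.png",
    "images/dog2.png",
    "videos/clip1/frame0.png",
    "videos/clip1/frame1.png",
]

# Image classifier: every file is its own datum.
per_file = partition_into_datums(files)

# Video model: every directory (one clip) is a datum.
per_dir = partition_into_datums(files, key=lambda p: str(PurePosixPath(p).parent))

# Genome-style workload: the whole dataset is one big datum.
whole = partition_into_datums(files, key=lambda p: "all")

print(len(per_file), len(per_dir), len(whole))
```

The same files yield four, two, or one datum depending on the key, which is the point: "How many datums do I have?" only has an answer once the partitioning is stated.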

A rigorous datum model gets more value from your infrastructure.

Every system that processes data has some notion of a datum, often given another name. Sometimes, it’s spread across multiple concepts or has other things mixed into it. Pachyderm’s approach to pipelines and versioning for MLOps treats datums as a core concept, and demonstrates the benefits a system can get from that rigor. 

A datum in Pachyderm is a single, atomic unit of work. More precisely, it’s all of the data (including the code) necessary to perform a single, atomic unit of work. Because Pachyderm approaches workflows at the datum level, how to run the computation has natural answers in those same terms.

Parallelization is the most basic question when running a workload at scale. There’s only so much you can do on one machine, even the largest machine available. Because a datum is a single unit of work, it can be processed in isolation. This has a powerful scalability benefit: when any datum can be processed in parallel with the others, the work can be allocated to any available machine. It also allows users to reason about how their work will be parallelized. For example, if a datum takes an hour to run, 100 datums will take 100 hours on one machine, 50 hours on 2 machines, and so on. At 100 machines, you’re at capacity – the 101st machine won’t have any work to do.
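The arithmetic above can be written down as a small Python sketch. It assumes every datum takes the same time, datums are independent, and there is no scheduling or startup overhead; the function name `wall_clock_hours` is hypothetical.

```python
import math


def wall_clock_hours(num_datums, hours_per_datum, num_machines):
    """Estimate elapsed time when equal-cost datums are spread across machines."""
    if num_machines < 1:
        raise ValueError("need at least one machine")
    if num_datums == 0:
        return 0
    # A machine can only help if there is a datum for it to work on.
    busy_machines = min(num_datums, num_machines)
    # Each round processes one datum per busy machine.
    rounds = math.ceil(num_datums / busy_machines)
    return rounds * hours_per_datum


for machines in (1, 2, 100, 101):
    print(machines, wall_clock_hours(100, 1, machines))
```

Under these assumptions the estimate stops improving once there are as many machines as datums, which is why the 101st machine sits idle.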

What happens when you add a new datum to your data

What happens tomorrow when you want to run it again with more data? This is where the datum model really shines. 

Let’s say you have a set of images of dogs wearing outfits for a computer vision model identifying the next big trend in canine apparel. Since you are forecasting trends, updating your dataset with the most up-to-date fashion examples will be important. Last year’s pastel dog hoodies have been processed by the model already, but in summer 2022 it’s all about dog backpacks. 

If your pipelines only see data, you’ll be processing all of your example photos again when you add the latest looks to generate a new forecast. 

With a datum-centric approach, you’ve defined the indivisible aspect of your dataset – one image of a dog in an outfit. With lineage and versioning for this dataset, it is a simple process for your system to fetch and process only the new datums, without spending processing time on previously analyzed datums. Processing only the files that are needed means more free cycles, so the cat fashion trends department can process their new datums at the same time as yours.

For workloads where datums are added and changed in small amounts over time, this leads to massive performance improvements, saving time and money. In essence, the rigor of the datum model allows the system to automatically turn any workload into a streaming workload.
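One way to picture incremental processing is a cache keyed by a hash of each datum. The Python sketch below is an illustration under that assumption, not a description of how Pachyderm implements it; `datum_id`, `process_incrementally`, the stand-in `analyze` function, and the file names are all hypothetical.

```python
import hashlib


def datum_id(name, content):
    """Identify a datum by its name and bytes, so edits yield a new id."""
    return hashlib.sha256(name.encode() + b"\0" + content).hexdigest()


def process_incrementally(datums, process, cache):
    """Process only datums whose id is not already in the cache.

    datums: dict mapping name -> bytes; cache: dict mapping id -> result.
    Returns (results for every datum, names that were actually processed).
    """
    results, newly_processed = {}, []
    for name, content in datums.items():
        key = datum_id(name, content)
        if key not in cache:
            cache[key] = process(content)
            newly_processed.append(name)
        results[name] = cache[key]
    return results, newly_processed


def analyze(content):
    """Stand-in for real model inference: report the size of the image."""
    return len(content)


cache = {}
last_year = {"hoodie1.jpg": b"pastel", "hoodie2.jpg": b"pastel-blue"}
_, first_run = process_incrementally(last_year, analyze, cache)

this_year = dict(last_year, **{"backpack1.jpg": b"canvas"})
_, second_run = process_incrementally(this_year, analyze, cache)

print(first_run)   # both hoodies are processed on the first run
print(second_run)  # only the new backpack image is processed
```

Because the id covers the content as well as the name, a changed file is treated as a new datum and reprocessed, while untouched datums are served from the cache.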

Building datum-centric pipelines is the key to operationalizing ML

But there’s one other benefit we should talk about here. More than anything, what you get with precision is a better understanding of the data itself. A datum model opens operational and efficiency doors once you internalize that processing data means processing datums. 

We applied the datum approach to dog photos above. What if there are corrupt files, or some of your files are videos instead of images? 

Then there’s an error processing one or more of the datums. Instantly, the datum model can zoom you in on the actual error. It can tell you specifically which datums errored, so you can figure out how to fix them. Maybe a failing datum exposes an edge case in your code. Good news: that datum is now a unit test for that edge case. Maybe your code is fine but there’s something wrong with the datum itself. Now you know what you have to fix.
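As a rough Python sketch of per-datum error isolation, the loop below records which datums fail instead of aborting the whole run. The names `run_datums` and `classify`, the PNG signature check, and the sample files are assumptions made for illustration.

```python
def run_datums(datums, process):
    """Run process on each datum independently; collect failures by name."""
    results, failures = {}, {}
    for name, content in datums.items():
        try:
            results[name] = process(content)
        except Exception as exc:  # one bad datum must not stop the others
            failures[name] = exc
    return results, failures


def classify(content):
    """Stand-in classifier that only accepts data starting with a PNG signature."""
    if not content.startswith(b"\x89PNG"):
        raise ValueError("not a PNG image")
    return "dog in outfit"


datums = {
    "hoodie.png": b"\x89PNG...pixels",
    "backpack.png": b"\x89PNG...pixels",
    "clip.mp4": b"\x00\x00\x00\x18ftypmp42",  # a video slipped into the dataset
}

results, failures = run_datums(datums, classify)
for name, exc in failures.items():
    print(f"{name}: {exc}")
```

The failing datum's name and bytes are exactly what a regression test needs, so `clip.mp4` can be copied into a test suite as the edge case it exposed.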

Data-centric AI is the path forward for AI over the next decade. Data is the foundation. But the more rigorous and code-like we are in our approach to data, the more solid the foundation on which we build powerful systems. It’s the difference between thinking about code generally and thinking in terms of functions. Every engineer has experienced that magic moment of finding the right abstraction for a function, one solid enough to build on.

The datum model delivers the same powerful base to hold up the AI/ML models of today and tomorrow.

Want to understand how Pachyderm enables data-centric ML applications? Take a look at our case studies.