
The Rapid Evolution of the Canonical Stack for Machine Learning

Dan Jeffries

Managing Director @ AIIA

In order for AI apps to become as ubiquitous as the apps on your phone, you need a canonical stack for machine learning that makes it easier for non-tech companies to level up fast.

Just a few years ago every cutting-edge tech company, like Google, Lyft, Microsoft, and Amazon, rolled their own AI/ML tech stack from scratch. 

Fast forward to today and we have a Cambrian explosion of new companies building a massive array of software to democratize AI for the rest of us. 

But how do we make sense of it all? In order for AI apps to become as ubiquitous as the apps on your phone, you need a canonical stack for machine learning that makes it easier for non-tech companies to level up fast.

Join us in this webinar as we cover:

  • What are the components for true MLOps
  • How do teams begin their journey into AI and Machine Learning
  • Why teams should take a data first approach to ML

Webinar Transcript

Peter: Now, without further ado, I'd like to introduce our next speaker, Daniel Jeffries. Dan is the Chief Technology Evangelist at Pachyderm. He is also an author, engineer, futurist, and pro blogger, and has given talks all over the world on AI and cryptographic platforms. He spent more than two decades in IT as a consultant and at Red Hat. With more than 50,000 followers on Medium, his articles have held the number one writer spot on Medium for artificial intelligence, Bitcoin, cryptocurrency, and economics more than 25 times. His breakout AI tutorial series, "Learning AI If You Suck at Math," along with his explosive pieces on cryptocurrency, "Why Everyone Missed the Most Important Invention of the Last 500 Years" and "Why Everyone Missed the Most Mind-Blowing Feature of Cryptocurrency," are shared hundreds of times daily and have been read by more than five million people worldwide. Please give a warm welcome to Dan, who will be giving us a talk on the rapid evolution of the canonical stack for machine learning.

Building the Default Machine Learning Tech Stack

Dan: Well, thanks for having me, Peter. Excited to get started, everybody. So today I'm going to talk about the rapid evolution of the canonical stack for machine learning. So what do we mean by canonical? We mean the default stack that everyone turns to over time, the one they turn to when they want to set up any new type of environment for their enterprise. So in the past, we've had only the big shops like Lyft, Google, and Uber building these artificial intelligence and machine learning platforms. And they were the only ones that had the mindshare and the know-how to be able to do this. And that's primarily because these platforms are radically different from the platforms that we've had in the past. And although we've had a number of different solutions develop over the last 30 years in traditional coding, and we've gotten very good at fast releases, breaking work down into sprints, breaking it down across large distributed teams around the world, some of those templates apply. And some of them just don't apply to the machine learning world, as we're going to see a little bit later.

And if you wanted to start to get into the artificial intelligence and machine learning space in the last few years, you were going to have to roll your own software and write it yourself if you wanted to compete with the big folks. But right now, we have this sort of Cambrian explosion of new companies and projects building this kind of massive array of software.

Defining Software in a Rapidly Changing Landscape

For us, the real challenge is trying to get your head around it all. It's very complex to understand, and we've seen a number of different articles come out that try to give us a whole landscape of what MLOps looks like, what artificial intelligence looks like, and sometimes those articles, even with the best intentions, just make things infinitely worse. Take a look at something like this graphic, which tends to show up in almost every one of these articles without fail. I call it the NASCAR slide, which comes from the marketing world: the author comes up with the 85 categories of machine learning and the 500 logos that cleanly fit into every one of these different categories. Of course, every marketer loves to make this slide, except nobody actually understands this slide. It doesn't really make any sense. Nobody's logo cleanly fits into any of these things, but we see them replicated over and over again because they're the easy first choice for people. You go grab a logo, you make up the categories. And so somebody is associated with computer vision, but maybe they're not associated with NLP. Somebody is associated with serving, but they're not associated with training. Meanwhile, what we start to see is that these platforms really overlap, and they're also reaching into a lot of different areas of the stack as it develops. So these guides end up being really unhelpful.

And I just read one recently, a three-part post on the MLOps space. And in many ways, it was actually incredibly well written. The challenge, of course, was trying to understand each of the pieces of software out there. Sometimes it's a problem with the marketing of those different platforms, and sometimes it's just a problem of trying to do all the research and getting your head around all of it. In addition to serving as Chief Technology Evangelist at Pachyderm, I'm also the Managing Director of the AI Infrastructure Alliance, the AIIA for short. That's 60-plus companies across the AI infrastructure space and the solutions and data space. So I talk to a ton of these companies all the time. I speak to their CTOs and CEOs. They give me demonstrations of their platforms. And even for me it's sometimes a challenge keeping my head around it. So in this example, in the MLOps space, Pachyderm just gets pigeonholed as data versioning. But really, Pachyderm is a robust data versioning and data lineage engine for the entire lifecycle, from ingestion to training. It's a partial metadata store, and it also includes an entire data engineering orchestration engine. So there are a number of different things it does effectively, but the post just says data versioning, and it does so much more than that. I just happen to know this particular one very well.

ML Tooling Defies Categorization

I also happen to know a number of other companies well, and I can tell you that a lot of them get pigeonholed incorrectly, so I end up working with the different groups out there to bring clarity to the space. It's very challenging. I also think that there's a huge challenge with this sort of ML maturity journey. So a company starts to hire somebody, and maybe it's a single data scientist, and they throw a laptop at them and say, "Go ahead. Set up your own infrastructure, download whatever you want, and make some AI magic and make us money." And of course, this really doesn't scale. What ends up happening is, the team has to mature over time. And how do you get to a team that has 50 or 60 different data scientists, a number of data engineers, different roles across the organization, and different data sources being pulled in? Maybe you're pulling from Redshift and Snowflake, plus a bunch of data lakes out there and some NFS stores. You're trying to clean this data, transform it, get it all into a common format. You're threading the RBAC needle across the board trying to make it a reality, and of course, it's incredibly challenging to get to this kind of high release velocity, incredibly challenging to get cross-team sharing and full automation. So there are tools that fit at the very beginning when you're just exploring, where you download whatever version of a Python library you need. But then when you start putting these things in production, even a slight change in that library can make a huge difference and have a tremendous ripple effect. Even something as simple as how a random number is generated, or how a sort is done by default, something you may not even see, can have butterfly-flaps-its-wings, typhoon-in-China-level ripple effects across the whole thing.
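
To make that concrete, here's a tiny sketch (my own illustration, not something from the talk) of how an invisible default, in this case an unseeded random shuffle, silently changes a train/test split, and therefore every metric downstream, until you pin the seed:

    # Illustrative only: an unseeded shuffle produces a different train/test
    # split on every run, which changes every downstream metric.
    import random

    records = list(range(10))

    def split(data, seed=None):
        rng = random.Random(seed)        # unseeded -> different order every run
        shuffled = data[:]
        rng.shuffle(shuffled)
        cut = int(len(shuffled) * 0.8)
        return shuffled[:cut], shuffled[cut:]

    train_a, test_a = split(records)     # run 1
    train_b, test_b = split(records)     # run 2: silently different split
    print(test_a == test_b)              # almost certainly False

    train_c, test_c = split(records, seed=42)   # pinning the seed restores reproducibility
    train_d, test_d = split(records, seed=42)
    print(test_c == test_d)              # True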

So you really need to start to have these tools, and they're incredibly complex. And starting to fill that need are all the companies that are coming to the market now. And again, getting your head around this is really challenging. What we try to do at the AIIA, the AI Infrastructure Alliance, is really start to give people an understanding of what the different tools look like and where they fit, in a more comprehensive way than the NASCAR slide. Now, one of the simplest things we tend to see is this kind of four-step model of preparation, experimentation, training, and deployment. Your data collection is in there, your exploration; you're putting it out on some GPUs, you're doing hyperparameter tuning, you're deploying the model, you're doing inference, you're monitoring. There's also a looping effect happening: in the experimentation and training stages you're constantly looping back and forth, and once you go to deployment, ML is never really done and you're looping back again, so there's this constant movement, a loop. It's often portrayed as something that's simply linear from start to finish, but in reality it really is a machine learning loop. Even so, a graphic like this still doesn't tell the entire story of how this works in reality. So one of the things we've started to do with the blueprint committee at the Infrastructure Alliance, a number of different companies and projects working together, is to come up with a few different types of graphics that portray a little more nuance and complexity in this kind of emerging stack. And we tend to think of this as what we call a canonical stack for artificial intelligence, and that is, again, the default stack that everyone uses. You want to think of something like a LAMP stack or a MEAN stack for artificial intelligence and machine learning.
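
As a rough sketch of that loop (illustrative names only, not any particular product's API), the four stages feed back into each other rather than running once from left to right:

    # Hedged sketch of the four-stage loop: deployment and monitoring feed
    # new data back into preparation, so the process never really ends.
    def prepare(raw):      return [x for x in raw if x is not None]   # collect / clean
    def experiment(data):  return {"features": data}                   # explore, feature work
    def train(candidate):  return {"model": candidate, "score": 0.9}   # fit on GPUs/TPUs
    def deploy(model):     return {"live": model}                      # serve + monitor

    raw_data = [1, 2, None, 4]
    for iteration in range(3):                 # in practice this loop never really ends
        data = prepare(raw_data)
        candidate = experiment(data)
        model = train(candidate)
        live = deploy(model)
        raw_data = raw_data + [iteration]      # monitoring surfaces new data; loop back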

So what you tend to see, again, at the beginning of any development ecosystem is this huge explosion of different companies, and then there's consolidation: some companies don't make it, some companies get bought, and people start to standardize on a solution that works really well, and you start to see these things work together over time; so, a canonical stack. And we're starting to see a better picture of it from high up, because we have so many different companies that we interact with on a regular basis. So my colleague Jimmy Whitaker at Pachyderm, a great data scientist, likes to say that "all diagrams are wrong, but some are useful." And what he means by that, essentially, is that no diagram is going to capture everything perfectly, so you have to take a perspective on what you want that diagram to show. In this particular case, we decided that we wanted to show a time-series workflow: where people actually spend their time. So what we tend to see is that a lot of the tools end up in the deployment and training stages, because these are areas where we already have an understanding of things from the past. We have a template from Agile and from traditional development of hand-coded logic. Those things we need to build upon, but we don't spend most of the time there. In fact, these boxes are sized differently. Instead of them all being equal like they are in the last slide, they're sized by the amount of time that you spend there, and a huge amount of that people time gets spent in the data stage. Most of the giant solutions integrators that I talk to on a day-to-day basis tell me that they spend a lot of their time just on ingestion: writing these scripts in the data stage, in the cleaning, validating, and transforming.

Before they even get to the labeling, they're pulling from 20 different data stores and threading this RBAC maze in order to get there; maybe there's synthetic data that needs to be generated. It's getting it all into the correct format. Do you have zip codes that have the extra four digits in them, or not? Do you zero them out, or do you add them to the other ones? Do you validate that there's no corruption in there? All of these kinds of things. You spend a ton of manual time in this data stage. I'd say we generally follow an 80/20 rule: 80% of our time is spent on these early stages, 20% on the deployment. But 80% of the tools tend to be in that later space, and almost nothing over here, and that's because this part is very hard. We don't have a lot of analogs to it in the traditional hand-coded logic space. In traditional hand-coded logic, if I write a login for a website, I decide all the different steps in the decision tree, I write the logic myself, and I only touch the data once or twice, when I get the username and password and validate them. But in a machine learning model, the data is the center: the versioning of it, the lineage, the ingestion, the transformation, the labeling. All of these things make a tremendous amount of difference, and we spend a huge amount of time on them, and there's just no analog to it in traditional hand-coded logic. Data is center stage, because the machines learn the features themselves. In other words, we can't hand-code the logic for the machine to recognize a cat, but we can teach the machine to figure out for itself what a cat is.
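
As a hedged illustration of that zip-code question (my example, not the speaker's), here's the kind of small normalization and validation routine that eats up so much of the data stage:

    # Illustrative data-stage work: standardize US ZIP codes that arrive in
    # mixed 5-digit and ZIP+4 formats, and flag corrupt values instead of
    # silently passing them through.
    import re

    def normalize_zip(value, keep_plus4=False):
        """Return a normalized ZIP string, or None if the value is corrupt."""
        if value is None:
            return None
        digits = re.sub(r"[^0-9]", "", str(value))
        if len(digits) == 5:
            return digits
        if len(digits) == 9:                      # ZIP+4
            return digits if keep_plus4 else digits[:5]
        return None                               # corrupt / unexpected format

    raw = ["94107", "94107-1234", " 02139 ", "abcde", None]
    print([normalize_zip(z) for z in raw])
    # ['94107', '94107', '02139', None, None]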

MLOps Categories are Still Forming

And so we spend all this time over here, and in fact, we get most of the bang for the buck right here within this stage. This is what I call the data engineering orchestration, which is separate from the data science orchestration pipeline, where you tend to be more focused on the experiments, the iteration, which algorithm you're using. This is more of the cleaning, transformation, and training flow. There's also a ton of time spent on the training stage, which again has no parallel in the traditional hand-coded logic world. That's not necessarily a lot of human time; it's a lot of machine cycle time. You're putting it on big, burly GPUs or TPUs, you're waiting for an answer to come back, you're experimenting with different versions of a model and different hyperparameters, trying to figure out which one is the best to choose, and you're looping back to the original bits that are in there. And so this is where we spend a lot of the time in the machine learning loop. That's what we wanted to illustrate in this particular diagram.
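
A minimal sketch of that training-stage loop, assuming a toy scoring function in place of a real GPU run (names are illustrative only), looks something like this:

    # Hedged sketch: sweep a small hyperparameter grid, keep the best
    # candidate, and record which settings produced it so you can loop back.
    from itertools import product

    def train_and_score(lr, depth):
        # Stand-in for an expensive GPU/TPU training run returning a validation score.
        return 1.0 - abs(lr - 0.01) * 10 - abs(depth - 6) * 0.01

    grid = {"lr": [0.001, 0.01, 0.1], "depth": [4, 6, 8]}
    best = None
    for lr, depth in product(grid["lr"], grid["depth"]):
        score = train_and_score(lr, depth)
        if best is None or score > best["score"]:
            best = {"lr": lr, "depth": depth, "score": score}

    print(best)   # e.g. {'lr': 0.01, 'depth': 6, 'score': 1.0}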

Now, what's interesting is, you can't just put a particular logo on this and say, "Great! I stuck the logo on and it fits in one area." So what we want to start to demonstrate is how a particular company, through color coding, interacts with different parts of this time-series workflow. So in this case, take one of the partners in the AI Infrastructure Alliance, Seldon, which focuses on serving engines and monitoring, drift detection, anomaly detection, dashboarding, etc. If an area has the dark color, it means complete, enterprise-level support; the lighter color means partial support. In other words, not that it's beta or experimental, although it could be, but that you may actually have multiple monitoring engines within the space. You might have monitoring that goes deeply into the training, all the inference, how things are performing over time in terms of your training, etc., whereas one group may focus on just a smaller part of the monitoring engine: just whether the model is up and running, which version it is, etc. And so that's why we wanted to use these newer types of diagrams, to help people better understand and contextualize where these different companies and projects fit. As an example with Pachyderm: we're very heavily focused on the data engineering side of the house, on ingestion. We don't do labeling, but we can work with labeling tools, and we can augment that labeling with versioning and lineage, which, if you're just a single data scientist, maybe doesn't make a difference, but it makes a big difference when you have 30 or 40 different folks working on that data over time and you need to get back to the exact version that you needed in the past.
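
To ground the drift-detection idea mentioned above (a generic SciPy illustration, not Seldon's actual API), a monitor can compare the live feature distribution against the training-time reference and alert when they diverge:

    # Generic drift check: two-sample Kolmogorov-Smirnov test between the
    # reference (training-time) distribution and live traffic.
    import numpy as np
    from scipy.stats import ks_2samp

    rng = np.random.default_rng(0)
    reference = rng.normal(loc=0.0, scale=1.0, size=5_000)   # feature values at training time
    live = rng.normal(loc=0.5, scale=1.0, size=5_000)        # live traffic has shifted

    statistic, p_value = ks_2samp(reference, live)
    if p_value < 0.01:
        print(f"drift detected (KS={statistic:.3f}, p={p_value:.2e})")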

We're also, in many ways, a partial metadata store, because when you look at it, we're a copy-on-write file system, taking snapshots over time and using a lightweight, Git-like system to track what is in those snapshots as your models, your code, and your data are all changing simultaneously. So you can see how each stage fits. But that doesn't mean, for instance, that we have metadata across the feature store, the monitoring, and all the other things that you might put into a metadata store in addition to that, so that's why we have partial coverage there. And so in this case, you start to see how the coloring fits together and flows more naturally across the diagram, which gives you a better understanding of how it all works together, as opposed to trying to neatly fit a logo into a random set of categories.
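
As a toy analogy for that copy-on-write, Git-like tracking (purely illustrative, not Pachyderm's implementation), think of content-addressed snapshots where unchanged files are shared between commits and any past state can be recovered:

    # Toy content-addressed snapshot store: each commit records file hashes,
    # so unchanged data is stored once and any past state can be looked up.
    import hashlib

    store = {}        # content-addressed blob store: hash -> bytes
    commits = []      # ordered list of {filename: hash} snapshots

    def commit(files):
        snapshot = {}
        for name, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            store.setdefault(digest, data)          # unchanged data is stored once
            snapshot[name] = digest
        commits.append(snapshot)
        return len(commits) - 1                     # commit id

    c0 = commit({"train.csv": b"a,b\n1,2\n", "labels.csv": b"y\n0\n"})
    c1 = commit({"train.csv": b"a,b\n1,2\n3,4\n", "labels.csv": b"y\n0\n"})  # only train.csv changed

    # Lineage question: what exactly did train.csv look like at commit c0?
    print(store[commits[c0]["train.csv"]])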

Now, let's take a look at another famous graphic. This graphic is often cut and pasted into people's presentations, or sent around by someone in an enterprise that's looking for some type of machine learning solution, who says, "Yeah, great, you go through this model and tell us where you fit." This model comes from Google; it's in one of their public documents on building a machine learning model. It may be a little small on the screen, but can you spot what's missing here? I'll give you the answer pretty quickly: data is missing from this. First of all, it's also a bit of an eye chart; I don't really know where things flow, or why they flow from one thing to the next. But there's no data in here. Why is there no data in here? Well, Google created it. And Google has a planetary-level file system that they've made for themselves, and universal RBAC across those systems, so they don't have to think about where the data is coming from; it's a given. But that's not true for you as an enterprise or a company trying to figure out how to do machine learning in production at scale, how to do MLOps in production at scale in your own organization. You do have to think about the data. Where is it coming from? Is it coming from these databases? Which databases? How many databases? Is it coming from a data lake, an NFS store? How many of those data lakes? How big is your storage capacity? Is it sitting in the cloud? Totally in the cloud, partially in the cloud, a hybrid of the two? How do you unify them if it's a hybrid?

And so what happens, unfortunately, is that just like the NASCAR slide gets replicated over and over again, because it seems like a good idea on the surface and there's no critical thinking happening to ask whether it makes any sense, the same thing happens with these types of slides. Somebody in procurement grabs a snapshot of the slide, sends it to you, and really, it doesn't make any sense for what it is that you are doing. So we've also created what is probably our most well-known graphic in the space. And again, it may look like an eye chart at first, but it's very logical when you start to dig into it. This is a blank version of the chart, where we've taken an abstraction of the pieces of a modern machine learning pipeline. And it doesn't matter whether Uber built it themselves, or Lyft, or Google, or whether you've built your own best-of-breed stack with different AIIA solutions, or what the cloud providers are trying to emulate, although in some respects poorly, by trying to build it all into one engine. Nobody truly has a complete end-to-end engine. Don't believe the marketing hype. I don't believe that anybody can do every single thing on this. You're going to need a number of different pieces, or Lego bricks, for many years to put together a true, wonderfully scalable machine learning platform for your data scientists and data engineers.

DataOps Can't Self-Organize Overnight

So in this case, we've broken it down into notebooks. These are going to be your Jupyter notebooks, your dashboards, where you're getting additional information about inference, monitoring, those types of things. There's, again, a data science experimentation pipeline, and that tends to be where the data scientist lives: looking at the different images that are in there or the different structured data, moving it around at a higher level, trying to understand how to extract features out of it. You might have a feature store, which tends, primarily, to hold structured data. That tends to get misused: people read about a new thing, a feature store, boom, they slam it into their architecture without fully understanding what it is. There are feature stores that can handle unstructured data, but in general they don't, because it's not meaningful. A data scientist is going to look at the 20 different features in a structured database and be able to interpret them by looking at them, but if you're just looking at a string of numbers in a vector, it doesn't mean anything to you. That means something only to the machine as it's doing visual recognition or NLP, those kinds of things. So that's use-case dependent; we marked it as use-case dependent here. You might have synthetic data. Then, again, there's a data engineering engine, an experimentation engine, a training engine, a deployment engine, and a serving engine; these are discrete pieces. They may exist in multiple parts, or even in the same binary of a single software platform, but they may not.
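
A quick illustration of that structured-versus-unstructured point (my example, not from the talk): a structured feature row is something a person can sanity-check, while an embedding vector only means something to the model:

    # Illustrative contrast between interpretable structured features and an
    # opaque embedding vector.
    structured_features = {          # a data scientist can read and sanity-check these
        "user_age": 34,
        "days_since_signup": 120,
        "purchases_last_30d": 3,
    }

    image_embedding = [0.021, -0.338, 0.914, 0.007, -0.152]   # opaque to a human reader

    print(structured_features["purchases_last_30d"])  # meaningful on its own
    print(image_embedding[2])                          # meaningless without the model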

In another case, you also have something like monitoring. Monitoring, if you notice, can go from experimentation all the way to serving and inference. You have a logging engine feeding that. But the data engineering orchestration pipeline also goes from data engineering to deployment, from ingestion to deployment. And this is, again, the kind of low-level plumbing type of orchestration engine. It's very different from the data science experimentation, where you're thinking about algorithms and features and building your model and how well it's performing, versus: I'm getting the data, I'm changing it, I'm turning it into a different format, I'm standardizing on whatever that format is, I'm compressing it, I'm moving it from one data storage area to another, I'm keeping track of the lineage of that. Again, these tend to be two very different things. And so we've built this out. And then what's even more interesting is not just how an individual company or project flows over that diagram, but how you can take two to five different projects or companies and build a complete stack of best-of-breed solutions, as sketched below. So in this case, I've taken ClearML, Fiddler, Tecton, Pachyderm, and Seldon, five founding members of the AIIA, and we've colored in where they fit and do their best work. Notice, sometimes you've got overlap.
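
To make that plumbing distinction concrete before looking at the individual vendors, here's a minimal sketch (illustrative names only, no specific product) of the kind of step a data engineering orchestration pipeline runs: ingest a file, standardize the format, compress it, move it on, and record the lineage:

    # Hedged sketch of a single "low-level plumbing" orchestration step with a
    # simple lineage record for each input/output pair.
    import gzip, json, hashlib, pathlib

    lineage_log = []

    def run_step(src: pathlib.Path, dst_dir: pathlib.Path):
        raw = src.read_bytes()
        standardized = raw.replace(b"\r\n", b"\n")             # normalize to one format
        compressed = gzip.compress(standardized)
        dst = dst_dir / (src.name + ".gz")
        dst.write_bytes(compressed)                             # move to the next storage area
        lineage_log.append({                                     # keep track of the lineage
            "input": str(src),
            "output": str(dst),
            "input_sha256": hashlib.sha256(raw).hexdigest(),
            "output_sha256": hashlib.sha256(compressed).hexdigest(),
        })

    # Usage:
    # run_step(pathlib.Path("raw/events.csv"), pathlib.Path("clean/"))
    # print(json.dumps(lineage_log, indent=2))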

So in this case, both Pachyderm and ClearML have notebook support, and in those notebooks they do various things. Whereas ClearML is very strongly focused on data science experimentation and the workflow of the data scientists, Pachyderm is very much focused on the data engineering engine and pulling in all that different data and moving it around, etc. Now, we could probably cross-color the experimentation engine and training engine, but in this case, we left them with ClearML because it's the stronger piece there and keeps track of things at a higher level. They could also, for instance, be involved in deployment. But we want, again, the best of breed: who's doing the best work here? So Seldon's deployment engine, their model repository, and their serving are the key there. A number of these platforms could cover the logging engine, Pachyderm, ClearML, but we've left that to Fiddler in this case, which is able to do extensive monitoring really across the entire thing. The metadata store, again, is a combination, and that's why we have a gradient of ClearML and Pachyderm. We could probably also add Tecton in there. Tecton is a robust feature store engine. They're also the folks behind the Feast open-source project, but their enterprise product brings a number of different things together that are different from the open-source project, in that they abstract out a lot of the communication to these backend engines. And so they have the dominant structure in the feature store here.

We don't color in things like your cloud infrastructure, your Kubernetes, your RBAC, your object stores, your infrastructure-as-a-service engine; that's your cloud or your data center. We don't color in the external data sources. You could maybe add a sixth piece: a labeling engine, or a model testing and validation framework, something like that. You could have security and compliance. So what we really wanted to demonstrate is just how much nuance and clarity you can bring to a subject when you think about it clearly, when you think about it with critical thinking, when you look at it with a broad understanding across the entire space. And that's really what the AI Infrastructure Alliance is trying to do. We're trying to bring clarity to it. We're trying to unify the entire industry around a common set of understanding. As it develops, we want to influence the waves a bit, and we want to surf the waves as this Cambrian explosion happens. And we want to help people get a better grip on these things as they're thinking about them, because you don't want to be treating this as an ad hoc project and building these things as one-offs. Uber, when they built their early Michelangelo project, thought they were building a universal engine, but they ended up building something that was really only good for the thing they programmed it for, and it didn't really move beyond that.

And in a lot of ways, what ends up happening is that too many companies come in, spin up a couple of data scientists, grab one or two open-source projects, and do one project. Then another department gets another set of data scientists and spins up its own infrastructure. And all of a sudden, you've got this crazy, nonstandardized, noncommunicating set of bespoke projects and architectures for lots of different things. Companies should start thinking about this for the next 5 or 10 years. They want to be looking at the larger big picture, looking at the best-of-breed solutions, building a stack. There's going to be some integration work; you're not going to buy anything off the shelf, no matter what the marketing hype tells you. If you look at those kinds of end-to-end solutions, like a SageMaker or whatever, they don't really have versioning and lineage. They tend to only handle structured data, no matter what they say. The parts don't really work together: this piece doesn't work with the data wrangler, that piece doesn't work with something else. So you're really going to end up doing a lot of integration yourself, whether you think so or not. So the best idea is really to take these best-of-breed solutions, put them together, test them out, and build something that can really last you for the next decade or more, the next 5 or 10 years. And then unify around this plan. Understand your business value, understand what you're trying to achieve across your different teams, and work together to build something that really matters. That is all. Thank you so much for your time. Again, I'm Daniel Jeffries, Managing Director of the AI Infrastructure Alliance, that's ai-infrastructure.org. I'm also the Chief Technology Evangelist at Pachyderm.