Managing Director @ AIIA
Data-centric AI marks a dramatic shift from how we’ve done AI over the last decade.
Instead of solving challenges with better algorithms, we focus on systematically engineering our data to get better and better predictions. But how does that work in the real world?
It’s one thing to define data-centric AI but it’s another thing all together to make the shift to a data centric approach.
In this talk I’ll walk you through how to solve problems right in the data, with data augmentation, synthetic data, re-labeling and more. You’ll learn how to shift your mindset to creatively solving problems in the data instead of looking for magical fix from a yet another new model.
Fabiana: Hi, everyone. Great to have you all here today. I'm Fabiana, Chief Data Officer at YData and also one of the founding members of the Data-Centric AI Community. And I have here with me Dan Jeffries, the chief technical evangelist at Pachyderm and also managing director of AI Infrastructure Alliance, to talk us through what is practical data-centric AI. Hi, Dan.
Dan: How are you doing? I'm happy to be here with you. So let's have some fun today.
Fabiana: Yes. Yes, for sure. And happy to be, again, on a webinar, especially around the topic that I-- well, it's so dear to us. And maybe we can kick off exactly with that. What is, for you, data-centric AI? Or how would you describe it, to be honest?
Dan: I mean, we talked about this a bit in the presentation. I think we all know Andrew Ng's thinking about this. But for me, data-centric AI is threatening to turn into a buzzword without meaning. And I think it has-- but I think it actually has real meaning, right? But it has to be something that guides how people do things in the real world, right? And so, oftentimes, we kind of see sort of vague proclamations of data-centric AI, right? But there are actually no steps that anybody can take to do it, or what does it mean, right? It's a bit like-- it's a bit like saying the word God, right? It's like a Muslim, an atheist, a Christian, a Buddhist, they all hear something different, but we all think we're talking about the same thing. So it's a [double set?]--
Fabiana: Join this movement and [regret?]. Yeah, yeah, yeah. [laughter]
Dan: Yeah. That's right. But it's totally different, right? Really data-centric AI is about focusing on the data itself and solving the problems there. And when we look at the vast history of artificial intelligence, it's all been focused around the model. If there's a problem in the data, the data's noisy, if there's missing pieces of the data, make the model better, change the model, tweak the hyperparameters, do all of the work there. It was sort of assumed that the data would just be imported. It would be perfect. It's like Boston housing prices, right? And then you just go, and it's already perfectly done.
But the real world doesn't work like that. You're pulling data from 10 different sources, enriching it, flipping it around, running it through a transcoder, whatever. There's 20 different things that need to happen. And then, from that standpoint, there's also noisy data and data that needs to be imputed, it's missing, all this kind of stuff, right? It needs to be transformed into another format. There's tons of these things that need to be done. So it's really a focus back and where we end up spending 90% of the time anyway. And that to me is really what data-centric AI is.
Fabiana: That's interesting, especially one of the things that you mentioned. Of course, we have had this concept of garbage in, garbage out for quite some time. We've all heard about it. But we also hear the assumption that if you iterate on the model, the problems the data has will just go away or somehow be mitigated by the model itself. So I sense that we have two different paradigms here. So for the ones listening to us, how would you describe or compare them? Side by side, what's the biggest difference between the two?
Dan: I mean, we talk about-- I mean, we've got a good chart with that right in here, right? And if we look at that-- and that came from Andrew Ng's model-centric versus data-centric comparison-- a chart from Andrew is one of the slides in there, so I credit Andrew with it. But I thought it was a good example of, in model-centric, you work on the code as your central objective.
In data-centric, you're working on the data as the central objective, right? In model-centric, you're optimizing the model so it can deal with the noise. But in data-centric, you don't just gather more data. You try to fix the noise in the data itself, right? If the labels are off, in model-centric you try to work around it and assume that some of the data is screwy. In data-centric, you go back, clarify your instructions, and fix the labels, right? So it's really a-- I actually think it's more of a data engineering-focused approach to things. And it can help bring in a lot of IT folks and programmers, the people who would have been outside of data science, right, other than as the IT infrastructure providers to the data scientists, right?
At Pachyderm, there's an example. I think it was at LivePerson, where the infrastructure folks were using Pachyderm to process all of the audio and transcribe it. And they were doing it all in parallel. It took, I don't know, something like seven weeks to process all the new audio that would come in. And with Pachyderm, they could do it in seven hours or so. That was a huge move, but it was basically a hand-off, right? And then the data scientists would do what they do with it. They'd be looking for features and iterating on the models and all that, but they kind of missed that whole part.
Dan: I think data-centric AI is kind of unifying these two aspects of things. It's getting those teams to work closer together. It's getting them to creatively think about the problems together and to not see it as something where you just, "Oh, well, we process the audio. We throw it over the fence to the data scientists." And the data scientists are like, "Cool. I don't know what that is. That's a black box to me. I assume the data's correct, and I'm just going to then iterate it on the model."
As AI grows in the real world, just like we saw with DevOps in Agile and everything else where you start to see these crossover teams where you need to know a little bit about programming. You need to know a little bit about systems infrastructure. You need to know a number of different kind of toolsets to do things. And you have to have this interdisciplinary thinking to solve these problems. I think that's where this is going as these kind of teams grow, this cross-disciplinary approach to solving problems.
Fabiana: Okay. So in that sense, what happens is not so much that, in your opinion, we will see the positions or the definition of data science change, but rather the way that these roles relate and interact with each other. Is that correct?
Dan: Yeah. That's absolutely correct, right? I still think that there's kind of different skill sets that are hard to have in a single person when you build a large team. If you're putting thousands of models into production, it's one thing to hire a single data scientist, throw a laptop at him, and go, "Go do some AI magic." It's kind of nonsense. It's the same way it's nonsense in traditional coding in IT. There's always a joke in IT where-- you know there's a downturn in the economy when they're looking for someone who's a master Cisco engineer, plus a Kubernetes infrastructure backend, plus they know six languages, and they can code in low-level C, and they could do desktop support. This person doesn't exist, right? And so there are discrete levels of skills, right? I mean, your math is always going to be better than mine. Your understanding of looking at algorithms and how they work is always going to be better than mine.
My ability to look at a brand-new tool, a brand-new system, abstract out how it works at a functional level, stand it up, make it work and deliver it to you so that you can kind of play with the algorithms, those are two different skill sets. But I believe that they're complementary. And as teams grow, you've got to break down the walls and allow these teams to work together a lot more closely in order to solve problems in the real world.
Fabiana: It sounds like, in the end, we are following the same path software engineering followed a few years ago. But now we are doing the same in the ML space. There is this need to define the relations between what is an ML engineer, what is a data scientist, and what is a data engineer. How do they relate to each other at each and every step of AI development?
Dan: Everything old is new again, right? I mean, the abstract patterns of how we do things repeat over time, right? If you look at the concept of an abstract data factory, for instance, which is like-- if you look at software development, it's like first, we figure out how to do something in a really janky fashion, and then we refine it. Then we start to abstract it up. And then we start to abstract the abstraction, right? There's a pattern that develops over time. So that's how you get to Kubernetes, for instance, right? Kubernetes is a platform that allows you to run any kind of distributed application. It doesn't need to know what kind of application you need to run, whereas if you go back in time a little bit to something like Spark, Spark is very specific, right? And now we look at Spark, and a lot of folks go, "Wow, that's an AI platform." No, it's a big data platform that's pivoted to AI. And it was designed to process data. And so it does that at scale really, really well, but mostly structured data. But you can't go throw reinforcement learning on top of that, right? You're not going to step outside of its MapReduce-style framework. But Kubernetes has a further level of abstraction, right? You could process data at scale inside of Kubernetes, right? Or you could build a distributed web application or a reinforcement learning application because you abstracted the abstraction. And so these kinds of patterns happen again and again. And history rhymes. It doesn't perfectly repeat. So in other words, you can't just take DevOps, copy it over or whatever, and go, "Bam, we've solved all the problems in artificial intelligence," right? But there's a level of-- and I talked about this coming up here, kind of these sort of six keys, I think, to practical data-centric AI in the real world.
Fabiana: Yeah. And that's exactly what I wanted to touch on next. In practical terms, what are the changes or adoptions that organizations or teams should definitely make in order to be more data-centric?
Dan: So look--
Fabiana: Which goes to these six points. Yeah.
Dan: Yeah. So these six points are really it, right? And I'll define what all of these are. Well, we can talk about them. But it's creative thinking, it's synthetic data, it's data augmentation, it's a new set of tooling, it's a new type of testing, and it's also clarifying instructions. We'll look at each one. And creative thinking, we've already kind of touched on a little bit. But it means that you're going to have to think about the problems differently, right? And it means each problem requires a unique creative solution. And it involves a different set of skills, right, than data science skills. You may need an IT person. You may need an audio engineer or a programmer or a graphics designer. Let's take a good example there. Let's say that I'm trying to create a model that can pick out voice commands in the car. And I find that the model's really struggling with noise in the background. But I just don't have enough samples of that noise. Maybe I've got some wind, an occasional radio in the background, maybe the kids shouting, but I don't have enough samples of that. Now, the typical way that you're going to solve that is to just try to make the algorithm smarter at dealing with background noise, filtering the noise better, right?
Coming up with a way to just segment it based on the hertz or whatever it is, right, and only listen to this narrow band where humans are speaking. But in the creative thinking process, you would solve it, potentially, in a different way. And it opens up new possibilities. I could go take a bunch of samples. So there's all kinds of open-source samples in commercial libraries of background noise. I could go grab a bunch of music from somewhere on everyone's MP3 playlist. I could grab a bunch of background noise of kids shouting. And I could synthetically create those datasets, right?
Dan: I could do that by just taking all the samples that I have and cloning them and using an audio engineer to just put some samples in the background and then have a programmer team basically automate the process, right, so that it's like, "Okay, now I've got a ton of these different samples." They're the same samples. Now I've got a ton of samples with a bunch of background noise, the kids saying, "Are we there yet? Are we there yet?" Bunch of music, bunch of wind patterns, these kinds of things. And now I might still tweak the algorithm.
I may still tweak the AI frameworks or change what I'm doing or try to refine it. But I'm probably not a research team coming up with, "Oh, can I come up with something completely different that only focuses on this frequency band?" And so I'm probably working with the state-of-the-art that's already based on things that I'm consuming. I'm a second-level sort of data scientist. I'm consuming the stuff that comes out of DeepMind and OpenAI and Ian Goodfellow, whatever they're thinking up, right? And I'm not necessarily writing my own new stuff. And there's maybe another tier above that that's able to write some of that stuff. But mostly, we're not. We're going to be taking a lot of different models that we've got, right? We're going to read the papers. We're going to grok them. We're going to implement them in code. And so solving it at the algorithm level is almost unrealistic for many teams out there. But solving it with audio engineering and programming or whatever, that's, I think, within the skill set of a vast chunk of teams, and it's totally untapped. It's totally untapped, and it shouldn't be.
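As a rough illustration of the approach Dan sketches here, the snippet below layers background-noise clips under clean voice-command recordings at a few signal-to-noise ratios. It is only a sketch: the directory names, SNR values, and the use of NumPy and the soundfile package are assumptions for the example, and it presumes mono WAV files at a shared sample rate.

```python
# A rough sketch of "clone the clean samples and layer background noise under
# them" as an automated batch job. Paths, SNR values, and libraries are
# illustrative assumptions.
import glob
import os
import random

import numpy as np
import soundfile as sf  # pip install soundfile


def mix_at_snr(speech: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Overlay noise onto speech at a target signal-to-noise ratio in dB."""
    # Loop the noise clip so it covers the whole command, then trim it.
    reps = int(np.ceil(len(speech) / len(noise)))
    noise = np.tile(noise, reps)[: len(speech)]

    speech_power = np.mean(speech ** 2) + 1e-12
    noise_power = np.mean(noise ** 2) + 1e-12
    # Scale the noise so the resulting mix sits at the requested SNR.
    scale = np.sqrt(speech_power / (noise_power * 10 ** (snr_db / 10)))
    return np.clip(speech + scale * noise, -1.0, 1.0)


commands = glob.glob("clean_commands/*.wav")   # hypothetical clean voice commands
noises = glob.glob("background_noise/*.wav")   # wind, radio, kids shouting, etc.
os.makedirs("augmented", exist_ok=True)

for i, cmd_path in enumerate(commands):
    speech, sr = sf.read(cmd_path)
    noise, _ = sf.read(random.choice(noises))
    snr = random.choice([0, 5, 10, 20])        # vary how loud the background is
    sf.write(f"augmented/sample_{i:05d}_snr{snr}.wav", mix_at_snr(speech, noise, snr), sr)
```

In the division of labor Dan describes, an audio engineer would choose the noise types and levels, and a programmer would wrap something like this into the automated process.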
Fabiana: Yeah. So, essentially, adopt more resourceful thinking around how to solve a problem rather than, "How can I use a model to overcome this?" That's essentially the message, right? So regardless of what the problem is, there are many solutions, and we just have to go for the ones that we are able to deliver in the end.
Dan: You need to have a brainstorming session at the beginning. "Let's put down 50 crazy ideas or 10 crazy ideas. Let's whittle them down, right, and go, 'Wait a minute. They're going to take--'" be totally open at the beginning, right, and adopt a creative mindset of these things where you've got a ton of different ideas. And then you're going to go through, and you're going to whittle them down and go, "Wait a minute. This is super resource intensive. This isn't going to work," whatever. But you're going to hit upon three or four different ways to kind of solve these things. I think teams are going to need to get a lot better at this over the next decade, even if they don't know it yet. It's going to happen naturally that folks are going to have to learn to solve these problems in a unique way. And that's just one example. There's many, many different examples of how you could go and think about this kind of creatively fixing the data and tweaking the model simultaneously. And I think that kind of--
Fabiana: [crosstalk] that--
Dan: Go ahead.
Fabiana: Yeah, yeah, yeah. No, I was just saying that for that thinking to happen, the responsibility for getting the data into good shape definitely stretches to areas other than just data science. And making your organization more open to these questions, "How is the data? How can the teams be more interdisciplinary?" I think that's key, right?
Dan: That's absolutely the key. And I think the faster we get there, the better off we are, right? And I already showed that model in the deck. We'll just move past it; it's all in the slides, so you'll get the slides later. But I won't just read through this stuff, right? We've already talked about this in terms of the audio engineering or whatever. But what's interesting about the example that we already used is that I've already hinted at two other things, right? I've already hinted at synthetic data and data augmentation, which are the next two steps. And they're different things, right? One is synthetic data itself, right? And you and I have talked a lot about this offline. And the new report that's coming out from the AI Infrastructure Alliance is covering the whole industry and ecosystem, its maturity, its capabilities. It's going to be very exciting. Everyone's going to be reading this in a few months. It's going to come out in June or July. But with synthetic data, there are a couple of ways to approach it. And one of the things that's interesting is when we look at areas like computer vision or audio, where they've been able to leverage tools, for instance, that came out of video games and audio engineering and film, right? And we got so good at being able to do this that now we can actually generate novel synthetic data in many of these areas, right? And that's exciting. So, for instance, we can go back to our audio example. That's really wholly synthetic data because we're grabbing a bunch of samples, and we're making new ones, right? We're combining things, but that's really synthetic. We're inventing the background noise, and it doesn't even need to be 100% perfect.
Dan: But there are other examples. If I wanted to create an algorithm that's good at detecting boat crashes, I probably couldn't realistically gather enough data on boat crashes-- it would be a time problem, the amount of time it would take to gather that data. And I just probably wouldn't get enough of it because there's generally good boat safety, and that's great. We don't want to be encouraging more boat crashes just to get data, right? But theoretically - and you look at some of the visual engines that have come out, Unity and a lot of these other engines that are out there - they could generate, with animators, essentially, boats crashing in lots of different ways. That's wholly synthetic data. And that can really be a game changer in terms of how we think about some of these problems. And I know you probably have a lot to say on this subject since that's what you folks do.
Fabiana: For sure. But you touched on one thing I really would like to emphasize, which is that sometimes the synthetic data produced just needs to be good enough, because you are dealing with cases where you have very few examples or none at all, but you have to start somewhere. I think the point that is sometimes harder for teams to understand about how synthetic data can help them is exactly that: when can I live with good enough? And that goes to the point you mentioned as well. There are several situations: boat crashes, hurricanes, the prediction of, I don't know, very rare events such as earthquakes, where maybe we don't have the real data, or the real data is pretty rare. But you can do something with something external or something produced synthetically, right?
Dan: Yeah. Manufacturing is one of the ones that jumps to mind, right? One of the things that Andrew's most focused on, right, is that you don't-- I mean, his entire company is focused around the manufacturing side. And there's good reason that his great brain thought about that. It's, "Whatever widget I'm producing, if I have an equal number of broken widgets, then I'm a pretty terrible manufacturer." Right? In other words, I should not have-- I should not have a lot of samples of mis-stitched shirts or smashed glass bottles, whatever it is I'm manufacturing, solar panels. In fact, we look at that a little bit later, right? And so I'm going to have to generate a lot of that data synthetically, right, in order for that to be something that-- in order to be able to even have enough references in order to be able to do something interesting with that data, right, because again, there's time infeasibility.
Dan: We used to talk about computational infeasibility in encryption, right, meaning theoretically, if you had all the processing power in the world and a futuristic quantum computer, you could run the calculations long enough to break this code. But actually, based on today's world and the amount of computing power we have, you can't, right? And so the same sort of thing here is it's a time infeasibility, right, in that you're not going to be able to get enough broken widgets, I guess, unless you go send your people to the floor and start breaking them, right? Right? But again, that's time consuming too, right? And then you're hiring a photographer and all these kinds of things to do this. So this is sort of-- this is where synthetic data really becomes interesting. And there are some limitations to different sides of it, right, I mean, in whether we can always generate a novel condition or those kinds of things. But again, the synthetic data's only going to kind of increase over the coming years, and its ability to kind of invent the world, if you will, is only going to get better.
Fabiana: Yeah. Well, I definitely have to agree. And the fact that we see reports even from organizations such as Gartner stating exactly that synthetic data is probably one of the next big things, I guess, proves exactly the point you're sharing here today.
Dan: Right. We just saw Anthem, I think, putting out a big press release talking about how much they were planning to use it for fraud detection. And that's a big company looking at these kinds of things. But then, look, this gets us to number three, which is that you don't always need the resources for generating synthetic data, or getting familiar with those libraries, or having a set of animators or whatever. You might just need some simple data augmentation. And I do think this starts to fall into classical data science skills, right? I tend to use visual examples, but again, this applies to structured data, textual data, visual data, any kind of data, right? And the simple example is I can build more instances by flipping the car around, right, so the model learns to see it from a few different angles. I might be able to do, like we were talking about with synthetic data, some of the things that are more possible with structured data: feeding in the data, getting the features from it, right, and then generating a lookalike dataset or generating more instances of things that I'm short on, right? But again, it's data augmentation in terms of these basic techniques that I think people need to get better with as well in order to level up their skill set.
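For the "flip the car around" example, a minimal augmentation pass might look like the sketch below. It uses Pillow purely as an illustration; the folder names and the particular transforms (a mirror plus small rotations) are assumptions for the example, not anything prescribed in the talk.

```python
# A minimal sketch of basic image augmentation: write mirrored and slightly
# rotated copies of each training image so the model sees the object from a
# few more angles. Directory names are placeholders; Pillow is one way to do it.
from pathlib import Path

from PIL import Image  # pip install Pillow

src = Path("images/cars")
dst = Path("images/cars_augmented")
dst.mkdir(parents=True, exist_ok=True)

for path in src.glob("*.jpg"):
    img = Image.open(path)
    img.save(dst / path.name)                                                        # keep the original
    img.transpose(Image.Transpose.FLIP_LEFT_RIGHT).save(dst / f"flip_{path.name}")   # mirror image
    img.rotate(10, expand=True).save(dst / f"rot_p10_{path.name}")                   # small rotations
    img.rotate(-10, expand=True).save(dst / f"rot_m10_{path.name}")
```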
Fabiana: Yep. I definitely think this is a skill that, as you've mentioned, is more and more commoditized as part of the data science skill set, let's say. But definitely, in computer vision, it's far easier to understand the benefits of it. We can see it right away, or we can perceive it, I guess.
Dan: Yeah. Visually, you can look at it, right? I can look at this and go, "Oh. Now I understand. Now I have two versions of a car." Right? But that doesn't mean there's no value in it outside of the visual examples, in structured data, lots of rows in a database, the fraud example, right? If, again, I've only got 30 different examples of the kinds of fraud I want to detect, that's still not enough data points. I'd probably want to generate slight variations of that, right, or flip it around in order to give the model more to sink its teeth into, right? So these kinds of things are crucial. You're right that it's commoditized, but it's something that, again, people are just going to have to continue to get better at, and assume that it extends out to a lot of different things.
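One common way to "generate slight variations" of a rare class in tabular data is SMOTE-style interpolation. The sketch below uses the open-source imbalanced-learn package as an illustration; the file name, feature columns, and label column are hypothetical.

```python
# A sketch of oversampling a rare fraud class by interpolating between the few
# positive examples (SMOTE). The CSV name and column names are illustrative;
# tune sampling_strategy / k_neighbors for the real dataset.
import pandas as pd
from imblearn.over_sampling import SMOTE  # pip install imbalanced-learn

df = pd.read_csv("transactions.csv")                  # hypothetical extract
X = df[["amount", "merchant_risk", "hour_of_day"]]    # assumed numeric features
y = df["is_fraud"]                                    # e.g. ~30 positives among many negatives

# Interpolate new minority-class rows between neighbouring fraud examples.
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)
print(y.value_counts(), y_res.value_counts(), sep="\n\n")
```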
Dan: So, look, let's move on to the fourth one here, which is the tools, right? The problem is that most of the tooling-- and this is going to be somewhat clear in the AIIA report; that's not a bad thing, it's just the way the world has developed-- is that 80% of the tools have been focused on experimentation through model deployment and monitoring, right? And that's because of the way that we've been doing this. And even if we're spending 40, 50, 60 percent of our time on the data side of the house, the tools weren't necessarily there. There are some tools around for it, but really, having tools that are able to go through and manipulate these types of things is totally essential. And again, the vast majority are focused on that experimentation. They assume the data is clean. Most of the tools are like, "Import the data once it's fixed, and you're going to keep it." But there's really no focus on the wrangling, the cleaning, the transforming, the augmentation stage, right? You'd have to write your own code to do it.
Dan: And the other kinds of things you're going to see-- we talked about this with Pachyderm. Pachyderm is able to do a couple of things. One is it's able to do complex data transformations. And that's because it has decoupled logic. This is super critical to understand. If you look at something like Airflow, you look at something like Flyte, you look at something like ZenML, you look at all these tools-- great tools, fantastic tools, Prefect, all these tools are useful. The thing is they're monolithic orchestrators, okay? They have to think of all the different ways that you might push, pull, drag, flip data, right? All those different things. And they have to bake that into the pipeline. They have a few different ways that you can extend it if they didn't think of it. With Airflow, you can write your own Python code. But they want you to have a second skill set, which is packaging it into a library. It's not very robust, right? So really, you can't color outside the lines. That's the challenge with these kinds of things, right?
Dan: Pachyderm's logic is totally decoupled, okay? And that means that at each step along the way, the data comes in, and then there's a transformation step. And that transformation step is running in a container. I can put any kind of code in there that I want. If I want to write a bash script to flip all those images around, I can do that. If I want to write C code, if I want to write Rust, if I want to write some Python, I can do that. And I can version it. So there's that decoupled logic. Where Pachyderm and these kinds of tools really excel is when there's a complex data transformation: you've got to pull data from five different sources and enrich it. If you could just do a SQL query with dbt, run it, and do everything with joins in your Snowflake, you don't need Pachyderm. You don't need data engineering, right? But if you need to pull it out of Snowflake, grab data from Redshift, pull in sensor data from five other places, enrich the data, add five fields, and push it back in there, okay, that's where those complex transformations are. And those monolithic toolsets, by the way, are still totally useful for orchestrating a lot of these standard steps, DataIQ and all these kinds of things. All of these are useful in practice, but having this sort of decoupled logic means you can basically think up anything.
Dan: And when you go back to that example from earlier, where if I wanted to transform the audio, I want to go and write a bunch of code to take my samples, grab a set of samples, feed them together with the things that I have-- and maybe one of the steps is I've got to transcode the audio from WAV to MP3. And there's a C library out there that can do it super easily. It makes no sense to rewrite that in my own Python code. That's ridiculous and a total waste of time. But with Pachyderm, I can grab that library, put it in a container, write some wrapper code around it, execute it, and let it do its thing. And that's one step. Now it's gone from WAV to MP3. Now I write the next set of code that takes all my background samples, iterates through them, and does an output at the end. So, again, you're going to need language-agnostic tools. You're going to need some decoupled logic, and you're going to need the ability to have really flexible data-centric pipelines. These types of things largely have not been the focus of data science teams or even the focus of data science purchasing. Again, almost 100% of that spend has been on experimentation and the interface behind it. And I think this really is going to have to change for people to do data-centric AI [inaudible].
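To make the "grab a library, put it in a container, wrap it" step concrete, here is a sketch of what such a wrapper could look like. It assumes the Pachyderm convention of reading inputs from /pfs/&lt;input-repo&gt; and writing results to /pfs/out; the repo name raw-audio, the use of ffmpeg, and the encoder flags are illustrative assumptions rather than details from the talk.

```python
# Sketch of a decoupled transformation step: a thin Python wrapper around an
# existing transcoder (ffmpeg) that runs inside the pipeline's container.
# Assumes inputs are mounted at /pfs/<input-repo> and outputs go to /pfs/out;
# "raw-audio" and the ffmpeg flags are placeholder choices.
import subprocess
from pathlib import Path

IN_DIR = Path("/pfs/raw-audio")   # input repo mounted into the container
OUT_DIR = Path("/pfs/out")        # whatever lands here becomes this step's output

for wav in IN_DIR.rglob("*.wav"):
    mp3 = OUT_DIR / wav.relative_to(IN_DIR).with_suffix(".mp3")
    mp3.parent.mkdir(parents=True, exist_ok=True)
    # Let the battle-tested C library do the transcoding instead of
    # reimplementing it in Python.
    subprocess.run(
        ["ffmpeg", "-y", "-i", str(wav), "-codec:a", "libmp3lame", "-qscale:a", "2", str(mp3)],
        check=True,
    )
```

The next step in the pipeline could then be a separate container that mixes in the background samples, which is the decoupling being described here.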
Fabiana: But this goes back again to the point we shared at the beginning around the teams, right? The multidisciplinary teams. They nowadays don't have a tool where they can be together. This is about having that tool, right?
Dan: Yeah. You've got to have-- and each team may need different sets of tools. That's fine. You can't think of-- you can't think of-- anyone who tells you that they've got the end-to-end data science platform is like selling you a bridge. I'm sorry. And look, there are many folks in the AI Infrastructure Alliance, in full disclosure, who have many aspects of that end to end, and depending on what you're doing, right? If you're doing kind of-- if you need some AutoML where you can throw 100 different things preexisting at it, DataRobot's going to have that thing for you. They're fantastic at these kinds of things. So this is not to take away from any of these tools that sort of say that there's an end to end. But where that end to end tends to be is really ingest data from some different sources, do some basic cleaning, and those kinds of things. And then really get to the experimentation, usually at a structured level, and kind of a visualization citizen data scientist kind of thing.
Dan: These are totally useful, but your team may need FDMA. It may need Audacity. It may need an audio-editing tool. It may need Pachyderm to do the data-centric AI. It may need Prefect to help orchestrate some of those things. And then Pachyderm can even run Prefect Core in a container as one of the steps, right, in order to do that transformation. Then you're not rewriting the code. You're just calling Prefect as one of the steps inside of Pachyderm. So it's necessarily going to need different toolsets. And again, when we talk about the openness of the team, we need the openness of the organization to understand that the market hasn't matured to the point where nobody ever got fired for buying IBM, like in the old days, or nobody ever got fired for buying Microsoft or Intel. We're not there yet, where there's the one platform to do 90,000 things. We're going to need a best-of-breed solution to all of these things. You're going to need to trust your teams and give them the tools that they need.
Dan: And then the last thing I'll touch on, and this is sort of part and parcel: versioning and lineage. Versioning and lineage is something where, as you're iterating on the datasets, you're going to keep more track of them. And I think teams are going to learn this even more. Honestly, if I'm being 100% honest, when we talk about data versioning and lineage, it's often a late-stage problem for where data scientists are a lot of the time today, right? By the time they realize this problem, they're like, "Oh my God. This is horrible." Right? Three months later, a regulator comes in and says, "Well, we don't like the data that you used. You violated XYZ health policy or XYZ policy. You've got to go back and strip all that data out of the model." And they're like, "Oh. Oops. We've already iterated on that data 12 times." And they're in trouble. They're straight up in trouble. And we're talking about existential-level-event trouble, right? We're talking about getting-fined trouble.
Dan: But it's not just that level of trouble. It's also that, if you can't go back to the model and the code and recreate the thing that you did-- if I've got a million JPEGs in a directory and I, as the IT person, go ahead and crunch them down from 1024x768 to 512x512 and overwrite them, guess what? Your model may not perform as well. And there's nothing you can do about it unless you go to the backup. So having these iterations, especially as I'm now pushing and pulling the data, adding new samples, right-- and what you don't want is this crap of, "Well, we're just going to clone the directory 50 times and call it Unknown Version 6 and Unknown 456, and now how am I going to know what's in that data?" You want something that's able to track the lineage and understand this model with this set of code at this point in time with this exact dataset. You want it to be deduplicated. Pachyderm's got a copy-on-write file system that does this. Some other tools are doing this too, mostly by copying the data. That works okay for structured data. But the problem is if you've got one-gig video files and you're making 35 copies of those, I hope you have a lot more money for your Amazon bill. So having deduplicated versions where you've only changed five bits, that's super useful. So these kinds of things, again, the toolset is going to become important as people think about them. And the problem is coming like a hurricane, but people don't see it yet until they start to move in this direction. So I think we've gone into a lot of that.
Dan: So I'm going to go right into five here, right? And this is another thing that we're kind of stealing from the past, right? Things that already work in today's traditional programming shops: building tests, integration tests, unit tests, smoke tests, all these kinds of things, but for the data itself, right? There's been a change to the data, and then the model's performing poorly. Or we ingested a bunch of synthetic data-- again, you'd better hope you have versioning, because you just retrained it five times. And wait a minute, synthetic data is an imprecise science at times, right? Maybe you generated a bunch of samples that didn't work right, right, or [crosstalk] affect model accuracy. You've got to go back. If you have to go to the backup tapes, like it's 1985, right, then you're in big trouble, right? So you've got to make sure that you're writing these unit tests at each stage so you understand data quality. I know that you folks have a ton of focus on data quality and things like that at this point. This is super important, right? Do you even know which kinds of examples you have over-representation and under-representation of? Do you have a test built in for these kinds of things? Can you test for it? When new data gets added, do you understand the implications? These kinds of things. You're going to have to start building unit tests like a real IT development team.
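A sketch of what "unit tests for the data" might look like in practice: pytest-style checks that run whenever new data lands, before any retraining. The file name, the label column, and the thresholds are illustrative assumptions.

```python
# A sketch of data unit tests that run before retraining. The snapshot path,
# the "label" column, and the thresholds are placeholder assumptions.
import pandas as pd
import pytest


@pytest.fixture
def df():
    return pd.read_csv("training_snapshot.csv")  # hypothetical dataset under test


def test_no_class_is_badly_under_represented(df):
    shares = df["label"].value_counts(normalize=True)
    low = shares[shares < 0.05]
    assert low.empty, f"Under-represented classes: {low.to_dict()}"


def test_missing_values_within_budget(df):
    missing = df.isna().mean()
    over = missing[missing > 0.10]
    assert over.empty, f"Columns over the missing-data budget: {over.to_dict()}"


def test_no_duplicate_rows(df):
    assert not df.duplicated().any(), "Duplicate rows detected; check the ingestion step"
```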
Fabiana: Exactly. Exactly. Thinking about the question you want to see answered and testing again and again just to ensure that you are achieving the quality you are expecting, for sure.
Dan: Right. And a human-in-the-loop test at each stage of your development process. These things are totally important too. We'll talk about this, right? Let's look at the labeling example again. I use a lot of visual examples in here because text is boring to look at. But remember, all these techniques apply to structured data as easily as unstructured data, right? But let's say I've got this labeling platform, right? And I've got Snorkel or I've got Scale, and they've got terrific data scientists. A lot of the value in their platforms is not the fact that they can just manually get a bunch of people together and do a bunch of things. You could do that with a lot of different platforms. Their thing is they've got a lot of data scientists who are building these semi-supervised learning systems that can help you quickly auto-label a lot of stuff. But sometimes those systems are really noisy or they don't do it really well, right? That's one thing.
Dan: So maybe you've got an instance segmentation auto-generator where you're like, "Okay, it's going to automagically put a bounding box around what it thinks are errors or flaws in this panel." Right? Or we can think about it in the case of the humans. They might have said, "We put the instructions down," and they go, "Look, go put a bounding box around any of the flaws in this." And you've got 30 different people working on it, or 150 people, or 1,000 people working on it, and they all hear those instructions and interpret them a little bit differently. And so you might not get exactly what you want. And so the example comes back to this human-in-the-loop testing: you've got to have a system that's able to look at the data at different points to understand whether the labels are correct, for instance, or whether you've got additional points that actually make sense. And you may actually want to highlight all these little flaws that are in there that you can see in the second panel, right? Or maybe - again, this is where your domain expertise comes in - you may say, "You know what? We don't care about those because they don't really affect the performance all that much." Maybe these little dots-- I don't know much about solar panels, but maybe you do out there, right? And maybe you know that these six or seven dots that I've highlighted here add up to, like, half a percentage point of degradation. That's not enough to kick the panel off the line. But these big ones matter. Or maybe these things really do do a lot more damage. Maybe they make it 3 or 5 or 7 percent less efficient, right? So you've got to bring in that domain knowledge, and then you've got to go back and unify those kinds of things.
Dan: If you look at something like Superb AI, they've got an example on their site of the human-in-the-loop test where they go and look at all the labels. They overlay them all automatically, right? And it goes, "Oh, there's a 98% crossover. Okay. Well, we're going to go talk to the 2% of folks who did that differently. And we're going to change our instructions." And this is kind of-- we'll get to this in just a second. But the one thing I want to emphasize to people is your job is not to go, "Well, the people who did this are geniuses, and everyone who did this is an idiot. They just don't understand what I'm talking about here." No, that's not the way to approach it, right? When there's a deviation between the two, the correct answer is to say, "What did we explain poorly to the folks who are doing this work? And how can we fix that?" So we'll get to that in a second.
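The kind of overlap check described here can be automated with a simple intersection-over-union comparison between labelers. The box format, the 0.8 agreement threshold, and the example values below are illustrative assumptions, not taken from any particular labeling platform.

```python
# A sketch of an automated label-consensus check: compute intersection-over-
# union (IoU) between two labelers' bounding boxes and flag images where
# agreement falls below a threshold. Boxes are (x1, y1, x2, y2); the 0.8
# threshold and example values are placeholders.
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]


def iou(a: Box, b: Box) -> float:
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def flag_disagreements(labels: Dict[str, Tuple[Box, Box]], threshold: float = 0.8) -> List[str]:
    """Return image ids where the two labelers' boxes overlap less than `threshold`."""
    return [image_id for image_id, (a, b) in labels.items() if iou(a, b) < threshold]


# Example: labeler A boxed the whole iguana, labeler B only boxed its head.
labels = {"img_001": ((10, 10, 200, 120), (10, 10, 80, 60))}
print(flag_disagreements(labels))  # -> ['img_001'], so go back and clarify the instructions
```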
Dan: So this example here is another one kind of snagged from our boy Andrew. Thank you, Andrew, for this wonderful example. But again, this is an example of the bounding boxes being different. Put bounding boxes around the iguanas. Any human could reasonably interpret that in any of the three ways shown here, right? So which one do you actually mean? In the same way that I showed here, what do I mean by "thing"? And I need to go back and clarify my instructions to folks and refine those down to the point that I get what I want out of it. And again, that's a new iteration. You want a version of that. And you want feedback as you go through these different things. And you understand what's happening, right? Oh, there's the slide from our wonderful folks [crosstalk], right?
Fabiana: The consensus. Exactly.
Dan: Yeah. The consensus, right? It shows, like, "Oh." And then it shows, like, "Okay." And then there's a QA process. It says, "Wait a minute. This is an issue over here." Like, "Labeler one put this--" I mean, I hope no labeler does this, but humans are capable of anything. They put it randomly around the fish, whereas this one covers the entire fish, right? And you want to be able to approve or reject. And you probably want some programming ability to do that as well, right? So, look, we'll sum this up here. And then let's get to some audience questions, right? In the future of AI, look, we've always thought of ML code as the cool part of data science. And in many ways, it really is. Let's just be honest, right? It's where the biggest breakthroughs we hear about come from: AlphaFold coming out, and all of a sudden, now we can do all kinds of new bioscience and things that were previously just impossible. That's super cool, right? Or we see a new reinforcement learning algorithm come out, and suddenly we've got robots that are multifunctional, and they're like-- now, I can't wait to have one that's cleaning my house and these kinds of things, right?
Dan: So there are all kinds of these breakthroughs. But look, most teams are not research teams, okay? They're not. They don't have people on there who are writing new algorithms. They have people who are really, really smart and go read the most cutting-edge papers on arXiv. They can stay up to date. They can implement that in code. They can put it in there. But the truth is they're probably not going to be writing their own stuff that's going to improve on that. And so, actually, the algorithm is really a small component of the overall system. In fact, I think we've gotten to the point, because of so much focus on it, that in many ways the algorithms are very refined for a huge chunk of tasks. It doesn't mean for everything, right? There are problems you can't solve with today's algorithms, no question about it. But the question is, is your team going to be able to come up with the algorithm that can solve it? And the answer's probably not. Is OpenAI or DeepMind or 10 other research teams out there? Probably. You're probably going to be consuming a foundation model they've come up with, or a paper they've come up with later on down the line, if the problem's unsolvable.
Dan: But if you're working on neural translation now, or you're working on fraud detection, or you're working on object detection, whatever, these are, in many ways, solved problems. It doesn't mean they're perfectly solved, but there are many ways to solve them with existing algorithms. You're not going to go and update the algorithm. You're going to basically focus on making that dataset better: verifying it, transforming it, augmenting it, adding to it with synthetic data. And that makes your feature extraction better, right? It makes the overall process better, and it unifies your team, right? To me, this is 100% the future of AI. And again, 60 to 70 percent of AI complexity is bound up in the data processing, monitoring, augmentation, handling. Data-centric AI is really just focusing back where we spend most of our time anyway, right? And stop pretending that we're all Ian Goodfellow. We're not. I'm certainly not. And I'm not going to come up with a--
Dan: I mean, Ian talked-- I'm going to segue here. This is hilarious. Ian talked about a point where he had-- it's not hilarious, his story, what happened. But when he had a-- he thought he was going to die at one point when he was very young, right? And he called up his friend, and he brain-dumped all of the ideas he had about artificial intelligence because he wanted to make sure he got them down to someone before he died. I'm telling you right now, if I am dying tomorrow, the first thing I'm not doing is calling my friend to tell about all of my ideas about AI. [laughter] Hey, I'm not that good, and most people out there are not that good, okay? So leave it to Ian and everyone else to come up with that stuff, and you worry about getting good with synthetic data and data processing and versioning and lineage and all this good stuff. And you can really do some amazing stuff out there [inaudible].
Fabiana: I guess it's all about the trade-offs, knowing your limitations, and knowing where to invest your time, because in the end, it's all about time and money and where you will get the bigger return. And I guess, as you mentioned, the majority of the time we are doing data preparation anyway. Why not invest a bit more in the tooling space for exactly that? Yeah. That's a very interesting point.
Dan: That's right. And I think we had--
Fabiana: So--
Dan: We had a question there, right, on data-centric AI and how it improves tabular data analysis. I think we did talk a bit about it during the session. But Fabiana, you've got some good ideas about-- I mean, with the synthetic data, maybe talk about it from that angle, actually. That's probably useful.
Fabiana: Definitely. And I think there are two angles to this question that should be covered. At one level, one of the things we see that could be improved around tabular data is first getting the right understanding of your data. A lot of the time, we kind of assume, as you mentioned, that because structured data seems so clean, everything is fine. And we don't pay enough attention to catching inconsistencies, for example. So some of the things you mentioned in this presentation, as you said, carry over to tabular data. So, first, do your exploratory data analysis. Get your profiling of your data. Standardize it across different teams. Really go in-depth into your data. So adopt standardized ways. Adopt tools to do that for your data science team. And afterwards - well, I'll be a bit repetitive here - adopt new methods to help you overcome those issues. Synthetic data is one of them. Data augmentation, for sure. And again, data quality here will depend a lot on the use case that you have. So be specific about the question that you want to see answered. Be clear to everyone that that's the question you want to see answered as well. So all the points that Dan mentioned today, all six of them, I do think can easily apply to tabular data, in a nutshell.
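As one concrete way to standardize that first profiling step across teams, the sketch below uses the open-source ydata-profiling package; any profiling tool would serve, and the dataset name is a placeholder.

```python
# A sketch of a standardized exploratory profiling step: generate one shared
# HTML report per dataset instead of ad hoc notebook checks. The CSV name is a
# placeholder; ydata-profiling is just one open-source option.
import pandas as pd
from ydata_profiling import ProfileReport  # pip install ydata-profiling

df = pd.read_csv("customers.csv")
report = ProfileReport(df, title="Customers - data profile", explorative=True)
report.to_file("customers_profile.html")  # missing values, correlations, duplicates, etc.
```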
Dan: All of them can, absolutely, right? I mean, look, take an example of unit tests, right? Let's imagine that I've got a bunch of customer data with API keys in there. Unlike a visual example, I'm not going to look at it and understand whether my 75-character API key is in the correct format. But I can write heuristics. Again, I don't need something super fancy. I don't need an ML model to test whether my API-key field conforms to the correct standard. I can write some heuristics, just like we do with credit card forms or whatever: "Hey, is this actually the correct number of numbers and letters? Does it have characters in it that are wrong?" These kinds of things. Understanding that kind of thing, building those tests in. Are there missing fields, for instance? Test for those kinds of things. Or are there fields where half of the data's missing? Should I impute that data? Those kinds of things. All of these things can be applied to tabular data as well. Again, in a presentation, it's boring to look at the tabular examples. A lot of the time, we live in tabular data land, but I like to use the visual examples in presentations because, again, they're easier for us to look at, right? If I put up 10 different API strings, we're not going to be able to eyeball that, though we could write the code to check it. But rest assured, every one of these things applies. If I were thinking about fraud detection, I mean, I might go and think about how to iterate on the different aspects of fraud that are out there, right? I might go and--
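A minimal sketch of those heuristics: a format check on a hypothetical 75-character API-key column and a report of columns that are mostly missing. The column names, key pattern, and thresholds are assumptions for the example.

```python
# A sketch of heuristic checks on tabular customer data: validate a
# hypothetical 75-character alphanumeric API-key field and report columns that
# are more than half missing. Column names, the key pattern, and the 50%
# threshold are illustrative assumptions.
import pandas as pd


def validate_customers(df: pd.DataFrame) -> dict:
    key_ok = df["api_key"].astype(str).str.fullmatch(r"[A-Za-z0-9]{75}")
    mostly_missing = df.columns[df.isna().mean() > 0.5].tolist()
    return {
        "rows_with_bad_api_keys": df.loc[~key_ok, "customer_id"].tolist(),
        "columns_over_half_missing": mostly_missing,  # candidates for imputation or dropping
    }


df = pd.read_csv("customers.csv")  # hypothetical extract
print(validate_customers(df))
```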
Fabiana: Even around the labels. That's a good example, the one you gave. Different people, different analysts, see fraud in different ways. They have different manners of labeling the data. I saw this in the past also with, for example, smart meters. Different people create different labels for the windowing, for example, on time-series data. How can you achieve consistency? How can you analyze that? What are your expectations for the business? So that goes to your point, for sure.
Dan: Right. And look, every piece of data is labeled in some way. But obviously, there are simpler sets of problems where the labels are implied, for instance, right? In other words, if I'm looking at whether the customer clicked or not, that's implied, and it's either yes or no. It's binary, right? And that's in there. So in that case, the labeling is not going to make any difference, right? But the unit tests in terms of cleaning the data, understanding whether the values are actually correct, building heuristics around that-- look, there are lots of little pieces there, right, that can still be incredibly useful in terms of data. Does that mean that every single problem is going to fall to a data-centric AI approach? Of course not, right? There's not a panacea for every single problem. I'm not saying, "Here's your hammer, and it also acts as a screwdriver," right? But I want folks to just open up their minds, think about things in a fresh way, and understand that they may be missing a lot of good solutions to challenges because they've never seen these approaches so far.
Fabiana: It's not like, all of a sudden, hyperparameter tuning stops being important. It's not about that. It's just about investing a bit more in something that has been missing, I guess. That's definitely it, I think.
Dan: That's exactly it. That's exactly it. It's recognizing that your data engineers and everyone else are just as important as the data scientists, and even more so, getting them talking to each other. And maybe we have to go back to those old offices where they'd tear down the walls and everyone codes on bean bags. I hope not. I always hated that, personally. [laughter] I like my own space. But for tearing down the virtual wall--
Fabiana: [crosstalk] a similar opinion as you.
Dan: Yeah. Yeah. And this is the engineer side of me, right? I'm like, "Just leave me alone for two weeks so I can do what I need to do here, right? I don't want a bean bag. I don't want to code with rock music on." But tearing down those walls virtually and allowing people to work together in a cross-functional way is super, super important. So do we have-- I don't know that we have any other questions. You and I could probably talk for an hour and a half. But at the same time, I think we can wrap, or open it up to folks if they've got something-- give people a second chance here if there's anything they're thinking of. I hope that you learned something, enjoyed something. Again, the talk is going to be available after the fact. I think a ton of people end up watching these things after the fact now. We're all busy. And COVID has changed the way we consume information. So if you missed it today, you're still going to have a version that you're going to be able to download and use. And hopefully, it's going to give people insights that they were just missing.
Fabiana: Yeah. For sure. And, well, just like we heard from you today-- of course, you always have one more shot to ask a question. If not, we are eager anyway to hear your definition of what data-centric AI is, over at the Data-Centric AI Community. So if, after this session, you are now clearer on it and would like to give us your opinion, feel free to submit it at [contacts?].datacentricai.community. We are eager to hear from you.
Dan: Awesome. Come and make sure you're in the Data-Centric AI Community. You're probably already in it if you're on the webinar. If you're not, then be there. It's one of the newest communities inside the AI Infrastructure Alliance. We're kind of a community of communities, and we'd love to have your help if you're out there working in data engineering or data science. By the way, at the AIIA, especially on the data science board, we're doing some really cool projects. Thanks, everyone, for your time. It was most appreciated. And we hope you have a wonderful day.
Fabiana: Yeah. Thank you.