
MLOps Innovator Series: The Role of Synthetic Data for Data-Centric AI

Fabiana Clemente

Co-founder, YData

Data-Centric AI has emerged as a new methodology in the last year which aims to classify data as the focal point for AI and ML. But with data, teams must examine a number of different issues. How do you source the right data? Can you source enough of the data that’s relevant?

Join us as we chat with Fabiana Clemente at YData on how teams can use synthetic data to embrace a Data Centric AI mentality.


Data-centric AI is a paradigm shift on the horizon. MLOps teams are recognizing the weaknesses of focusing entirely on the functionality of their models and seeing the importance of treating their data as first-class citizens in the machine learning life cycle.

Yet, alongside the realization above, emerge new challenges. How do you ensure that your data is of high quality? What if you don’t have enough data? In this webinar with Fabiana Clemente of YData, we explore the answers to these questions and the role synthetic data plays in the solution.

Pachyderm: The Foundation You Need for a Data-Centric Approach to AI

The use of synthetic data in the industry highlights the need to ensure data quality. Organizations can stay ahead and start adopting the data-centric approach to AI by leveraging the power of Pachyderm. Pachyderm’s data-driven ML pipelines and data versioning can help your MLOps team scale their operations, automate processes and evaluate the impact of data on model performance. See Pachyderm in action by requesting a demo today!

Webinar Transcript

Dan: Hello and welcome to another edition of the MLOps Innovator Series. And I am on with Fabiana from YData. Say hello.

Fabiana: Hi, Dan. It's a pleasure to be here.

Dan: Good to have you on the show. And today we're going to talk about the very exciting topic of synthetic data, which you know all kinds of things about. But maybe in the beginning, just tell everyone a little about yourself, and your history, and how you got into this exciting field.

Fabiana: Yeah. Sure. So well, I'm based in Portugal, or at least for the moment. But my passion about data and this area started when I decided to take applied maths in university. So back then, the field was not called data science, so you could call it multivariate statistical analysis. But that was my first experience with data and understanding that data can give you pretty cool insights about what happens in reality and how you can take decisions. From there, it was an obvious path to go. So I did have the passion for computers, so why not? In terms of professional life, I started working right away with data in the data space, and the interest just kept growing. Well, in the end, I just became more eager to learn more and more of the machine learning space. So I started with statistics, then of course, started studying the ML space. And in recent years I started searching and researching a bit about deep learning. That was when definitely I started also looking into the space of synthetic data, not only because I did have some challenges around data access by that time while working for other companies and organizations, but also because synthetic data can be so versatile. So definitely, that's just made my interest grow and grow, and well, here we are today, with YData, a company dedicated to the space.

Dan: Well, I took the exact opposite path; you did applied maths, and I went to New York University and I did-- I went to what I call the liberal, liberal arts college, which was I went to the liberal arts college and then I thought, "No, that's too structured. So I'm going to go to this super liberal arts college, The Gallatin Division," which is, make your own major. So I studied everything from film to literature to computer science to philosophy to history, to everything else that I felt like taking at the time. [laughter] And strangely enough, that was a pretty cool way to teach me about critical thinking, which kind of brought me into technology and then brought me into the space in a very roundabout way. [laughter] But there wasn't any applied maths at all, I'll tell you that. That was maybe one of the cool things that I skipped. So let's talk a bit about--

Fabiana: I can understand why. [laughter]

Dan: Yeah, yeah. You're saying philosophy and filmmaking don't go along? Although nowadays, maybe they do go along with them, right? I mean, there's all computer graphics and the special effects. There's a whole engineering department that started back with Jurassic Park, and those things too.

Fabiana: Yeah, yeah. Totally, totally. So maybe it makes total sense. Destiny has weird ways to make things work out, so. [laughter]

Dan: Yeah. And maybe I'll figure out the formula for a successful story. You just apply this formula and it creates a blockbuster novel and a blockbuster movie, and [inaudible]. [laughter]

Fabiana: Yeah. Totally.

Data Quality for Machine Learning

Dan: I think we may be a little far away from that. But let's talk about the problem of data in general, before we get into synthetic data and why it's important. I mean, tell me a bit about the problem we're trying to solve. Why is it hard getting good data? Tell me about data quality. What's the real challenge here?

Fabiana: Yep. It's interesting that in the latest years we have been hearing about machine learning and how machine learning can bring a lot of change, and we have this idea that with bigger storages, in the latest years, more and more data is easily available with internet, the cellphones. So everything generates data nowadays, so why not assume that we have more than enough data in order to invest more and more in machine learning? Definitely, this is true, but-- and what is more interesting-- and this takes me to the place that I want to further explore, which is the side of the data quality. So everyone knows that saying that, if something bad gets in, something bad gets out, when talking about machine learning. So that's something everyone knows. But surprisingly or not, data is probably what I would call the most undervalued and deglamorized aspect of AI. So we see a lot of attention in machine learning models, we see a lot of attention in the tooling that make these models work but we don't give enough attention to the data itself.

And this takes me back to one article I read, I think it was last year, from Google, which is about the high stakes in AI of data. It reflects really the importance of taking care of data quality of each and every step, and how each and every step can deeply impact the results you get from data. So not always it means that having a lot is good. Sometimes having the right data with the right quality is what matters for your machine learning development, and that's what we see with this new trend, amid this new start of this more data-centric approach in the space of AI. I believe that's exactly the concept we see rising nowadays. So how can we put more effort to getting the data right and in a better shape with better quality in order to get a best or a better machine learning model? So I'm a huge fan of this mindset, and I guess that's what exactly I've been pitching and dedicating myself in the latest years, let's say.

Dan: I think Andrew is rating in the marketing better than any of us these days, but he's a--

Fabiana: For sure. [laughter]

Dan: --but he's a brilliant--

Fabiana: He's brilliant. [laughter]

Dan: Right. He's a genius. So I guess we give him a little bit of extra credit.

Fabiana: For sure. 

The Challenges of Inadequate Data

Dan: So you touched on one thing that's important, which is that you can't always get the data that you want, or, as you referenced, garbage in, garbage out, as we usually say it, right?

Fabiana: Yeah.

Dan: When you take something like AlphaGo in the initial thing, the data is relatively easy to obtain, right? There's a reason you start with games. I've got 100 years' worth of incredible top-notch Go players who play games, and it's very simple data to represent. I've just got a number of different position statements and such. And it's relatively easy to label in terms of understanding who won and who didn't. That's a very clean dataset, and so naturally, you focus on the algorithm, but if you're talking about something that's more challenging; if I've got a factory, and I've got widgets, and I'm trying to detect broken widgets, I probably have a lot of pictures of working widgets and not so many broken ones. And that's a challenge, and so that kind of gets us into this synthetic data challenge. Why is synthetic data important to this concept? If you're trying to handcraft a smaller dataset - and Andrew talks about this a lot - where not everybody is Google, not everybody has a gazillion lines of data streaming in every second, most problems are going to be smaller than that. You're going to have to handcraft it. Can having synthetic data kind of help with that and get us to a better solution?

Fabiana: Yep, and that's a very interesting example, especially the one that Andrew brought about, the failures in development. And definitely, we don't have that-- not all organizations have a lot of data, for sure, and they have to work with what they have. In that space, I do believe synthetic data can take a huge role. But of course, we can see synthetic data in different levels, and different levels of synthetic data can help in different ways. For example, let's say, for a small dataset that we want to augment in order to get a better signal, synthetic data can be the answer to that. So you just grab the representation that you already have of the problem, you have a synthesizer that learns this representation and afterwards is able to generate more elements of a space that is close to the real representation. So combining together the real one with the synthetic, you kind of get that augmented signal you need in order for your machine learning models to perform better. That's one example, but of course, there are many more examples, which we already see in the markets, especially in the image space.

So self-driving cars is probably the most interesting space where we have hundreds of examples where synthetic data is used in a very smart manner, so you don't always have to collect all the data for all the cases or possibilities. Sometimes you just need to get a proxy kind of dataset. Let's say you have a car, a representation of the car, or the behavior of the car during the summer. Now you want to get pictures that would copy the behavior of the car during the winter. That's possible with synthetic data. Of course, it's not exactly what you see in real life, or what you would get from getting real data, but it's close enough and spares you from having all this heavy work of collecting the data, even waiting to take the right pictures or get the right information. So I just see a lot of benefits of using synthetic data, especially for businesses that don't have the luxury of waiting much more time for more data, or of dealing with all the problems of collecting more data, storing it, then cataloging it, and so on and so forth.
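The augmentation idea described in this answer (learn from the few real examples you have, generate new points close to them, then combine real and synthetic) can be illustrated with a very small sketch. This is not YData's synthesizer: it swaps in simple interpolation between pairs of real minority samples, in the spirit of SMOTE, and the widget measurements, function names, and class sizes are all hypothetical.

```python
import random


def interpolate_minority(samples, n_new, rng):
    """Create n_new synthetic points, each lying on the line segment
    between two randomly chosen real minority samples."""
    synthetic = []
    for _ in range(n_new):
        a, b = rng.sample(samples, 2)
        t = rng.random()  # position along the segment from a to b
        synthetic.append(tuple(x + t * (y - x) for x, y in zip(a, b)))
    return synthetic


def balance(majority, minority, rng):
    """Return the real minority samples plus enough synthetic ones
    to match the size of the majority class."""
    n_missing = max(0, len(majority) - len(minority))
    return minority + interpolate_minority(minority, n_missing, rng)


# Hypothetical widget measurements: (width_mm, weight_g).
rng = random.Random(0)
working = [(rng.gauss(10.0, 0.2), rng.gauss(50.0, 1.0)) for _ in range(200)]
broken = [(rng.gauss(11.5, 0.4), rng.gauss(46.0, 1.5)) for _ in range(12)]

augmented_broken = balance(working, broken, rng)
print(len(working), len(augmented_broken))  # 200 200
```

Interpolation only fills in the space between points that were actually observed, so it strengthens an existing signal but cannot invent cases the real data never covered.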

Applying Synthetic Data for Machine Learning

Dan: So you gave me an example in terms of self-driving cars, right? And that's an example where Lyft, and Uber, and Waymo, and all these companies have been working on and have built these gigantic simulators, and they've probably driven more miles in the simulation space than they've driven in the real world because you can only drive so many miles in the real world, right? So there's a scaling problem. Of course, the challenge starts-- but that begs a different question, right? Which types of problems are good for synthetic data? If I can build a simulator that gets me pretty far, and so in the field of robotics, that's pretty useful. But at the same time, there's challenges to it, right? I can't perfectly simulate the real world. If I had that level of computing power to be able to do perfect physics, then I'd probably be able to have an all-powerful AI anyway because I'd already be at a superhuman level of computing power. So there's limitations on that. But where else is there kind of success, and where are the limitations? I would say historically, for instance, IBM Watson Health had kind of a historical debacle in terms of trying to simulate health data, and it just wasn't accurate, right? They were trying to simulate disease data, and it just wasn't accurate. Can it help in the healthcare space? Can it help in-- where is it really successful in the types of synthetic data we can generate? And where do we have a long way to go? And where are we sort of in the middle?

Fabiana: That's an interesting question. Definitely, in the health sector-- bringing in the health sector because you gave a very good example with IBM. The health sector is one of the sectors where we have harder access to data, which means that the majority of the time, we don't have the full picture of the use case we want to tackle, either because the data is really too small or because it's very hard to capture exactly what we want. A lot of things are not so objective. I'm not talking about just images, I'm talking about also what happens between patient and doctor, and that's very hard to capture. But one thing for sure, synthetic data in that space is very good at unlocking new data in the health sector space, for example. Data that previously was not available because of security and privacy concerns now can be unlocked through the use of synthetic data, because well, of course, you don't have access to the real patient's behavior, you don't have the real CT scans or so, but now you have something that is similar to the reality, has the usefulness of a real dataset, and that you can use for further research to bring healthcare to the next level, for example.

It's also being used in combination with federated learning, for example, exactly for those purposes. But where I do feel it is bringing in, already demonstrating a lot of value, is in spaces where the use of machine learning is very well developed. So if you go and see financial sectors, synthetic data can play a huge role. Of course, from the privacy aspect of things, the same as the healthcare, but also, in the space of, for example, helping and better detect fraudulent behaviors or to debias or balance some datasets, or even to create simulations and help in better calculations or credit at risk or risk analysis in general. So this is where you already see heavy usage of synthetic data in different types, of course, of synthetic data. That's where I do think it's the most developed area. But I do believe that we are still in the baby steps of what synthetic data can bring and can offer to the healthcare sector, for example.

Dan: And do you see a lot of work in terms of structured versus unstructured data? So there's the Unity platform that can generate kind of photo-realistic images, but do you see a lot of synthetic data working in generating images and video, and those kinds of things, or do you see it primarily in the textual kind of structured data-based style stuff? Or is it kind of a mixture of both, so far?

Fabiana: To be honest, it's far more developed in the image space. And I guess for one main reason: for a human being, it's far easier to understand whether the synthetic data has value by just looking at the data. So when you look at the image, right away you understand whether that synthetic data has quality or not. When it comes to the more transactional or operational databases, synthetic data is in baby steps. So it's starting. So you already have a lot of development, it's already been used a lot. But it's harder to convince people of the quality, or that it's a good replacement, or a good option to combine with real data, exactly because it's harder for you to prove the privacy, although it's definitely there, the proof. And also, it's harder to understand the similarity, or the usefulness. And I guess that's why you see a lot of development in that area. But still, organizations are starting to adopt it, though not in as structured a way as you see in the image space or video space, for example.

Dan: Okay. That's interesting. I almost thought you would say the opposite, that they would be further along with kind of doing things in the structured data. But it makes sense that-- unless we are talking about kind of generating a CAT scan or a tumor, then you need an expert to be able to look at that and understand. And that's different. But if I'm generating pictures of dogs and cats and people walking down the street, I can sort of intuitively look at it, even with an untrained set of labelers, and kind of get a sense of that uncanny valley. Does it make sense? Does it look real? Is someone's head floating in this direction? And they're kind of-- feet aren't touching the ground. There's a sense that we know what's right, and we can kind of quickly examine it. So that's fascinating. And actually, it makes sense, now that you kind of frame it in that way.

Fabiana: Exactly. Exactly. Well, I want to share one perspective that once got me thinking, because sometimes we are so worried about having real-looking synthetic data because that's what we trust. But for example, I do remember it was, I think, Uber that shared with us an article about GTNs, in which basically, they showed that sometimes for the models to understand whether that's a car or the type of object you are synthesizing, the image doesn't need to make sense to the human eye. So essentially, if you look at the data generated by their synthesizer, it does not make any sense to the human eye. But the data has a lot of quality for machine learning purposes, for example, which is kind of a different perspective from what you would have thought made more sense.

Dan: Yeah. Well, so let's think about a little bit of-- let's kind of jump into the process of generating the synthetic data itself, right? And I think you and I have talked about this in the past, a little bit of like-- I think you equated it to there being three different levels of generating synthetic data, but. So what does it look like to generate it? And how do we get there so that we've got kind of effective data that we can start to use in training our models and doing cool things?

Fabiana: Yeah. Yeah. Yeah. As I mentioned in the past, we do have what I consider the three different levels, and each level, I do believe, delivers different value for different use cases. I guess, the first level is what we are all familiar with, which is about dummy or test data. So that's data that you generate based on some characteristics, like you're generating names, emails, and so on and so forth, but it does not need to hold a real value. It means you are not going to use it for machine learning purposes or to extract information or insights from it. You are just testing out and checking whether some of your rules or some of your services will still hold given a certain type of characteristics of the data. That would be the first; and this one is probably very familiar to everyone. Then we have what we call the business-rule-driven kind of synthetic data. So here you have a bit more context. Your data is not so dummy, so it's a dataset that you already have developed based on your business definition. So you already know some behaviors, you already know some distributions, you are setting variable by variable, what would be the expected behavior from your variables? And you generate new datasets based on that.

But what does it miss here in this approach in order to be used for machine learning purposes? Usually this data is useful for stress tests, a business specifically, or to run some BI validations, for example, but rarely can be used for machine learning purposes because you are missing the interrelations that exist between variables at the first and second and third level of the relations. Those causality relations normally are hidden and are very hard to express in a single distribution, if I'm making sense. Then you have what I call the third level, the one that we at YData are dedicating ourselves to, which is to mimic the real data behavior. So when you are generating this synthetic data, you are essentially guaranteeing that you mimic all these relations for first, second, third level, and so on and so forth. You ensure that all the causality effects, or cause effects, that you find within the datasets are kept. So this means that your data is useful for machine learning insights, which unblocks use cases around privacy, but also around things like debiasing datasets or augmenting datasets or even balancing out datasets, for example.
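The difference between the second and third levels described above can be seen in a toy experiment. This sketch assumes a hypothetical two-column table (income and spend) and uses a fitted bivariate Gaussian as a stand-in for a real synthesizer, which would typically be far more flexible. Sampling each column from its own distribution keeps the per-variable behavior but loses the relation between variables; sampling the columns jointly keeps it.

```python
import math
import random


def pearson(xs, ys):
    """Pearson correlation between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


def mean_std(xs):
    """Mean and (population) standard deviation of a sequence."""
    m = sum(xs) / len(xs)
    return m, math.sqrt(sum((x - m) ** 2 for x in xs) / len(xs))


def make_real(n, rng):
    """Hypothetical 'real' table in which spend depends on income."""
    income = [rng.gauss(50.0, 10.0) for _ in range(n)]
    spend = [0.6 * i + rng.gauss(0.0, 4.0) for i in income]
    return income, spend


def synth_per_variable(income, spend, n, rng):
    """Rule-driven style: each column is sampled from its own fitted
    normal distribution, independently of the other column."""
    mi, si = mean_std(income)
    ms, ss = mean_std(spend)
    return ([rng.gauss(mi, si) for _ in range(n)],
            [rng.gauss(ms, ss) for _ in range(n)])


def synth_joint(income, spend, n, rng):
    """Joint style: fit means, spreads, and the correlation, then sample
    both columns together so the relation between them is preserved."""
    mi, si = mean_std(income)
    ms, ss = mean_std(spend)
    r = pearson(income, spend)
    out_income, out_spend = [], []
    for _ in range(n):
        z1, z2 = rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)
        out_income.append(mi + si * z1)
        out_spend.append(ms + ss * (r * z1 + math.sqrt(1.0 - r * r) * z2))
    return out_income, out_spend


rng = random.Random(42)
real_income, real_spend = make_real(5000, rng)
ind_income, ind_spend = synth_per_variable(real_income, real_spend, 5000, rng)
jnt_income, jnt_spend = synth_joint(real_income, real_spend, 5000, rng)

print("real        :", round(pearson(real_income, real_spend), 2))
print("per-variable:", round(pearson(ind_income, ind_spend), 2))
print("joint       :", round(pearson(jnt_income, jnt_spend), 2))
```

The real and jointly sampled tables should show a correlation of roughly 0.8, while the per-variable table should sit near zero. A model trained on the per-variable table would therefore never learn that spend follows income.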

Dan: So there's a couple of things that immediately spring to mind. One is, there's a bit of a chicken-or-the-egg problem in terms of generating synthetic data, right? Now it's, if I have a machine learning model, what I'm looking for is, generally, insights that humans perhaps haven't had in terms of that data, right? I'm looking for the types of-- when I think about this, I think about something like the problem of spam versus ham, right, which is one of the classic machine learning problems, right? So I'm old enough - I'm dating myself here - to remember before we had kind of Bayesian filters, where folks were trying to build spam engines that were just based on rules, right? And so the humans would come in and they go, "Wait a minute, somebody just emailed me, 'Dear friend.' How dare they? I'm going to create a rule. If it says, 'Dear friend,' in the subject line or some variation of that or if it's bright red or all caps." And they create this towering inferno of rules that are about 60 to 70 percent effective and then fall apart. And then once they started to use Bayesian statistics, those started to look at the corpus of ham and spam and it started finding all kinds of different tokens in there that were very different than what the humans saw. It might find a hex code for certain shades of red, or something like that, that a human didn't think to look for. So in other words, the challenge, of course, that I'm getting at here is that if you already kind of know what you're looking for, then you can theoretically generate perfect synthetic data. But if you don't know what you don't know, how do you deal with this problem? How do you generate something that's realistic but you don't necessarily have all the features of the things that you're looking for?
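The contrast Dan draws between hand-written rules and Bayesian filters can be sketched in a few lines. The example below assumes a tiny, made-up corpus and a plain naive Bayes classifier with Laplace smoothing as a stand-in for the Bayesian filters he mentions; the messages, class name, and function names are hypothetical.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def rule_filter(text):
    """Hand-written rule: flag only messages containing 'dear friend'."""
    return "dear friend" in text.lower()


class NaiveBayes:
    """Minimal token-based naive Bayes spam classifier."""

    def __init__(self, spam, ham):
        self.spam_counts = Counter(t for m in spam for t in tokenize(m))
        self.ham_counts = Counter(t for m in ham for t in tokenize(m))
        self.vocab = set(self.spam_counts) | set(self.ham_counts)
        self.prior_spam = len(spam) / (len(spam) + len(ham))

    def _log_likelihood(self, tokens, counts):
        # Laplace smoothing: add one to every token count.
        total = sum(counts.values()) + len(self.vocab)
        return sum(math.log((counts[t] + 1) / total)
                   for t in tokens if t in self.vocab)

    def is_spam(self, text):
        tokens = tokenize(text)
        log_spam = (math.log(self.prior_spam)
                    + self._log_likelihood(tokens, self.spam_counts))
        log_ham = (math.log(1.0 - self.prior_spam)
                   + self._log_likelihood(tokens, self.ham_counts))
        return log_spam > log_ham


# Hypothetical training corpus.
spam = [
    "dear friend claim your free prize now",
    "free prize waiting claim now",
    "win money now free offer",
]
ham = [
    "meeting moved to friday afternoon",
    "please review the attached report",
    "lunch on friday with the team",
]
model = NaiveBayes(spam, ham)

message = "claim your free offer now"
print(rule_filter(message))    # False: the rule only knows 'dear friend'
print(model.is_spam(message))  # True: learned tokens such as 'free' and 'claim'
```

The rule catches only the pattern a human thought to write down, while the statistical model picks up whichever tokens separate the two classes in the data it was given.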

Fabiana: Yeah, that's a very interesting one and, I would say, one of the topics where we have to explore a lot more around synthetic data. It's probably one of the fields to be evolved in the space of synthetic data. So so far, we were very focused on ensuring we could mimic what we already know and exists, okay, with a certain level of variability, but now we are starting to get into the world of what we call simulation or generating of simulated buffs or behaviors using synthetic data. So you already saw this happening, for example, for COVID and for the development of vaccines. It was probably one of the first. But you also use [inaudible] doing the same. So they used generative adversarial nets exactly to simulate certain physics behavior, and well, based on the different simulations, they kind of get what would be an expected pattern or expected behavior. That's where I do feel like there are limitations nowadays, but definitely, I believe the research will bring us forward in that space. How can we try to understand what we haven't yet seen based on a combination of conditions or scenarios that we know might happen? That's what I feel like the synthetic data space can evolve to.

Dan: And so then there's also-- if I'm doing sort of a statistical analysis and I'm kind of creating permutations of the data, do I really get an advantage? In other words, if I just generate more of the data that's similar, how is that necessarily getting me a more robust model? Because essentially, don't I just have more of the same? And Google's kind of shown that problem, that the algorithm almost doesn't matter in the end; if you have enough data, they all kind of converge to the same solution. Maybe this one gets there much faster, maybe this one is real slow and kind of crappy, but gets there later, but they all sort of converge. Am I just generating a ton? At some point, does it-- obviously, if I don't have enough data and I need more permutations of it, that gets me to something that's more robust. But if I already have a decent amount of data and I generate more permutations, is that actually giving me fresh insights? How do you sort of just deal with that problem?

Getting Closer to Reality with Synthetic Data

Fabiana: To be completely honest, and I'm being transparent here, the use of synthetic data for augmentation, it's a trade-off balance. So definitely, if you have not enough and you just create those permutations out of the real world that you know, you are creating value and definitely you are augmenting, let's say, the signal you want further models to learn. But if you already have enough data and you are pretty sure that your reality is very well represented by that dataset, do I think synthetic data will make the difference? I don't think so. Unless you tell me you don't have access to the dataset with the granularity that you need, you don't have all the variables because of privacy, for example, that's when I do think if you have enough data, synthetic data can be useful. Because otherwise, yes, if you have already enough, a robust dataset with the quality you would expect, to be just adding more synthetic data, you won't feel the gain. So as you mentioned, there is a tradeoff, and there is a certain threshold where the amount of data may have an effect on the result. After that, it's more of the same.

But let's say that you feel like your reality is not well represented and you might be missing some cases, that's when strategies around the area of synthetic simulation can help you out in getting a more robust and case-proof model. So you are generating cases that were unseen in the real data, but you are simulating them with synthetic data so you can ensure that your model won't fail or break if that type of pattern is in the future identified. That's where I feel like if you have enough data, but you feel like you are missing some use cases, or some points, or some populations, that's where synthetic data can help; so to generate those that are unseen, and you are able to simulate them.

Dan: How much expertise do you really need to be able to generate certain types of things? Again, if we think about-- I saw the algorithm that was working on, detecting trash on the streets, right? And so generating images of various bits of trash hanging around is a relatively straightforward process. If I'm talking about generating images of complex skin tumors or things like that, I would imagine it's a lot more challenging and you need a whole fleet of domain experts to kind of come up with that. And then if I'm generating marketing data, I can probably work with the stuff that I kind of already have, and sort of make permutations of it. So where is it really sort of challenging to kind of create the stuff, and where is it a bit easier and kind of perfectly in your wheelhouse and everyone else's wheelhouse?

Fabiana: So the expertise lies in getting the algorithms that synthesize data. Especially for tabular, they are not straightforward. So there is a lot of technical complexity around them; getting them stabilized and working well for different behaviors within the data. Because for example, if you go and check images, the distribution of images is fairly easy, mathematically speaking, so you are just talking about the Gaussian distribution, and that's very well behaved. But that does not happen within the space of, for example, tabular data, or even sequential data, like sensors. So we bring the expertise to synthesize those types of data.

But for example, when you are talking about the generation in a particular vertical, you have to combine what we call the core of data synthesis with business rules or business definitions or input from the business users or the data scientists that own the business logic. That's what I believe is a good principle of synthetic data generation, or even doing data science in general. Just because you have access to the best algorithms in the space of, let's say, machine learning models, it does not mean you will deliver the best machine learning model for that specific use case. Business domain, context, [inaudible], that's the unfair advantage, and that's what I believe synthetic data is all about. So you can get the core, and the core needs to be strong and stable, but you always need to combine it with what the users or the business, in the end, can bring to get what is a reliable synthetic data generation.