VP of Product @ Pachyderm
Data Advocate @ Superb AI
As teams look to productionize their ML efforts, versioning, tagging, and labeling the data becomes even more difficult. The challenge for MLOps and DataOps teams will be to operationalize their data to better meet the needs of their end-users.
Chris: Hello, and welcome to another Pachyderm webinar. My name is Chris and I'm excited to welcome you all today. We have a great session lined up. Before we get started, I'd like to go over a few housekeeping items. If you have any video or audio issues while watching today's webinar, please try refreshing your window or browser. Today's webinar is being recorded and we will share the recording after the webinar. In addition, if you have any questions for the presenters, please be sure to use the Q&A widget on your screen. If we don't get to your questions today, feel free to reach out to the presenters or post your question in the Pachyderm Slack channel. Today's webinar is called Automating DataOps at Scale for Computer Vision. And with that, I'll hand it over to our speakers to introduce themselves.
Lubos: Hey there, everyone. My name's Lubos Parobek and I'm VP of product here at Pachyderm.
James: Hey, everyone. My name is James Le. I'm currently leading data relations here at Superb AI.
The first is data collection, prep, and labeling, often from disparate sources. Next is experiments and collaboration to determine the best data and the best algorithms required for a successful model. Thirdly, we move into training and evaluation to determine if the selected models are actually performing as we'd like for their applications. And then lastly, we'd actually go deploy and monitor these applications so that we can assess their performance in the real world. And it's important to note that this isn't a linear process that goes step by step, but rather you can see there's some concentric loops in there where we go back and inform the previous steps based on how the latter steps are performing. And mastering and scaling this process is how teams mature their machine learning life cycle and it takes us into the next slides here.
So typically, what we see is that every org goes through a common maturity journey as they move from exploratory efforts around ML to productionizing their first models and on to scaling across the organization. Often teams will spend a lot of time in this first exploratory phase, which is all about proving that they can use ML to provide business and customer value. This first process is often kind of manual, slow, and error prone. It often involves a small team of data scientists and data engineers scrambling together to gather data from a bunch of different sources and do some work locally in notebooks. There's typically very little automation. And again, the goal here is really to just get that initial proof of concept out that shows that there's real business and customer value.
Now, while there's a lot of focus on this initial step, it's really in the productionize step that we start to see the first value emerging. In this step, it's really about how you build out and utilize a set of MLOps tools to automate the process and productionize that machine learning life cycle that we talked about. By doing so, you start to really see the deployment of ML models, and it lets you scale out across many different models and allows for better team collaboration. It's all about being able to get these models out and iterate on them effectively in order to improve the business and customer results you're looking for.
And then finally, after one or two use cases have been rolled out successfully, you start to figure out, well, how can I leverage this across the entire organization? How can I have every single team utilize these best practices? And this is where some of the biggest challenges emerge. Often, you'll see different teams want to use different tools. And so there's this need to standardize and scale the MLOps tool chain across all the different teams in your organization. You're going to want to be able to train, deploy, and then retrain and redeploy models frequently and with confidence. You want teams to be able to collaborate and share and not have to reinvent the wheel as different teams jump into ML. And you want these teams to be able to deliver results, in other words, autonomously but in a coordinated fashion, again, by reusing those best practices and tools that you've been able to put together. So before we head to the next slide, let's pause there and look at a poll asking where you are in your machine learning life cycle. We'll give you a few seconds to fill that out. And Chris, let me know when we're ready to move to the next slide.
Chris: That's good. And for folks who don't know how to use the platform, on your top right, you should be able to see a poll button, and you should have the poll right there. So feel free to put your response in while I give our own folks a few minutes to collect the responses there.
It looks like we're just wrapping up on the poll now. People are still submitting, but it looks like a good percentage of folks are exploratory, a good percentage are productionizing, and about 10% are at scale right now.
Lubos: Awesome. Awesome. Okay. Great. Well, one of the interesting and unique things about AI and ML is that as you're going through this maturity journey, there are two key elements that everything boils down to: managing your ML code and managing your data. So the code part, which is above the water in this iceberg image, your notebooks, your scripts, is really what everyone talks about and focuses on. For example, what new algorithms did you use? What libraries and languages show the most promise? Are there new approaches like AutoML that allow you to write less code? But what we've seen is that while there's a lot of attention on the code side of things, often the biggest challenges that impede ML maturity actually lurk below the surface. They're in the data, kind of below the water in this iceberg image. That's because the data is much harder to see, manage, and scale, and it's often an area that a lot of teams aren't familiar with. And it's important to note that when we talk about data in this context, we don't just mean raw data. We mean all the intermediary data artifacts that are produced throughout this process. So for example, clean data sets, labeled data and annotations, model feature artifacts, evaluation metrics, metadata. All this data needs to be developed, managed, and versioned in order for us to mature our ML practice.
So one of the things that is fascinating about ML is all of the analogies between this ML maturity cycle and DevOps. So let's look at some of the key concepts in DevOps and how they map over to MLOps and the data part of these two concentric loops. Let's start with code versioning: just like you need code versioning for your source code, you need data versioning to track, version, and understand data changes as you go through the different steps of the life cycle. CI/CD is the same. CI systems allow you to reliably release code that's tested, which gives you confidence that when you release that code it's going to work. The same is true for data pipelines. Data pipelines allow you to automate and scale data processes and chain together different dependencies. This is important if you really want to automate, scale, and enforce best practices in the MLOps chain. Just like you wouldn't push code into production until it passes CI, you really don't want to push a model that you've developed in a notebook until you've gone through a set of pipelines to validate it.
Another good example and analogy is debugging tools. There's a host, of course, of code debugging tools that allow engineers to quickly find the root causes of any bugs and issues. And you want a similar set of features on the data side: data reproducibility, data diffing, and data lineage that let you quickly go back and see where the problems are with your data, debug them, and trace the full lineage so you understand how you got to the end result with your ML workflows. The last area is around microservices. Microservices in DevOps were really a revolution in allowing teams to act in an autonomous manner without reinventing the wheel. They did this through contracts, through service interactions, and through APIs. Working with data, you can do the exact same thing in an ML workflow by modularizing your data flows so that they can be cloned, built upon, and used by different teams, again enabling collaboration while also working in a very autonomous manner. So with that, I'd like to turn it over to James to dig in and double-click into how to connect DataOps with these ideas around MLOps and the ML lifecycle.
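To make the data versioning and data diffing ideas above concrete, here is a minimal Python sketch. It is an illustration only and not how Pachyderm or any specific tool implements versioning: the function names `snapshot`, `commit_id`, and `diff` are hypothetical, and the "files" are small in-memory byte strings standing in for images.

```python
import hashlib


def snapshot(files: dict) -> dict:
    """Map each file path to a SHA-256 digest of its contents."""
    return {path: hashlib.sha256(data).hexdigest() for path, data in files.items()}


def commit_id(manifest: dict) -> str:
    """Derive a short version id from the sorted (path, digest) pairs."""
    h = hashlib.sha256()
    for path in sorted(manifest):
        h.update(f"{path}={manifest[path]}\n".encode())
    return h.hexdigest()[:12]


def diff(old: dict, new: dict) -> dict:
    """Report which paths were added, removed, or modified between two versions."""
    return {
        "added": sorted(new.keys() - old.keys()),
        "removed": sorted(old.keys() - new.keys()),
        "modified": sorted(p for p in old.keys() & new.keys() if old[p] != new[p]),
    }


# Two hypothetical versions of a tiny image data set.
v1 = snapshot({"img/001.jpg": b"cat", "img/002.jpg": b"dog"})
v2 = snapshot({"img/001.jpg": b"cat", "img/002.jpg": b"dog-relabeled", "img/003.jpg": b"bird"})

changes = diff(v1, v2)
print(commit_id(v1), "->", commit_id(v2))
print(changes)
```

Because the version id is derived purely from content, two identical data sets always get the same id, which is one simple way to reason about reproducibility.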
James: Thank you, Chris, so much for providing that high-level understanding of ML maturity as well as how ML can really learn from DevOps. So yeah, just kind of remaining on that track, I wanted just to say-- I wanted to talk a little bit about this concept of DataOps, which really originates from the world of data analytics and the AI world. The main difference between DataOps and DevOps is that DevOps transforms the delivery of software systems used by software developers. On the other hand, DataOps transforms the delivery of intelligence systems and analytics models used by data analysts. So as you observe here in the slide, the goal of DevOps is to synergize engineering, IT operations, and quality assurance to lower the budget and the time you spend on the software development and release cycle. With DataOps, we have an additional layer of data, so the goal of DataOps is to synergize data engineering, data analytics, and IT operations to improve the efficiency of acquiring raw data, building data pipelines, and generating actionable insights. Now, if we're talking about using data for machine learning applications, we need to add an additional layer, data science and ML engineering, into this equation, aka MLOps. So just like how DevOps has made a tremendous impact on software engineering teams, DataOps can also fundamentally redefine how data analytics teams, and then later on, ML teams function. Because without DataOps, there is no connection between data pipelines and no collaboration among data producers and data consumers. This will inevitably lead to manual efforts, duplicated code, an increased number of errors, and a slower time to market.
So in the next three slides, I'm going to argue the case for bringing the DataOps discipline to the real-world development and deployment of [inaudible] system. The first reason is that in most real-world machine learning projects, which include computer vision, the data is more important than the models, right? So getting better data might be the single best bang for the buck in terms of improving your model performance. This is in sharp contrast with academic machine learning, which emphasizes the modeling component rather than rethinking the data product. And so there are so many data-centric low-hanging fruits that we're currently missing, as you can see here in the slide as proposed by [inaudible], like data creation, data cleaning, data annotation. We can always find additional pieces of data that provide insight into a completely new aspect of the problem rather than tweaking the loss function, for example. Whenever the model performance is bad, we should not only resort to investigating the model architecture and tweaking our [inaudible]. We should also attempt to identify the root cause of bad performance in the data.
So the discipline of DataOps really helps you explore, sample, and collect only the data points that, first, are worth being labeled, and second, give the most value with respect to a given task. Best practices in DataOps can usually help you accomplish some of these tasks without too much hassle. Another reason that I'm going to bring up is that unstructured data preparation is very challenging. If you have a lot of errors in your labels, then you're going to create a lot of errors in your model as well, right? Due to the growing number of data sources, the diverse nature of modern data, and the increased complexity of downstream usage, it becomes very challenging to ensure data quality [inaudible] label quality. In computer vision, practitioners have to deal with unstructured data that is schema-free and does not fit older forms of data storage, processing, or analysis. And so preparing massive volumes of unstructured data in computer vision poses a variety of challenges.
For instance, dealing with the unpredictable nature of [inaudible] video, managing the labeling workforce, allocating sufficient labeling budget, addressing data privacy, and much more. So before carrying on with the modeling component, the labeled data really needs to be audited for correctness and quality. This can only be done via rigorous data processing, data transformation, data augmentation, data visualization, and label validation to see if the data can serve as part of the training set. So I believe that DataOps can help teams spend time on gathering high-value data and create valuable data sets by filtering out all the irrelevant data points using best practices such as continuous testing, for instance. And [inaudible] also mentioned a bit earlier on the slide about CI/CD, right? So continuous integration and continuous deployment. With that discipline, data engineers can automate the data preparation workflow more seamlessly.
And then, finally, building computer vision applications is [inaudible]. The workflow diagram that you see here on the screen comes from a talk given by [type of progress?] to discuss the two loops of building algorithmic products. The first loop is called the algorithm development loop, and it entails three separate steps. First, a scientist builds the algorithms using some sort of algorithmic framework like TensorFlow, [inaudible], or [inaudible], for instance. Then the algorithm is measured against a versioned test set, right, using [an addition?] spec created by the product owner and some other [inaudible] testing tools in-house. And finally, up to [inaudible], you've got the evaluation metrics, and you can learn from those metrics what some of the failure cases are. You use some sort of error analysis or manual review to dissect some of those failure cases and then come up with new ideas to build the new algorithms, right? So that is sort of the first loop.
The second loop is the product development loop. There are also three separate boxes, or phases, in this loop. First, you build a product: a [job?] software engineer can build a product using microservices or any other infrastructure tooling in-house. Then you deploy the product into production and you measure the performance of that product in production using monitoring and logging tools. And then, after you've got the measurement in production, you collect live performance data and you can learn from the failure cases [industrial usage?] via a dashboard or analysis of live performance data, using some of the insights from that learning to close the loop and build the next iteration of the product, right? And so really the [inaudible] in DataOps can speed up the iteration of both of these two loops, which can work together [inaudible] way if we take a data [centric message?]. The first way is you can go from step three, building [inaudible], to step four, right? After building the algorithms, the scientists can provide those algorithms to the ML engineers, who can then convert them into computer vision products.
Another way that these two loops can work together is going from step six to step one, where, after learning from the failure cases of the product in production, the scientists can sample and annotate only the [inaudible] error-prone data points. They then combine those new data points with the existing training set that you have to create a new training set for the next iteration of the algorithm development loop. So that's another way to-- the idea is to curate a better data set for the next iteration. So really, software engineering best practices, which are like a small component of DataOps, can help ensure that both of these steps that connect the two loops are standardized and carried out without too much issue.
So this is sort of my proposal, almost like a work-in-progress idea for what an ideal DataOps stack might look like for modern computer vision teams, right? In computer vision, DataOps really means building a high-quality training set. This requires us to ask a lot of questions, such as what data, where to find that data, how much data, how to validate that data, what defines data quality, where to store that data, and how to organize that data. So given the complexity and the variety of scenarios, [inaudible] question, we need to have specific phases of the data pipeline to address each of those, almost like pillars of data development. The first phase is data acquisition. When we talk about data acquisition, we tend to think mostly about data collection. There are both physical and operational considerations with this, ranging from where to collect the data to how much to collect initially. And if we want to get either question right, we need a feedback loop that aligns with the business context.
Besides data collection, we can also generate synthetic data. There are so many different use cases, ranging from facial recognition in life science to e-commerce to autonomous driving. However, generating this synthetic data requires monitoring data as well as compute. Furthermore, synthetic data is not realistic for some of the unique use cases, and there's a lack of fundamental research on its impact on model training. A third approach to data acquisition is known as data scavenging, where we scrape the web and use open-source data sets to get our data; think about [inaudible], academic benchmarks, and Google Dataset Search. And finally, we can also acquire data by purchasing it online or via a third-party vendor. In that case, the data are legally collected and well organized. So as you see, even within that small acquisition block, there are so many different ways that we can acquire data: collect it, scavenge it, and then purchase [inaudible], right?
The second phase of the DataOps pipeline is data annotation, or labeling as we all know it. This is an industry of its own because we have so many questions to answer. First, who should label the data? Is it humans or the machine? If we want to try out a human-in-the-loop approach, do you want to train your own annotators or use [inaudible] from Mechanical Turk? What vendor do you choose, or what annotators will you look for, right? The second question is how should the data be labeled? If you use tools, should they be open source, in-house, or an on-prem solution? And furthermore, how should the labeling instructions be written? This is also an important consideration. And then third, what data should be labeled, right?
So getting the data annotation step right is extremely complicated because it is error-prone, slow, expensive, and often impractical. An efficient labeling operation requires a vetting process, qualified personnel, high-performance tools, instant lifecycle, a versioning system, and a validation process. Even after getting the labels, we are not done yet. We need to validate the labels. This can be accomplished manually by annotators, an internal team, or a third party. To validate the labels, there are a couple of challenges: how do you vet the annotators in advance, how do you separate honest mistakes from fraudulent cases, how do you do cross-validation and statistical analysis for explicit and implicit quality assurance? Or how do you find labeling errors during model inference [inaudible]? So those are some of the challenges that we have to think about in the second phase.
The third phase is data debugging. This really entails writing expectation tests to assess the data that we're processing and getting into the system. Essentially, these are the unit tests for the data. They are designed to catch data quality issues and vet data before they make their way into the DataOps pipeline. Then the fourth phase is data augmentation. Data augmentation is a scientific process where we manipulate the data via flipping, rotation, translation, or changing color, for example. However, scaling data augmentation to bigger data sets, [inaudible] memorization, and handling [inaudible] are some of the fundamental issues you have to deal with when you apply augmentation. And then the fifth phase of the DataOps pipeline is what I call data transformation. Within the data transformation phase, there are also three separate steps. The first step is data formatting. This is a small slice of the whole data engineering task: the interface with data warehouses, data lakes, and data pipelines.
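As a rough illustration of the "unit tests for the data" idea in the data debugging phase, here is a minimal Python sketch of expectation checks on bounding-box annotations. Everything in it is an assumption made for illustration: the `Box` type, the `check_annotation` function, and the specific rules (known label, positive size, box inside the image) are hypothetical and not taken from any particular tool mentioned in the talk.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """A hypothetical bounding-box annotation in pixel coordinates."""
    label: str
    x: float
    y: float
    w: float
    h: float


def check_annotation(width, height, boxes, allowed_labels):
    """Return a list of human-readable violations for one labeled image."""
    problems = []
    if not boxes:
        problems.append("image has no boxes")
    for i, b in enumerate(boxes):
        if b.label not in allowed_labels:
            problems.append(f"box {i}: unknown label {b.label!r}")
        if b.w <= 0 or b.h <= 0:
            problems.append(f"box {i}: non-positive size")
        elif b.x < 0 or b.y < 0 or b.x + b.w > width or b.y + b.h > height:
            problems.append(f"box {i}: extends outside the image")
    return problems


allowed = {"car", "person"}
good = check_annotation(640, 480, [Box("car", 10, 20, 100, 50)], allowed)
bad = check_annotation(640, 480, [Box("cat", 600, 400, 100, 100)], allowed)
print(good)  # no violations
print(bad)   # unknown label and out-of-bounds box
```

Checks like these can run as a gate before labeled data enters the rest of the pipeline, so a record with any violation is held back for review instead of reaching the training set.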
The second step is feature engineering. This includes concepts such as feature [inaudible], measurement of correlation, management of missing records, and going from feature selection to feature embeddings. And then the third step is data fusion, which means you want to fuse data from different modalities, different sensors, and different timelines. And I [inaudible] more about this sixth block, the final phase, called data curation, which I think is definitely one of the most important and underinvested phases currently in the modern computer vision stack. I think that it can serve as the bridge between the DataOps pipeline and the MLOps pipeline. And the MLOps pipeline, essentially, is about model experimentation, model training, model deployment, and model monitoring, all the phases that are much more focused and more [inaudible] the data set, right?
So data curation is essentially the belief that because the data set is so large, we need to be picky about the types of data that we're going to use, and thus we need to catalog and structure the data and only curate the data points that make the most sense to be used for your training set. I believe that in any data set, there could be high-value data that is useful, redundant or irrelevant data that is useless, and then mislabeled and low-quality data that is harmful, right? So you've got useful, useless, and harmful, and you only want to curate the useful data for your training set. So yeah, that's kind of my idea of how this DataOps [inaudible] can look. And then there's a final phase here that you see called [inaudible]. If it performs badly on some [inaudible] cases, you can dissect [inaudible] in finer [criteria?] and then use them as the new data points for your new training set in the second iteration of your development. So that's the way the data-centric [inaudible] looks.
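The useful / useless / harmful split described above can be sketched in a few lines of Python. This is only one possible heuristic, stated here as an assumption: exact duplicates (by content hash) are treated as useless, and records whose label disagrees with a confident model prediction are flagged as potentially harmful. The record fields (`pixels`, `label`, `predicted`, `confidence`) and the threshold are hypothetical, and a flagged record would still need human review before being discarded.

```python
import hashlib


def curate(records, confidence_threshold=0.9):
    """Split records into useful, useless (redundant), and harmful (suspect label)."""
    buckets = {"useful": [], "useless": [], "harmful": []}
    seen = set()
    for rec in records:
        digest = hashlib.sha256(rec["pixels"]).hexdigest()
        if digest in seen:
            # Exact duplicate of an earlier record: adds no new information.
            buckets["useless"].append(rec)
        elif rec["predicted"] != rec["label"] and rec["confidence"] >= confidence_threshold:
            # A confident model disagrees with the label: possibly mislabeled.
            buckets["harmful"].append(rec)
        else:
            buckets["useful"].append(rec)
        seen.add(digest)
    return buckets


records = [
    {"id": "a", "pixels": b"a", "label": "cat", "predicted": "cat", "confidence": 0.95},
    {"id": "b", "pixels": b"a", "label": "cat", "predicted": "cat", "confidence": 0.95},
    {"id": "c", "pixels": b"c", "label": "dog", "predicted": "cat", "confidence": 0.97},
    {"id": "d", "pixels": b"d", "label": "dog", "predicted": "cat", "confidence": 0.55},
]
buckets = curate(records)
print({name: [r["id"] for r in recs] for name, recs in buckets.items()})
```

Only the `useful` bucket would go straight into the next training set; the low-confidence disagreement in record `d` stays useful here because the model itself is unsure.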
And so both Superb AI and Pachyderm are members of a relatively new organization called the AI Infrastructure Alliance, AIIA for short. The goal of the alliance is to bring together the essential building blocks for the AI applications of today and tomorrow. Right now, we are seeing the evolution of a canonical ML stack. It's coming together through many different people, partnerships, and organizations, and no one group can do it alone. That's why we created the alliance as a focal point that brings together many different groups in one place. The alliance and its members, which I think right now number probably 50 to 60, bring clarity to this quickly developing field by highlighting the strongest platforms and establishing clean APIs, integration points, and open standards for how different components of an enterprise ML stack can and should interoperate, so that organizations can make better decisions about the tools that they deploy in the ML application stack of today and tomorrow.
So as [inaudible] here in the workflow diagram, right, the Superb AI product covers the data access, data ingestion, and data label [inaudible]. On the other hand, the Pachyderm product covers the data versioning, metadata store, experimentation engine, training engine, and data engineering orchestration stages of the stack. So yeah, this is sort of how we fit into the stack, and we also collaborate with a lot of other vendors within the AIIA to figure out how the tools can interoperate and give you more options for your own use cases. So now I'm going to switch it back to Lubos and let him discuss some of the benefits of using the Pachyderm product.
Lubos: Awesome. Awesome. Thanks, James. Yeah, absolutely. So we're going to-- next, both James and I are going to kind of dig into both Pachyderm and Superb AI to give our audience here an idea about how we can actually do this practically with the two products. So let's talk a little about Pachyderm. Pachyderm is the data foundation for machine learning. What we provide are data-driven pipelines and data versioning and lineage that support the entire machine learning loop, kind of what James just walked through. A simple way to think about Pachyderm is that we provide a way to apply your data against your code in an iterative and trackable manner, again, throughout this entire lifecycle. So let's look at a couple of the key features and benefits of using Pachyderm as your data layer for MLOps.
So first, Pachyderm allows you to add a lot of automation to your data tasks through our flexible pipelines. These pipelines are completely code and framework agnostic so that you can use the best tools for your particular ML applications. The market right now is moving extremely quickly, and so when your teams discover brand new tools they want to use for data transformations and curation, we want to make sure that they're able to be used within our data layer. Secondly, our capabilities are also highly scalable and particularly optimized for large amounts of unstructured data. So think images, audio, video, genomics data, JSON files, etc. Everything in Pachyderm is just files, so we work with any type of data, and we can automatically parallelize your code to scale to billions of files. There is no additional code you need to write to be able to use the parallelization that comes with our versioning and pipelines. Because we understand versions and diffs of your data, we can offer some incredibly unique capabilities, such as incremental processing. We can process only when we recognize there are differences in the data or new data being added, and this can reduce the time it takes to process data by an order of magnitude.
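The general idea behind incremental processing can be illustrated with a short Python sketch. To be clear, this is not Pachyderm's implementation or API; it is a generic, assumed approach in which a content hash per file decides whether a (hypothetical) `transform` function needs to run again, and `process_incrementally` and `cache` are names invented for this example.

```python
import hashlib


def process_incrementally(files, cache, transform):
    """Run `transform` only on files whose content hash is new or changed.

    `cache` maps path -> (digest, result) from the previous run and is updated
    in place. Returns the list of paths that were actually (re)processed.
    """
    processed = []
    for path, data in files.items():
        digest = hashlib.sha256(data).hexdigest()
        if path in cache and cache[path][0] == digest:
            continue  # unchanged since the last run: keep the cached result
        cache[path] = (digest, transform(data))
        processed.append(path)
    for path in set(cache) - set(files):
        del cache[path]  # drop results for inputs that no longer exist
    return processed


cache = {}
# First run: everything is new, so both files are processed.
first = process_incrementally({"a.png": b"1", "b.png": b"2"}, cache, len)
# Second run: a.png is unchanged, b.png changed, c.png is new.
second = process_incrementally({"a.png": b"1", "b.png": b"22", "c.png": b"3"}, cache, len)
print(first, second)
```

With a large data set where only a small fraction of files change between runs, skipping the unchanged files is where the time savings come from.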
Lastly, we keep track of all the changes to your data, and this includes metadata, artifacts, and metrics. Again, in order to effectively automate the entire ML loop, you need to be thinking about data beyond just the data prep stage and into experimentation, training, and deployment. One of the key things about Pachyderm is that we fully enforce this data lineage through every step of the process. We use Git-like repos, commits, and branches to do so, so it's very intuitive. And you pretty much can't run a Pachyderm process without lineage being recorded. It's all tracked as a fundamental property of our system, behind the scenes, so ML teams don't need to worry about it or take any explicit action to get versioning and reproducibility. So one of the common questions we get is, "Hey, we're early, we're just exploratory, or just thinking about productionizing, so why worry about versioning, why worry about reproducibility now? Why can't I worry about it later on?"
Kicking the can down the road really causes some pretty serious ramifications down the line that can take months, if not years, to correct. So a few examples. Without robust, enforced data versioning, you end up with what we kind of call data entropy. Projects will inevitably end up as a mishmash of loosely connected Jupyter Notebooks, they get deployed manually, and therefore you lose reproducibility. You end up with data spread all over multiple systems, data lakes, and file formats, and you only have a vague idea of who's using what data. And even worse, you really aren't able to determine what data was used to generate a particular result. ML also has a special capability for incurring technical debt. It's got all the traditional challenges of software engineering plus an additional layer of complexity around machine learning. So for example, if you train a model but lose track of the inputs, as I mentioned, and the parameters used, there's literally zero ability to reverse engineer that, say, if you're dealing with a terabyte of data.
Another area that gets very complicated very fast is putting together glue scripts for pipelines and trying to reuse these for other purposes. It's very difficult. There are often hidden dependencies, and so without modularity or reusability, you end up reinventing the wheel or spending lots of time rewriting things to figure out how to reuse work that you've already completed. And lastly, without this kind of built-in reproducibility, you immediately get into some compliance risk, especially if you're dealing with sensitive data or you're in a regulated industry. So if you want to be able to move quickly without having compliance teams coming down on you in terms of how you're getting your results, you're going to really want to build in this reproducibility as early as possible.
So lastly, before turning it back to James to tell us about Superb AI, just a few examples of typical customers and use cases that'll be getting value out of Pachyderm. The first is Anthem. They're a great example of a life sciences company that's seeing huge processing efficiencies through Pachyderm's scaling benefits, so they're seeing 70 to 99 percent efficiency gains. Another great example is Royal Bank of Canada. They're an example of a highly regulated organization that's able to continue to iterate quickly while at the same time meeting those compliance requirements that we just talked about. And then lastly, LivePerson, as an example of a tech company that has got some pretty complex requirements in terms of data processing and model training and is able to achieve those through some of the automation features in Pachyderm. We've got a ton more case studies on our website so if you're interested in more details or seeing more examples, please visit us there. So with that, I'll turn it over to James to give an overview of Superb.
James: Yeah, thanks a lot, Lubos, for walking through the Pachyderm product. One of the reasons we partner is that shared focus on, as he said, the data foundation layer of the stack, right? So we're really sort of focused on the-- the product really is designed to help ML teams drastically decrease the time it takes to deliver high-quality training data sets. The goal is that, instead of relying on human labor for the majority of the data preparation workflow, teams can now implement a much more time- and cost-efficient pipeline using our product. Initially we really wanted to tackle this data-labeling problem in an automated, fast fashion. Our first approach to data labeling looks like the diagram that you see here on the screen. First you ingest all the raw collected data into our product, then label just a few images. Then you use one of our [inaudible] called Custom Auto-Label to auto-label those data in a very short amount of time without any custom engineering work. After that is done, you can apply that new Custom Auto-Label pre-trained model to the remainder of your data set to instantly label all of it. And then our Custom Auto-Label model will also tell you which images need to be manually audited, along with the model prediction, using some of our in-house uncertainty estimation methods.
So once you finish auditing and validating the small number of hard labels, you are ready to deliver the training data to your-- for model training, right? Then the modelers, the scientists, can train the model and get back to you with a request for more data, right? So if your model is low-performing, you probably need a new data set to [augment?] your existing [inaudible] data set. You can run this new data set through [inaudible] personal Custom Auto-Label prediction model and upload the model predictions into our platform. Then our platform will help you find and relabel the failure cases. And finally, you can train the auto-label model again on this [inaudible] to drive the performance up and close this feedback loop. This cycle repeats over and over again. The idea is that with each new iteration, your model will cover more and more edge cases and thus improve its accuracy as a result. So just to double-click on this Custom Auto-Label capability that I talked about on the previous slide, it's really a product that enables computer vision teams to do four things. First, to quickly spin up a model trained on their specific data set for rapid labeling. Second, to automatically surface hard labels with a combination of uncertainty estimation and deep learning techniques. Third, to build optimized [inaudible] data sets while retraining models for [specialization?]. And fourth, to expand to a number of use cases in rare scenarios with unique conditions and heavy subject matter expertise.
And so, just breaking down some of the key capabilities, almost like research technology, that are being [inaudible] into our Custom Auto-Label product. Number one is to reduce human verification time using uncertainty estimation, which uses [inaudible] under the hood. We developed these techniques in-house. Essentially, they let a Custom Auto-Label model measure how confident it is in its own labeling predictions. In other words, Custom Auto-Label outputs the annotation and simultaneously outputs how confident it is in each annotation. It therefore requests human verification only in the cases that it's uncertain about, thereby reducing the amount of work that goes into manual label validation. Secondly, Custom Auto-Label also adapts to new tasks with little data. Besides the common tasks and object classes like a car or a person, there are myriads of other object classes, domains, [inaudible] tasks, right? Generally, training a model on a new set of classes, domains, or tasks requires a significantly large amount of labeled data. Until you have that, you would probably have to rely on the manual labeling process. To remedy this problem and help our users benefit more from Custom Auto-Label on long-tail data, we use a combination of transfer learning and few-shot learning in-house to quickly adapt and tailor the proprietary models to your data in your specific application domain.
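The idea of requesting human verification only for uncertain annotations can be sketched roughly as follows. This is a minimal illustration and not Superb AI's implementation: the `AutoLabel` structure, the use of normalized entropy as the uncertainty score, and the `0.5` threshold are all assumptions made for the example.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass
class AutoLabel:
    """Hypothetical auto-label output: an image ID plus per-class probabilities."""
    image_id: str
    class_probs: Dict[str, float]


def normalized_entropy(probs: Dict[str, float]) -> float:
    """Entropy of the class distribution scaled to [0, 1]; higher means less certain."""
    if len(probs) < 2:
        return 0.0
    entropy = -sum(p * math.log(p) for p in probs.values() if p > 0)
    return entropy / math.log(len(probs))


def route_for_review(
    labels: List[AutoLabel], threshold: float = 0.5
) -> Tuple[List[AutoLabel], List[AutoLabel]]:
    """Split labels into (auto-accepted, needs human review) by uncertainty.

    The threshold is an assumed value; in practice it would be tuned per project.
    """
    accepted: List[AutoLabel] = []
    needs_review: List[AutoLabel] = []
    for label in labels:
        if normalized_entropy(label.class_probs) > threshold:
            needs_review.append(label)
        else:
            accepted.append(label)
    return accepted, needs_review


labels = [
    AutoLabel("img_001", {"car": 0.97, "person": 0.02, "bike": 0.01}),
    AutoLabel("img_002", {"car": 0.40, "person": 0.35, "bike": 0.25}),
]
accepted, needs_review = route_for_review(labels)
print([l.image_id for l in accepted], [l.image_id for l in needs_review])
```

Here the confident prediction for `img_001` is accepted automatically, while the nearly uniform prediction for `img_002` is queued for a human reviewer, which is the mechanism that cuts manual validation work.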
And then finally, Custom Auto-Label can also exploit the labels that come with the data for free. It utilizes self-supervised learning to pre-train the model on some of the popular application scenarios for computer vision. So let's say you work on one of those long-tail scenarios, a very niche scenario: you can select from our list of pre-trained models that have been self-supervised on each scenario and apply one to your domain-specific data set, and that can help with getting to a reasonable level of accuracy. So yeah, all in all, with Custom Auto-Label, computer vision teams can arrive at a model using a very small amount of their data, without any custom engineering work, to [automate?] more operations faster, so they can put more focus on some of the business-critical aspects like model [hospitability?] and scalable infrastructure for some of the long-tail problems.
So I want to quickly go over another capability that our product offers, which is validating labels at scale. This is a vital step to ensure label quality. Why is label validation essential? Because every labeler is different, right? It's almost impossible to train everyone to be 100% accurate at the beginning of a project. Labelers are human, so mistakes will occur at least occasionally, and these mistakes can quickly diminish the precision and value of the resulting model predictions. However, while necessary, label validation can be time-consuming when done in an ad hoc manner, so it is crucial to have a well-defined and streamlined review workflow in place. We recently released manual review, a powerful new set of features built to streamline the label validation workflow so that you can consistently collect high-quality labels without significant effort. This manual review release fits seamlessly into the existing workflow of the users of our product, which are the labeler, the reviewer, and the data project manager. And as you see here on this slide, there are two different steps that a label goes through on its way from raw data to finally being confirmed and validated, and then being used for quality assurance.
Overall, this workflow is a straightforward and modified process for cross-tracking that [inaudible] label data, thereby improving label consistency in the long run. Additionally, it enhances project [administration?] by [inaudible] insight into what needs to be reviewed or [reworked?] at the labeler, reviewer, and data project manager levels. So I can quickly talk about one case study, [inaudible] we recently worked with, called Fox Robotics, which is a London-based startup that develops robotics technology. Their solutions are being built to solve logistics problems across multiple areas such as manufacturing, aerospace, health, agriculture, and retail. They teamed up with us to label images faster using custom automation [inaudible] to their exact use cases, thereby accelerating the time to market for their autonomous robots and agriculture automation solutions. Some quick numbers: they observed a 72% reduction in the cost per annotation and five times faster labeling per image, and their label accuracy also improved significantly.
And so, given the pressing need for better tooling to support that development, Pachyderm and Superb AI have recently joined forces to bring data labeling and data versioning to [inaudible] operation workflow, and we [built?] an API integration that essentially provides an automated pipeline to version the labeled data from Superb AI. So if you are, like, a data engineer or a DataOps practitioner, you get the benefits of the Superb AI product to ingest your data, label the data, and administer your labeling workflows. Additionally, you get all the benefits of Pachyderm to version and automate the rest of your ML life cycle. Just one quick sentence on the technical aspect: basically, the pipeline automatically pulls the data from the Superb AI platform into the Pachyderm platform and then versions it as a commit. This works by creating a Pachyderm secret for your Superb AI access API key. That secret can then be used to create a pipeline that pulls your Superb AI data into a Pachyderm data repository. And we have a [inaudible] that will be sent to the attendees later on if you want to take a look. But the idea is that once you have that labeled data in Pachyderm, you can build the rest of your ML [inaudible] to test, pre-process, and train the model.
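As a rough sketch of what such a pipeline definition could look like, the snippet below builds a Pachyderm-style pipeline spec as a Python dictionary and prints it as JSON. It is not the actual integration: the field layout follows Pachyderm's pipeline specification as best understood here and should be checked against the official docs, and the image name, the `pull_labels.py` script, the secret name, the `access_key` key, and the environment variable name are all hypothetical placeholders.

```python
import json
from typing import Any, Dict


def build_ingest_pipeline_spec(
    pipeline_name: str,
    image: str,
    secret_name: str,
    project_name: str,
    interval: str = "@every 2m",
) -> Dict[str, Any]:
    """Build a cron-triggered pipeline spec that pulls labeled data on a schedule.

    Everything the container writes to /pfs/out becomes a new commit in the
    pipeline's output repository, which is how each pull gets versioned.
    """
    return {
        "pipeline": {"name": pipeline_name},
        # Cron input: triggers the pipeline on a timer instead of on new data.
        "input": {"cron": {"name": "tick", "spec": interval, "overwrite": True}},
        "transform": {
            "image": image,  # hypothetical image containing the pull script
            "cmd": [
                "python3", "/app/pull_labels.py",  # hypothetical script
                "--project", project_name,
                "--out", "/pfs/out",
            ],
            # Expose the stored access key to the container as an env var.
            "secrets": [
                {
                    "name": secret_name,
                    "env_var": "SUPERB_AI_ACCESS_KEY",  # assumed variable name
                    "key": "access_key",  # assumed key inside the secret
                }
            ],
        },
    }


spec = build_ingest_pipeline_spec(
    pipeline_name="superb-ai-labels",
    image="example.org/superb-ai-ingest:latest",
    secret_name="superb-ai-creds",
    project_name="my-cv-project",
)
print(json.dumps(spec, indent=2))
```

Keeping the access key in a secret rather than in the spec itself means the pipeline definition can be shared or committed without exposing credentials, and downstream pipelines can take the resulting repository as their input for testing, pre-processing, and training.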
So yeah, overall this is exciting work that we've been doing. We would love to hear your feedback on the different ways we can interoperate, both between the two of us and with the other tooling and infrastructure companies in the ecosystem, to ideally help you prepare data more easily and more efficiently for your computer vision projects. And with that, here are some of the resources that we mentioned throughout the presentation. We wrote a blog post, as you can see here on the slide, that discusses the integration in more detail. And then here are the two links that you can use to get a free trial of both of our products, Pachyderm [inaudible] and the [inaudible]. And with that, I'll turn it over to Chris for some remarks and to facilitate the Q&A.
Chris: Awesome. Thanks, folks. Great presentation. Looks like we have just a few minutes left for Q&A. If you haven't already sent your Q&A question please do so now. Just put it in the questions channel and we'll get to it as soon as we can. Looks like we have one question that came in. First question is what are the critical capabilities of an ideal DataOps platform? Let me hand it to Lubos first. And then, James, you want to jump in afterward with your comments as well for that.
Lubos: Yeah. That's a good question. I often think about it in terms of my previous life in DevOps and what best practices there we can learn from. Just like in DevOps, in DataOps we're trying to iterate quickly in a consistent and reliable manner. So there are a couple of characteristics that that points to. One of those is automation. James talked about the auto-label model as a good example of that, right? Basically, in order to iterate quickly and do that reliably, we need to remove human error from the system. So thinking about which automation capabilities you can build into a DataOps platform to remove human interaction is a key thing. Another thing to think about beyond automation is scalability. We're dealing with data, often unstructured data, as in our example today. So thinking through what it means for my platform if I'm dealing with petabytes of data is critical, and thinking through the ramifications from a versioning perspective, a pipeline perspective, etc., is key. And then the last is, of course, we're talking about a platform, and a platform connotes that we want to be able to leverage this across our organization. We want different teams to be able to use it. So what are the aspects we need for reusability but also autonomy? How do we have different teams utilize [inaudible] platform but also be able to handle their unique requirements? So automation, scalability, collaboration, and autonomy are some of the things that I would definitely think about in terms of a DataOps platform.
Chris: Cool. James, I guess anything to add for what critical capabilities folks need to add to the DataOps platform?
James: Yeah, I mean, I did have that slide. I think really-- like Lubos was mentioning, reusability and some of those other automation questions. How can we go from acquisition to maybe curation faster, essentially? Because right now the workflow is quite manual, so I think we need sophisticated tooling that is really tailored to the customer's needs. So critical capabilities: some data performability, just the ability to adapt quickly to specific data sets, because I think it's hard to generalize to all the possible use cases. So yeah, that's [inaudible] I wanted to add, and I think data curation is really the key part of that puzzle.
Chris: Got it. Perfect. It looks like we have time for just maybe one more question here. Apart from DataOps, do Pachyderm or Superb AI also help in MLOps in some way or other? I'll pass this off. James, did you want to get the first stab on this question?
James: Yeah, so I think from a Superb perspective, we collaborate with other companies that have a focus on MLOps. Our goal is to let a practitioner use different tools to tackle the end-to-end workflow. Right now our product is focused on the data component, but we are doing integrations with, let's say, companies that focus on monitoring or experimentation. We'll talk more about that; we could release some of those in the upcoming months. But the idea is that, as a practitioner, you can utilize that integration and then use our partners' modeling or monitoring capabilities for the ML stuff. So yeah, it's not a direct focus of ours, but it's a byproduct of our partnerships, and I think that's a pretty helpful way of looking at things, because it's really hard to develop something that covers all possible problems in a classical development. Yeah.
Lubos: Yeah, I guess I would think about it from the perspective of, again, that lifecycle. We talked a lot today about the prepare step and how Pachyderm and Superb AI help with that. But that data flow, the need for data versioning, really extends beyond the prepare step or DataOps and continues into the rest of the lifecycle. So whether you're running experiments and need to keep track of which algorithms and data produced a particular result, or you're in the training step and need to keep track of versioned models, there continues to be this need to keep track of both the data and the associated code. And so that's a key place where Pachyderm can help across that ML lifecycle.
Chris: Great. And it looks like we are just about out of time. I just want to take a moment to say thanks, everyone, for joining today's webinar. Again, this is being recorded, and we will send out the recording link after this is done. If you have any questions for the Superb AI or the Pachyderm team, please feel free to reach out and email us with your questions. And with that, thanks to our speakers and thanks, everyone, for joining us today. We'll see you at the next event. Thank you.
James: [inaudible] attending and yeah, I hope you learned something new from this presentation.