Mastering DataOps for Machine Learning

Lubos Parobek

VP of Product @ Pachyderm

As teams look to productionize their ML efforts, versioning, tagging, and labeling the data becomes even more difficult. The challenge for MLOps and DataOps teams will be to operationalize their data to better meet the needs of their end-users.

Join us as we cover:

  • What is DataOps, and how are teams looking to scale their AI/ML workflows
  • The importance of data preparation (versioning and labeling)
  • How teams can automate these workflows into robust pipelines

Webinar Transcript

Hello, everyone. I'm Lubos Parobek, VP of Product and Go-to-Market here at Pachyderm. Today we're going to be talking about DataOps and how to master it for machine learning. Today's agenda is going to cover three topics. First, we're going to discuss the journey most teams take as they move from exploratory efforts in machine learning to being able to regularly and reliably deliver new machine learning applications to their customers. Next, we're going to dive into how DataOps fits into this machine learning journey and the specific challenges and opportunities of DataOps in this context. Lastly, I'll provide a brief overview of Pachyderm and how our data versioning, pipelines, and lineage help teams tackle these opportunities and challenges and accelerate their MLOps maturity journey. Almost every organization goes through a common maturity journey as they move from exploratory analysis - in other words, trying to figure out how machine learning can help their business - to getting that first model into production, and then finally expanding that success to multiple teams, use cases, and lines of business within their organization.

Starting the Journey to Mature DataOps

Now, this initial phase is typically manual. It's slow. It can be error-prone, and it typically involves a small team that's scrambling to gather data from a bunch of different sources, doing some development locally in notebooks, and probably using little or no automation. And the goal of this exploratory phase is simply to establish an initial proof of concept and help separate what's real from what's hype in machine learning. And while a lot of early ML teams really focus on this phase, it's really after this step where the real challenges begin. Actually getting exploratory work into production efficiently and reliably, and proving that you can generate real business value, entails a whole different set of challenges. You need to build out and utilize a complex set of MLOps tools to automate and productionize the ML lifecycle, and we'll talk about that more in a moment. But doing so begins to speed up the deployment of ML models and lets you iterate more quickly. It lets you start to scale out across many different models and allows for better team collaboration.

Finally, after one or two teams have successfully productionized the ML development workflow for their use cases and are generating real value, other lines of business will also want to leverage machine learning. Now, without some centralized guidance and tooling, every different team might try to adopt a different approach and different tools. So to avoid reinventing the wheel every time, you'll want to standardize and scale your MLOps tool chain and process across all teams in your organization. And the purpose of this is to make it easy for teams to train, deploy, retrain, and redeploy models frequently and with confidence. And you're going to want to make sure that teams can collaborate, share, and build on each other's work to deliver results autonomously but in a coordinated fashion. You want them to be able to leverage each other's work. And throughout these iterations and this journey, you're going to see that everything boils down to two key components, and this will really be our focus in this presentation: managing your ML code as well as managing your data.

Managing Code and Data for MLOps

So as I mentioned, the machine learning lifecycle is a key thing here that we're trying to mature. And it's really the workflow for how you get machine learning applications from an idea to production. And it centers around, again, this idea of managing both data and code. And there can be some variance, depending on the use case, but it typically boils down to four major steps for most organizations. The first is the preparation of data so it's ready for experiments and training. The second is actually doing experiments and exploration to determine if your data sets and your algorithms are correct for the successful models you're trying to create. Third is training and evaluation of those models to see if they're successful for their desired applications, and lastly is the deployment of those models to production and then monitoring them. Importantly, these aren't linear steps that are executed one by one. They are incredibly iterative and are more like a living, breathing system that needs to be deployed and built out over time. Automating and scaling these processes is how teams mature their machine-learning practices.

So again, the first step in this process is data preparation. And this data preparation step is a major focus of DataOps as it relates to machine learning and can encompass activities such as data collection, often from a variety of sources, the transformation of that data, labeling of data, and finally, validation to ensure that data meets expectations and is, in fact, ready to be used in the rest of the machine learning life cycle. But it's important to note that data management activities - for example, versioning, tracking, and processing data - continue throughout the machine learning life cycle in the experiment, training, and deploy steps. In fact, DataOps has a large role to play here in terms of tracking intermediate results, data sets, metadata, and evaluation metrics to ensure the end-to-end reproducibility of results. This reproducibility helps immensely with data debugging as well as with audit and compliance requirements.

How Can Data Teams Support ML?

Let's look at some of the unique challenges that DataOps can face when supporting machine learning specifically. First, machine learning typically requires very large data sets that could often come from a variety of disparate sources. These data sets also tend to change frequently due to new or changed data. One example is using natural language processing for sentiment analysis. Large amounts of data might be coming in constantly from support emails and surveys, social media, and even voice transcripts. Scaling and automating to handle this large amount of disparate data can be challenging. Secondly, often, multiple teams will be looking to leverage the same data set for different ML applications. For example, one team might be looking to do sentiment analysis on support communications, while another team might be looking to use the same data set for a chatbot. To make things even more complex, often, multiple engineers and data scientists on the same team may want to access these same data sets independently to explore new ideas and optimizations through experiments.

Lastly, some of the most exciting and valuable ML applications, like natural language processing, as an example, use unstructured data. And unstructured data is basically data that does not fit nicely into database tables but typically lives in files. Examples of this type of data include text documents, images, audio, and video. Some of the more common challenges of unstructured data include being able to process a large variety of file types, file sizes, and large volumes of files efficiently. In addition, supervised learning is one of the most popular approaches to ML, and it very much depends on unstructured data and requires the labeling of data. And this labeling of data can be very manual, expensive, and time-consuming. Of course, DataOps teams don't have unlimited budgets or time to deal with these challenges. They need to handle these new challenges in a cost-effective and timely manner that meets the needs of the data science teams they're supporting. To keep costs under control, they need to apply new automation and scaling strategies, for example, being able to horizontally scale out processing while avoiding any reprocessing of data when not necessary.

DataOps teams also need to make sure that facilitating collaboration is a high priority. They need to enable a variety of teams and individuals on those teams to be able to share data sets effectively. This typically entails a robust data versioning strategy that allows these data sets to be shared via concepts such as branching while a single source of truth is maintained. A successful DataOps approach and strategy will unlock several opportunities to accelerate machine learning in your organization. They include allowing data science teams to iterate faster on data. Preparing data can be very time-consuming, and that can quickly become a bottleneck for the rest of the ML lifecycle. Making sure that data science teams have the data they need when they need it for experiments and training is a key opportunity to increase the effectiveness and maturity of your machine-learning lifecycle. Another opportunity relates to reducing the effort and errors associated with manually triggering data flows or manually tracking data changes. Automating these tasks is a key opportunity to save time and improve reliability. A successful DataOps approach could also ease data debugging and data governance by providing a reliable and comprehensive version control system for your data sets. This is key to not only improving productivity but meeting regulatory requirements.

Expanding DataOps to Unstructured Data

Lastly, as we discussed, some of the most innovative ML use cases are being built upon unstructured data, like images, video, and text documents. Let's take a look at a few examples. NLP, for example, includes things like sentiment analysis, voice-to-text transcriptions, or analysis of financial or legal documents. Computer vision examples include medical diagnosis of X-rays or CAT scans and facial and voice recognition for authentication. Examples of geospatial analysis include analysis of satellite imagery for crop yield forecasting or recognizing logistics bottlenecks. A DataOps approach that makes unstructured data a first-class citizen will allow your data science teams to more easily pursue these high-value use cases. This includes making sure that your DataOps approach can handle ingress and processing of very large numbers of files as well as being able to handle individual files that are very large in size. For example, it's not unusual for high-resolution satellite images to be hundreds of gigabytes in size. Lastly, your DataOps approach should be able to handle any file type.

Okay. Let's switch gears now and look at what a typical DataOps pipeline looks like. Most commonly, it will be composed of three parts: ingest, transformation, and validation. Ingest is the first step and enables raw data to be brought into the system from multiple sources, like databases, CSV files, and object storage, and then combined. Ingest typically supports both batch processing and streaming. To prepare ingested data for experimentation and learning, it needs to go through varying degrees of transformation. Examples of this type of transformation include standardizing file names, converting image files to a standard format, or removing extraneous columns from a CSV file. The last step is data validation, which seeks to ensure the quality, usefulness, and accuracy of the data. Validation increases confidence that the data being consumed is clean and ready for use by ensuring it meets minimum expectations. Examples of validation include detecting missing data or highlighting outliers.
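
To make the transformation and validation examples above a bit more concrete, here is a minimal Python sketch using only the standard library. It drops extraneous CSV columns, detects missing values, and flags outliers with a simple z-score rule. The column names, the sample data, and the threshold of 2.0 are assumptions made up for illustration; they do not come from the webinar, and a real pipeline would likely use a dedicated validation tool.

```python
import csv
import io
import statistics


def drop_columns(rows, keep):
    """Transformation: keep only the named columns from each CSV row."""
    return [{name: row[name] for name in keep} for row in rows]


def find_missing(rows, column):
    """Validation: return indexes of rows whose value in `column` is empty."""
    return [i for i, row in enumerate(rows) if row[column].strip() == ""]


def find_outliers(values, threshold=2.0):
    """Validation: flag values more than `threshold` population standard
    deviations away from the mean (a deliberately simple rule)."""
    mean = statistics.mean(values)
    stdev = statistics.pstdev(values)
    if stdev == 0:
        return []
    return [v for v in values if abs(v - mean) / stdev > threshold]


# Hypothetical raw input: one extraneous column, one missing reading,
# and one suspiciously large reading.
RAW = """sensor_id,reading,debug_note
s1,10,ok
s2,11,ok
s3,,retry
s4,9,ok
s5,10,ok
s6,12,ok
s7,10,ok
s8,11,ok
s9,100,ok
"""

rows = list(csv.DictReader(io.StringIO(RAW)))
cleaned = drop_columns(rows, keep=["sensor_id", "reading"])
missing = find_missing(cleaned, "reading")
readings = [float(r["reading"]) for i, r in enumerate(cleaned) if i not in missing]
outliers = find_outliers(readings)

print("rows with missing readings:", missing)
print("outlier readings:", outliers)
```

The point of keeping each check as a small function is that each one can become its own pipeline step, so a failed validation stops bad data before it reaches experiments or training.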

Now let's look at some of the key requirements for DataOps pipelines to support machine learning in particular. In terms of ingestion, we've already mentioned the need to be able to combine several data sources as well as the ability to support both batch processing and streaming. In addition, ingestion should support unstructured data sources like object storage as well as any file type, for example, image, video, and audio files. Ingestion should also be able to efficiently handle very large volumes of data that are changing frequently, for example, Internet of Things sensor data or satellite imagery. In terms of transformation and validation, it's important to be able to scale processing quickly and efficiently. Through approaches like horizontal scaling and parallel processing, it's possible to take an ML job that would take days to process sequentially and reduce it to hours. This ability to quickly and efficiently process large data sets is key to enabling quick iterations by ML teams for experimentation and training. You don't want data preparation to be a bottleneck in the ML lifecycle. It's important to also build in flexibility in terms of the languages and tools your pipelines will support for transformation. Different teams will want to use a variety of ML tools, and it's important to enable these teams to have the freedom to use the best tool for their particular ML task. Collaboration is also an important aspect to consider. You'll want DataOps pipelines that make it easy for teams to work autonomously against common data sets but without disturbing each other's work and also without unnecessarily duplicating storage or processing of these shared data sets.
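
The parallel processing idea mentioned above can be sketched in a few lines of Python. This is only an illustration of the shape of the technique: independent items (here, file names) are handed to a pool of workers and the results are collected in order. A thread pool on one machine stands in for the horizontally scaled workers the talk describes, and `normalize_name` is a hypothetical per-file transformation invented for this example.

```python
from concurrent.futures import ThreadPoolExecutor


def normalize_name(filename):
    """Example per-file transformation: trim, lowercase, replace spaces."""
    return filename.strip().lower().replace(" ", "_")


def process_in_parallel(items, worker, max_workers=4):
    """Apply `worker` to each independent item concurrently.

    Executor.map returns results in the same order as the inputs,
    regardless of which worker finishes first.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(worker, items))


files = ["Scan 001.PNG", " scan 002.png", "SCAN 003.Png "]
print(process_in_parallel(files, normalize_name))
```

This only works cleanly because each file can be processed without looking at the others. That independence is what lets a job be split across many workers in the first place.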

Lastly, because data plays such an important role in the effectiveness of machine learning models, you'll want an effective way to debug data issues. For example, if a new training data set results in a model that's performing poorly, you'll want to be able to quickly understand the root cause from a data perspective. Also, if you're in a regulated industry like life sciences or financial services, you'll want to make sure that your data pipelines support your required data governance and audit obligations. Now let's take a look at how Pachyderm can help you build DataOps pipelines that will help accelerate your machine learning journey. Pachyderm provides two key capabilities to help you build your DataOps stack. These are data-driven pipelines and automated data versioning and lineage. Let's take a look at how our pipelines, working in concert with our versioning, can help meet many of the requirements we just discussed. Pachyderm pipelines allow you to automate your data prep tasks in flexible pipelines. These pipelines are completely code and framework agnostic, so you can use the best tools for your particular ML applications. Our capabilities are also highly scalable and are particularly optimized for large amounts of unstructured data. Everything in Pachyderm is just files, so we work with any type of data: images, audio, CSV, JSON data, you name it. And we can automatically parallelize your code to scale to billions of files.

Also, because Pachyderm understands versions and diffs of your data, through our versioning, we can offer some incredibly unique capabilities, such as incremental processing, where we only process diffs or changes to your data and can reduce processing time by an order of magnitude or more. Lastly, it's important that we keep track of all changes to your data through versioning, including metadata, artifacts, and metrics. So you have end-to-end reproducibility, which provides immutable data lineage. This significantly reduces the effort to debug issues and helps satisfy many data governance and audit requirements. Pachyderm Enterprise is our commercial product offering. Pachyderm is container-native, and we run on any Kubernetes flavor in any of the cloud managed Kubernetes services or even on-prem. This lets Pachyderm leverage the cloud provider's core compute, storage, and networking resources. And because we're cloud and infrastructure agnostic, you have the flexibility and can easily switch between clouds or have a hybrid or private cloud system. This gives you full portability for your entire DataOps system. And all of these infrastructure concerns are abstracted away from your core ML and data scientists, who, frankly, just want to focus on iterating on your data and code to get successful models out to production rapidly and reliably.
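
The incremental processing idea above, processing only what changed, can be illustrated with a small Python sketch. It fingerprints each file's content and reuses the previous result when the fingerprint is unchanged. This is a generic illustration under simple assumptions (files held in memory as a dict, SHA-256 as the fingerprint), not a description of how Pachyderm implements it internally.

```python
import hashlib


def fingerprint(data: bytes) -> str:
    """Content fingerprint used to decide whether a file has changed."""
    return hashlib.sha256(data).hexdigest()


def incremental_run(files, previous, process):
    """Process only new or changed files.

    files:    dict mapping path -> bytes for the current version of the data
    previous: dict mapping path -> (fingerprint, result) from the last run
    process:  function applied to the bytes of each new or changed file

    Returns (state, processed_paths), where `state` can be passed as
    `previous` to the next run.
    """
    state = {}
    processed = []
    for path, data in files.items():
        digest = fingerprint(data)
        cached = previous.get(path)
        if cached is not None and cached[0] == digest:
            state[path] = cached  # unchanged: reuse the earlier result
        else:
            state[path] = (digest, process(data))
            processed.append(path)
    return state, processed


# First run processes everything; the second run only touches the diff.
v1 = {"a.txt": b"alpha", "b.txt": b"beta"}
state, done = incremental_run(v1, {}, len)
print("first run processed:", done)

v2 = {"a.txt": b"alpha", "b.txt": b"beta v2", "c.txt": b"gamma"}
state, done = incremental_run(v2, state, len)
print("second run processed:", done)
```

The savings grow with the size of the data set: if one file out of a million changes, only that one file is reprocessed.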

Using Pachyderm for DataOps

Okay. Let's dive into a bit more detail on Pachyderm's key features for versioning, pipelines, and lineage and help illustrate how they satisfy many of the requirements around DataOps for ML that we discussed. First, Pachyderm's unique data versioning capabilities ensure that you capture a complete auditable record of everything, with minimal overhead. All data's versioned automatically in data repositories using a Git-like approach that supports concepts such as commits and branches. This enables that effective collaboration that we talked about on shared data sets, and multiple individuals on a team can experiment while not disturbing each other's work. It also makes it easy to identify and revert bad data changes in near real time. Secondly, Pachyderm supports petabytes of data while minimizing storage costs. We do this via a unique storage architecture that uses content-based deduplication to dramatically reduce storage requirements. Next, our file-based versioning provides first-class unstructured data support by being able to handle files of any type or size. Pachyderm also provides a complete audit trail for data across the entire ML lifecycle, not just data preparation but into experimentation, training, and deployment. In other words, Pachyderm has this capability built in from A to Z.
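
As a rough illustration of the concepts named above (commits, branches, and content-based deduplication), here is a toy in-memory repository in Python. The class name `TinyRepo` and all of its methods are hypothetical and exist only for this sketch; it makes no claim about Pachyderm's actual storage architecture or API.

```python
import hashlib


class TinyRepo:
    """Toy versioned data repo: content is stored once per unique digest,
    and each commit is a snapshot mapping paths to digests."""

    def __init__(self):
        self.objects = {}   # digest -> bytes, each unique content stored once
        self.commits = []   # list of {path: digest} snapshots
        self.branches = {}  # branch name -> index of its latest commit

    def commit(self, branch, files):
        """Add or update files on a branch and return the new commit index."""
        parent = self.branches.get(branch)
        snapshot = dict(self.commits[parent]) if parent is not None else {}
        for path, data in files.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects.setdefault(digest, data)  # deduplicate by content
            snapshot[path] = digest
        self.commits.append(snapshot)
        self.branches[branch] = len(self.commits) - 1
        return self.branches[branch]

    def branch(self, name, from_branch):
        """Create a branch pointing at the same commit; no data is copied."""
        self.branches[name] = self.branches[from_branch]

    def read(self, branch, path):
        """Read a file as it exists at the head of a branch."""
        snapshot = self.commits[self.branches[branch]]
        return self.objects[snapshot[path]]


repo = TinyRepo()
repo.commit("master", {"a.txt": b"hello", "copy_of_a.txt": b"hello"})
repo.branch("experiment", "master")
repo.commit("experiment", {"a.txt": b"hello, changed"})

print(repo.read("master", "a.txt"))      # master is untouched by the experiment
print(repo.read("experiment", "a.txt"))
print(len(repo.objects), "unique objects stored")
```

Two people can work on separate branches of the same data set this way without copying it and without disturbing each other, because old commits are never modified.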

Built-In Data Versioning

Lastly, our versions are stored as native objects, not metadata pointers. This means that data objects that are committed are not just metadata references to where the data is but actual version objects or files stored so that nothing is ever lost. Now, let's take a look at our pipelines. Pachyderm pipelines transform ad hoc processes into integrated and reliable workflows. Pachyderm pipelines are responsible for reading data from a specified source or input repository, executing code against that data, and then writing the results to an output data repo. These Pachyderm pipelines, as I mentioned, are completely language and framework agnostic, and our Kubernetes native approach means that we can run literally any code that can be put in a container. This lets data scientists and ML engineers choose whatever library works best for each pipeline step and makes transformation and validation of raw data much easier. Secondly, Pachyderm can auto-scale data processing without any additional code being written. Pachyderm pipelines can be configured to automatically parallelize based on the data and file structure in the repos. And this significantly increases processing speed without having to know sophisticated parallel programming approaches.
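
To give a feel for what "input repo, code in a container, output repo" looks like in practice, the Python sketch below builds a pipeline definition as a dictionary and prints it as JSON. The overall shape (a `pipeline` name, a `transform` with a container image and command, and a `pfs` input with a repo and a glob pattern) follows Pachyderm's publicly documented pipeline specification, but the exact fields should be checked against the documentation for your version. The pipeline name, repo name, image, and script path are all hypothetical.

```python
import json


def make_pipeline_spec(name, input_repo, image, cmd, glob="/*"):
    """Build a pipeline definition as a plain dict.

    The glob pattern describes how input files are split into
    independently processable pieces; "/*" treats each top-level
    file or directory as its own unit of work.
    """
    return {
        "pipeline": {"name": name},
        "transform": {"image": image, "cmd": cmd},
        "input": {"pfs": {"repo": input_repo, "glob": glob}},
    }


# All values below are placeholders for illustration only.
spec = make_pipeline_spec(
    name="resize-images",
    input_repo="raw-images",
    image="example.com/team/resize:1.0",
    cmd=["python3", "/app/resize.py"],
)
print(json.dumps(spec, indent=2))
```

Because the transformation is just a container image plus a command, the code inside can be written in any language or framework, which is the "language and framework agnostic" point made above.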

Next, Pachyderm's pipelines are automated and fire whenever new data is detected, saving developer time and avoiding mistakes. Pachyderm's pipeline engine is literally watching for changes constantly in any of the input data repositories. And each time it detects new data being committed to those repositories, Pachyderm deploys a Docker container containing your code to process that new piece of data. Lastly, because Pachyderm's pipelines are working hand in hand with our versioning, the pipelines are smart enough to know when new or changed data has been committed versus duplicate data. Pachyderm can be configured to only process that new or changed data, which will significantly improve processing time and save on processing costs. Working together, Pachyderm's data versioning and pipelines then enable you to have immutable data lineage. This means that with Pachyderm, you're able to track every version of your data and code that produced a result and maintain compliance and complete reproducibility. You'll also be able to manage relationships between historical data states and provide a complete account of every step in the journey of your data, from preparation to deployment.

Immutable Lineage Builds Data Trust

Lastly, our global IDs make it easy for teams to track this lineage and be able to rewind a result all the way back to its raw input, including all analysis, parameters, code, and intermediate results. And it's important to emphasize that Pachyderm data lineage is immutable, enforced, and automatic, and this is key. Many solutions track lineage by just logging some metadata to a database, oftentimes in user code itself, but this requires everyone to use a system perfectly every time. And just like a chain of custody for evidence, even if you do it 95% of the time correctly, one break in the chain can completely ruin your reproducibility. Pachyderm's data lineage is fully enforced by the system for every step in the process. You literally can't run a Pachyderm process without lineage being recorded. And it's all tracked as a fundamental property of the data, completely behind the scenes, without ML teams needing to do anything themselves or in addition to their regular work.
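
The difference between enforced lineage and lineage logged by user code can be shown with a small Python sketch. In this toy `LineageStore` (a hypothetical class invented for this example, not Pachyderm's implementation), the only way to produce derived data is through `run`, which always records the step and its inputs. Nobody has to remember to log anything, so the chain cannot be broken.

```python
import hashlib


class LineageStore:
    """Toy store in which every derived result carries its provenance."""

    def __init__(self):
        self.data = {}     # id -> bytes
        self.records = {}  # id -> {"step": name or None, "inputs": [ids]}

    def put_raw(self, data: bytes) -> str:
        """Register raw input data; it has no parents."""
        data_id = hashlib.sha256(data).hexdigest()[:12]
        self.data[data_id] = data
        self.records[data_id] = {"step": None, "inputs": []}
        return data_id

    def run(self, step_name, func, input_ids):
        """Run a step. Recording lineage is part of running, not optional."""
        result = func(*[self.data[i] for i in input_ids])
        material = step_name.encode() + "".join(input_ids).encode() + result
        data_id = hashlib.sha256(material).hexdigest()[:12]
        self.data[data_id] = result
        self.records[data_id] = {"step": step_name, "inputs": list(input_ids)}
        return data_id

    def trace(self, data_id):
        """Walk back from a result to every ancestor, ending at raw inputs."""
        ancestors = []
        stack = [data_id]
        while stack:
            current = stack.pop()
            for parent in self.records[current]["inputs"]:
                if parent not in ancestors:
                    ancestors.append(parent)
                    stack.append(parent)
        return ancestors


store = LineageStore()
raw = store.put_raw(b"  Some Raw Text  ")
clean = store.run("strip", lambda b: b.strip(), [raw])
upper = store.run("uppercase", lambda b: b.upper(), [clean])

for ancestor in store.trace(upper):
    print(ancestor, store.records[ancestor]["step"])
```

Tracing the final result walks back through the intermediate result to the raw input, which is the "rewind a result all the way back" idea in miniature.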

So let's take a look at a few companies that have seen huge improvements in their machine learning operations by implementing Pachyderm. LogMeIn is a great example of a company that saw huge processing efficiencies through Pachyderm's scaling benefits. They decreased their processing time from seven days to just seven hours. Royal Bank of Canada is an excellent example of a highly regulated organization that is able to continue to innovate quickly while still meeting strict compliance requirements using Pachyderm's lineage features. And lastly, LivePerson is a good example of a tech company that's able to handle complex data processing and model training needs through Pachyderm's automation features. They've also reaped substantial benefits in terms of speeding up their team's agility by decreasing processing time from 10 hours to just 1 hour. And that's it for our presentation today. Thanks for your time, and I want to encourage you all to come visit us at, where you can find more product information, key studies, and sign up for a free trial. Thanks again, everyone.