If you work with machine learning operations in your business, there’s a very good chance you’ve heard this statistic thrown around: “80% of machine learning projects fail.” On its face, this can look rather bleak for teams just beginning to implement MLOps – but the 80% statistic doesn’t mean that every encounter with failure is entirely negative.
As your machine learning program matures, the issues you encounter along the way will change. It’s important to view these challenges as signs of growth and turn them into lessons for your team. It can be helpful to sort them into categories: organization and process, the data that feeds your model, or the model itself.
In this article, we’re going to examine what machine learning failures actually mean for an organization as well as the important information you can extrapolate from a failed process to actually improve your business operations going forward.
The Tip of the Iceberg: AI Last Mile Deployment Problem
It’s popular to talk about the “last mile” deployment issue in MLOps: the challenges you face once you have trained a model and validated its data, and data engineering agrees it’s ready for production. A disconnect still remains between the ML engineering space and what is needed for real-world implementation of AI- and ML-driven tools in production systems.
But what exactly do issues with last mile deployment look like and what do they mean for your day-to-day operations?
Joe Doliner, co-founder and CEO here at Pachyderm, commented on the issue in an article by VentureBeat, saying the approximately 80% of ML models that fail are necessary to ensure that the other 20% successfully make it to production. It’s important to remember that it isn’t feasible to expect 100% of ML models to make it to production, and that the majority that “fail” still provide value.
So, why do machine learning projects fail? In some cases, the reasons are structural, such as engineers not being able to access the infrastructure or data scientists not being able to access the data itself.
The last mile is still a significant hurdle for many projects, and it is trendy to focus on right now – but it is not the challenge that is leading 80% of machine learning projects to fail, overall. When incorporating data-centric AI into your processes, there are a variety of different challenges and roadblocks to productionized AI operations.
Digging Deeper: Where the Machine Learning Workflow Falls Short
Most machine learning projects fail much earlier than the last mile, for a variety of reasons we’ll discuss below. The complexity of machine learning, including access to data sets across the organization, successful model training with representative data, and all of the data and software engineering involved in building a full machine learning workflow, means that breakdowns in the process are inevitable, especially in experimental stages.
That said, machine learning is maturing as a field, to the point that leaders can have the experience and resources to prepare for and overcome common roadblocks and challenges. Some are inevitable; others are growing pains. On top of the technical complexity of ML engineering, machine learning adds organizational complexity. And while cross-functional collaboration is very valuable, it is another point at which these programs can fail.
The Stages of a Machine Learning Project
The Machine Learning Lifecycle is similar to the DevOps CI/CD cycle, with the additional element of managing, curating and orchestrating datasets.
- Problem Definition: Understanding the problem your machine learning model is meant to solve, and defining the jobs to be done within the scope of your project.
- Data Selection: Identifying data sources, and how the data needs to be standardized and transformed to be processed by your model.
- Model Development: Selecting and building the code that will be used to process your datasets.
- Model Training: Fitting your model to your datasets, evaluating it, and tuning its performance for your use case.
- Model Serving: Moving your model into production. This involves operating the model and monitoring its performance and its accessibility to stakeholders.
- Retraining & Refining: Few V1s survive the real world. Every model will encounter scenarios that require updates, adjustments, and refinement.
Whatever the issue, the majority of ML failures can be broken down into three categories:
1. Organizational Failures
Organizational failures refer to problems that arise within your systems and policies for your business and the projects you launch. The good news is these problems can usually be resolved as they have more to do with the people and planning of the project than the data that goes into your ML pipeline.
- Goal Drift: The root of many issues professionals run into when applying machine learning to a project is the lack of a clear purpose for it. Without an end goal in mind, your project will drift from idea to idea without gaining much traction along the way.
- Project Workload: When implementing ML tools into new projects, making software purchases is sometimes necessary. To avoid buying new software, however, some organizations will have engineers create tools themselves. While this can save money, it takes valuable time and energy away from the machine learning project itself.
- Complex Ownership: Because machine learning projects are managed by different teams at different stages, uncertainty about who owns a project at what stage can create confusion or apathy on the part of the project’s current owner.
- Politics: It’s sad but true – workplace politics can scuttle your machine learning initiative. A leadership change, department reorganization, or quarterly forecast could put your project on the back burner.
- Turnover: Data engineers and other machine learning experts are hard to come by, especially with machine learning adoption hitting more mainstream industries. Staffing can frequently play a role in project failure, especially with a time gap between hires.
2. Data Failures
Data failures can be far more difficult to resolve, especially when you are dealing with large datasets that involve many different contributors. The key to resolving issues with your data is to be able to make complex transformations and evolve them as you learn more along the way. In other words, be sure to utilize quality data version control tools.
Typically, data failures can stem from:
- Data Quantity: When working on data-centric AI applications, you may find yourself with internal datasets that are either too small or too uniform to be of any help to your project. When this happens, it often means you need synthetic or augmented data to use in tandem with your internal datasets, so that your model produces more useful results.
- Data Quality: Even if you have enough data, it’s crucial to make sure it is useful to your project at hand. This involves getting rid of large datasets that don’t work and labeling the datasets that do. While this may seem like a pain, this type of failure is actually useful as it teaches your team better data management techniques for the future.
- Access to Data: If you don’t have access to internal data at all, your machine learning projects will be extremely limited, which basically sets you up for failure. Without access to data, you don’t have any of the groundwork laid out for your ML model to grow.
- Weak Version Control: Version control methods that can become disconnected from the data they track can lead to partial or total context loss. This is seen especially with model registries and metadata stores.
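To make the version control point concrete, here is a minimal sketch of one way to keep a version record tied to the data it describes: compute a fingerprint from the dataset's content and store it with the model's metadata. This illustrates the idea only and is not how Pachyderm or any particular tool implements versioning. The function names, the record fields, and the sample rows are all hypothetical.

```python
import hashlib
import json


def dataset_fingerprint(rows):
    """Return a SHA-256 hex digest computed from the dataset's content."""
    digest = hashlib.sha256()
    for row in rows:
        # sort_keys gives a stable serialization regardless of key order
        digest.update(json.dumps(row, sort_keys=True).encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


def record_training_run(model_name, rows):
    """Build a (hypothetical) metadata record pinning a model to its data."""
    return {
        "model": model_name,
        "data_sha256": dataset_fingerprint(rows),
        "row_count": len(rows),
    }


# Illustrative data only: two versions that differ in a single label
rows_v1 = [{"id": 1, "label": "cat"}, {"id": 2, "label": "dog"}]
rows_v2 = [{"id": 1, "label": "cat"}, {"id": 2, "label": "bird"}]

run = record_training_run("classifier-v1", rows_v1)
print(run["data_sha256"] == dataset_fingerprint(rows_v1))  # True
print(run["data_sha256"] == dataset_fingerprint(rows_v2))  # False
```

Because the fingerprint is derived from the data rather than from a name or a timestamp, a record like this cannot silently drift away from the dataset it refers to. Note that this sketch treats row order as significant, so reordering the rows produces a different fingerprint.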
3. Model Failures
Model failures are among the more elusive issues that can occur within your ML pipelines, which is why it’s important to have all the information on your side as you try to resolve the problem. These problems stem from how your ML model processes or interprets unexpected datasets.
It’s important to note that model failures will typically lead you back to a problem with the data itself, in which case you should look more specifically at the quantity and quality of the datasets within your pipelines.
Here’s a bit more on the specific issues you can experience with your ML model:
- Poor Data-Model Fit: Before you choose a model for your project, it’s important to understand how the data is being processed with automated version control. This will capture your data in its input, intermediate, and output stages, so you can select the right model. Otherwise, you’ll be left with subpar processing that does not deliver results.
- Lack of Reproducibility: When the problem is with the model itself and not the data specifically, things can get a bit more challenging. Modern ML models are very complex, so any extra help with pinpointing the issue is a must, and total reproducibility is the leg up you need.
- Model Bias: Your model can return biased results due to biased training datasets, incorrect model assumptions, and other factors. Model bias is a complex issue that requires iterating on data and model elements to resolve and to build an equitable and fair machine learning solution. When bias is detected post-deployment, a model’s biased behavior can negatively impact its users and your company’s reputation.
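As a rough illustration of how bias might be caught before deployment, the sketch below compares a model's accuracy across groups and reports the largest gap. This is only one simple check among many possible fairness measures, and it assumes you have a group label for each example. The function names, group labels, and sample predictions are all hypothetical.

```python
from collections import defaultdict


def accuracy_by_group(y_true, y_pred, groups):
    """Return a dict mapping each group to the model's accuracy on it."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for truth, pred, group in zip(y_true, y_pred, groups):
        total[group] += 1
        if truth == pred:
            correct[group] += 1
    return {group: correct[group] / total[group] for group in total}


def largest_accuracy_gap(per_group):
    """Return the difference between the best and worst group accuracy."""
    values = list(per_group.values())
    return max(values) - min(values)


# Illustrative labels and predictions for two hypothetical groups
y_true = [1, 0, 1, 1, 0, 1, 0, 0]
y_pred = [1, 0, 1, 1, 0, 0, 1, 0]
groups = ["a", "a", "a", "a", "b", "b", "b", "b"]

per_group = accuracy_by_group(y_true, y_pred, groups)
print(per_group)                        # {'a': 1.0, 'b': 0.5}
print(largest_accuracy_gap(per_group))  # 0.5
```

A large gap like this one does not say why the model performs worse for one group, but it flags where to look first in the training data and model assumptions.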
Leveraging ML Tools and Data-Centric AI for Your Business Despite the Challenges
The negative connotation behind the word “failure,” especially in a widely circulated quote, can lead leaders and organizations to think MLOps is not a viable way forward when it comes to deploying machine learning in production. As in many highly experimental fields, however, these failures provide information your team can use to advance its overall machine learning strategy, contributing to your long-term success.
When it comes to moving past failures within your machine learning pipelines, understanding the cause and effect that contributed to these failures gets your team closer to long-term success. This is a frequent challenge in the early stages of developing machine learning programs, which can be expensive to launch into production.
A successful Machine Learning Operations initiative includes the ability to look at and understand project failures. Other issues that arise in the day-to-day practice of data science, and that can be seen as failure, are often indicators of a barrier to future success that your team can problem-solve and overcome.
If you’re ready to integrate these ideas into your organization, request a customized demo from our technical team, so you can see how Pachyderm will help you with your unique ML projects.