When AI Goes Wrong And How To Fix It Fast

Dan Jeffries

November 1, 2021

An astounding 87% of data science projects never go live.

Worldwide spending on AI hit $35.8 billion in 2019, and it’s expected to more than double to $79.2 billion by 2022. If 87% of that 2019 spending went to projects that never shipped, that’s more than $30 billion vaporized in a single year. If the rate holds, it means we’ll waste hundreds of billions of dollars in the next few years.

Whether it’s a result of bad data that fed a bad model or due to data scientists who can’t get access to their company’s data because of regulations, privacy rules or even politics, there are a hundred reasons AI apps never make it to production.

But the biggest reason AI projects fail is something much simpler — mistakes. And the biggest mistakes come when machine learning systems can’t deal with edge cases. Edge cases destroy you in production. For example, Google Photos made the horrifying error of labeling people of color as “gorillas.” They hadn’t trained the model with a diverse enough set of faces.

But sometimes those edge cases aren’t just PR disasters or massive money wastes. They’re life and death. IBM Watson for Oncology, which promised to revolutionize cancer treatment, got cancelled after the University of Texas MD Anderson Cancer Center spent $62 million, but Watson was still making terrible treatment recommendations. In one case, the system recommended that a patient with severe bleeding take a drug that would accelerate the bleeding. The data scientists behind the project apparently couldn’t get enough real patient data so they created synthetic data sets that just didn’t match the real world.

The Iceberg and the Perfect Storm

One late evening in 2018, Uber’s experimental autonomous car made a deadly mistake in Arizona. The SUV slammed into Elaine Herzberg while she tried to cross the road with her bike. She died in the hospital a few days later, making her the very first pedestrian ever killed by an autonomous vehicle.

Lidar, radar and cameras acted as the eyes, ears and nose of the smart machine. Its brain was the software running under the hood, the ADS, or Autonomous Driving System. The car saw her first as an unknown object, then a vehicle, then a bike. In other words, the car’s artificial senses all saw her, but the machine just didn’t know what to do.

The system didn’t expect a person to cross anywhere but a crosswalk, something even a rookie teenage driver wouldn’t assume. It’s just that the car had never seen someone jaywalking. It expected people to follow the rules, something people aren’t very good at much of the time.

Even worse, the car couldn’t slam on the brakes. Uber’s engineers deliberately disabled the ability of the ADS to make decisions in an emergency because they didn’t want it hitting the brakes abruptly every time it didn’t know what to do next. The problem wasn’t theoretical. Rival ride-sharing company Lyft found that its self-driving cars were jamming on the brakes when they got cut off by other drivers. The list of things that can go wrong for an autonomous vehicle “is almost infinite,” says Luc Vincent, who heads R&D for Lyft’s self-driving car unit.

“It’s a bit like a whack-a-mole. You solve one problem, another might emerge,” observed Applied Intuition CEO Qasar Younis in CNN’s excellent article on the endless difficulties faced by this “generational type technical challenge.”

To work in the chaos of the real world — with drivers honking and cutting you off, pedestrians crossing where they’re not supposed to in the dark and traffic signs that are covered in stickers — autonomous vehicle teams have had to develop some of the most advanced AI/ML pipelines in the world, in particular their ability to deal with edge cases.

We can mirror a lot of what they’ve done to make AI development safer, faster and stronger. But no one can eliminate edge cases entirely. The real world is a series of endless edge cases. But we can get a lot better at building a specialized team that can fix AI fast when it stumbles.

Building an AI Disaster Recovery Team

Let’s start with the two essential teams you’ll need to deal with AI anomalies as your data science projects flow into production. Without building these teams in advance, organizations almost guarantee they’ll turn into a statistic themselves, their projects crashing and burning.

The key teams are:

PR/Public-Response Team
AI Red Team

PR/Public-Response Team

The first team is the easiest to build out, and companies can get started on this right away. This is actionable intelligence for you. If you’re deploying AI in production, you should figure out who can fill this role right now and start training them to do it.

The PR/Public-Response team talks to the general public and customers when an AI inevitably makes a bad mistake. This group will need to spend time with engineers and data scientists to understand AI decision-making at a high level, with a special emphasis on the kinds of errors it makes versus the kinds of errors a human makes. They’ll develop effective ways to describe AI in simple, straightforward language to non-tech people.

They’ll need to understand damage control, have crisis management skills, and be adept at social psychology when engaging with people under stress. They’ll need emotional intelligence and the ability to understand different perspectives and be capable of bridging gaps — which is a lot easier said than done.

Most importantly, they’ll need a protocol in place to deal with crisis situations, almost like a fire escape plan. With a fire escape plan, you know where all the stairs are ahead of time, who’s in charge when the alarm goes off, and who’s supposed to grab the important files when the smoke starts to fill the office. Nobody can figure out a good plan when the fire is already raging.

This team will need ready-to-go templates with generic responses that they can expand on to fit a specific crisis. You don’t want someone creating a first draft when people are already outside your offices with torches. If team members are making up answers on the fly, they’ve already lost the battle.

That brings us to something I touched on in my article AI’s Phoenix Project Moment: The AI Red Team.

Red Team, Go!

The AI Red Team is the hardest team to build and the most essential. You’ll want to fill this highly specialized group with your best and brightest creative and critical thinkers. You’ll need engineers, data scientists and programmers — all working together.

Think of the AI Red Team as the machine learning version of the network security “red team.” The idea of a red team is as old as the 11th century when the Vatican would appoint a Devil’s Advocate, whose job it was to discredit candidates for sainthood. Today, companies use red teams for everything from simulating the thinking of rival companies to stress-testing strategies and defending their networks against security threats.

The job of the AI Red Team is to think of everything that can and will go wrong with AI models. They’re in charge of Murphy’s Law for machine learning. The team has three major jobs:

Triage short-term problems
Find solutions for long-term problems, such as drift and hidden bias
Build unit tests and design end-to-end machine learning pipelines that make sure every model passes those tests on the way to production

The team’s first job is triage. They need to stop the bleeding when things go wrong. In the case of Google’s PR disaster, its engineers made it so Google Photos couldn’t label anything as a “gorilla.” They received a lot of flak in the media for this, but it’s actually a very effective and reasonable stop-gap solution.
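
A stop-gap of this kind can be as small as a filter on the model’s output. The sketch below is only an illustration of the idea, not Google’s actual fix. The names `BLOCKED_LABELS` and `filter_predictions`, and the (label, score) prediction format, are assumptions made for the example.

```python
# Hypothetical triage filter: suppress known-bad labels until the model is retrained.
BLOCKED_LABELS = {"gorilla"}


def filter_predictions(predictions, blocked=BLOCKED_LABELS):
    """Drop any (label, score) pair whose label is on the blocklist."""
    return [
        (label, score)
        for label, score in predictions
        if label.lower() not in blocked
    ]


# Made-up model output for one photo.
raw = [("Gorilla", 0.91), ("person", 0.07), ("tree", 0.02)]
safe = filter_predictions(raw)
print(safe)
```

The filter hides the symptom without touching the model, which is exactly why it only counts as triage.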

The real problem is that Google didn’t go any further. The company needed to create a long-term solution to the problem, but it stopped at the triage solution. A successful AI Red Team won’t just stop at the quick fix. They’ll look for creative ways to fix the bigger problem. They can do this in a number of ways:

Expand the dataset, either by buying another one or creating one internally
Augment the dataset
Create a synthetic dataset (this can be tricky, as we saw with the Watson example above, but it can be done if a real dataset is used as a model to build the synthetic one)
Create a Generative Adversarial Network (GAN) to try to fool the model and make it stronger
Try different algorithms that work in concert with the first model as a coalition
Build a complementary rule-based system to augment the ML model

Let’s imagine a team goes with expanding the dataset. That means it would have to work with procurement teams to find a dataset out there that’s more representative and then test it out in house. It may even need the legal department involved because team members might want to have early access to the dataset for free to test it but they’ll need an agreement in place to make sure they don’t use it in production without paying.

This type of team will need to be incredibly comfortable working across business units to get things done, and it may even need a business unit liaison to keep the wheels turning smoothly while it focuses on engineering and creativity.

Expanding the dataset could also mean crowdsourcing new images with an incentive or gamified solution, outsourcing the work through Mechanical Turk, or just sending people out into the streets with cameras and a waiver on an iPad.

The team could also look to expand an imbalanced dataset with a GAN, as one model tries to create data to fool the other one.

If expanding the data set, generating a synthetic data set, or trying different algorithms all fail, the team may need to build a complementary system to a “black box” ML model, perhaps a simple rule-based system that offers its own score, which it can then combine with the black box system into a weighted score.
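
As a rough sketch of that last option, the snippet below blends a black-box model score with a hand-written rule score. Everything in it is hypothetical: the rules, the transaction fields, the 0.7 weight and the function names are placeholders chosen for illustration, and a real team would tune the weight against held-out data.

```python
def rule_score(transaction):
    """Hypothetical hand-written rules; returns a risk score between 0 and 1."""
    score = 0.0
    if transaction["amount"] > 10_000:
        score += 0.5
    if transaction["country"] != transaction["card_country"]:
        score += 0.5
    return min(score, 1.0)


def combined_score(model_score, rules_score, model_weight=0.7):
    """Weighted average of the black-box score and the rule-based score."""
    if not 0.0 <= model_weight <= 1.0:
        raise ValueError("model_weight must be between 0 and 1")
    return model_weight * model_score + (1.0 - model_weight) * rules_score


# The black box sees little risk (0.2), but both rules fire (1.0).
tx = {"amount": 12_500, "country": "FR", "card_country": "US"}
final = combined_score(model_score=0.2, rules_score=rule_score(tx))
print(round(final, 2))  # 0.44
```

Because the rule score is readable, it also gives the team something concrete to point to when the black box cannot explain itself.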

When an AI Red Team doesn’t have a crisis on its hands, team members will want to move to stress testing and look for potential problems before they happen.

Swiss Army Knives and MLOps

To do all these jobs right, the AI Red Team will need best-in-class tools. They can turn to traditional IT for inspiration because over the past few decades IT engineers have built up a war chest of wonderful tools to do their jobs:

Auditing and logging for forensic analysis
Continuous Integration/Continuous Delivery (CI/CD)
Agile and DevOps
Unit tests
Dev/QA/Staging/Prod escalation
Version control for data, code and the relationships between them
Snapshots

Let’s say you roll out a new interface to a website and the site starts crashing. Your IT engineers would have snapshotted the web server before rolling out the code, or they may have used a container with a blue/green deployment. After they find the errors, the system rolls back to a known good state, and the development team gets back to work on a new release.

AI engineering teams will need to do the same, snapshotting the model at every stage of its development, from testing ideas and training to production. They’ll need to easily roll backwards to a known good state if things go wrong or switch to a more promising branch of development when the other branches lead to bad results.
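
Here is a minimal sketch of the snapshot-and-rollback idea, assuming a toy in-memory store. `ModelRegistry` and its methods are hypothetical names for this example. A production setup would sit on a versioned data and model store instead.

```python
import copy


class ModelRegistry:
    """Toy in-memory snapshot store, for illustration only."""

    def __init__(self):
        self._snapshots = []  # each entry: (stage, params, known_good)

    def snapshot(self, stage, params, known_good=False):
        # Deep-copy so later edits to params cannot alter the stored snapshot.
        self._snapshots.append((stage, copy.deepcopy(params), known_good))
        return len(self._snapshots) - 1  # snapshot id

    def rollback(self):
        """Return the most recent snapshot that was marked as known good."""
        for stage, params, good in reversed(self._snapshots):
            if good:
                return stage, copy.deepcopy(params)
        raise LookupError("no known good snapshot to roll back to")


registry = ModelRegistry()
registry.snapshot("training", {"weights": [0.1, 0.2]}, known_good=True)
registry.snapshot("production", {"weights": [9.9, 9.9]})  # the release that broke
stage, params = registry.rollback()
print(stage, params)
```

The key design choice is that snapshots are immutable copies, so a rollback always returns the state exactly as it was when it was marked good.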

At each stage in the pipeline, the AI Red Team will also look for the problems that cause AIs to break down. They’ll look for classic breakdowns like “drift,” when a model slowly gets worse over time, its performance drifting from its original high-water mark of reliability.

We see drift in e-commerce models when buyers’ habits change over time. Maybe you loved Led Zeppelin in high school, but you’ve grown and now you like soft jazz and wine with dinner. The model needs to adapt to who you are now, but models often get stuck in a rigid understanding of who you were in the past.
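
One simple way to catch drift is to compare recent accuracy with the accuracy measured at launch. The sketch below assumes you can label a sample of recent predictions as right or wrong. The function name `has_drifted` and the 0.05 tolerance are placeholders, not a recommended threshold.

```python
from statistics import mean


def has_drifted(baseline_accuracy, recent_outcomes, tolerance=0.05):
    """Flag drift when recent accuracy falls more than `tolerance` below baseline.

    recent_outcomes is a list of 1 (correct prediction) and 0 (wrong prediction).
    """
    if not recent_outcomes:
        return False  # nothing to compare yet
    return baseline_accuracy - mean(recent_outcomes) > tolerance


baseline = 0.92  # accuracy measured at launch (made-up number)
last_week = [1, 1, 0, 1, 0, 1, 1, 0, 1, 0]  # 60% correct
print(has_drifted(baseline, last_week))  # True
```

A check like this can run on a schedule and page the Red Team long before customers notice the recommendations have gone stale.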

The team will also hunt down imbalanced datasets or datasets with corrupt data. An imbalanced dataset has way too many examples from a single class. This happens all the time. Think about trying to classify broken or damaged products as they roll off an assembly line. A manufacturer will naturally have lots of pictures of working products and only a small subset of broken products. Getting better data on damaged widgets or artificially expanding the dataset is where the Red Team gets to shine.
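
Two common first responses to an imbalanced dataset are class weighting and random oversampling. The sketch below shows both using only the standard library. The file names and the 95/5 split are invented for the example, and duplicating samples adds no new information, so it complements better data collection and does not replace it.

```python
import random
from collections import Counter


def class_weights(labels):
    """Inverse-frequency weights: rare classes get proportionally larger weights."""
    counts = Counter(labels)
    total = len(labels)
    return {cls: total / (len(counts) * n) for cls, n in counts.items()}


def oversample(samples, labels, seed=0):
    """Randomly duplicate minority samples until every class matches the largest."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out = list(zip(samples, labels))
    for cls, n in counts.items():
        pool = [s for s, lab in zip(samples, labels) if lab == cls]
        out.extend((rng.choice(pool), cls) for _ in range(target - n))
    return out


# 95 pictures of working widgets, 5 of broken ones.
labels = ["ok"] * 95 + ["broken"] * 5
samples = [f"img_{i}.png" for i in range(100)]

weights = class_weights(labels)
balanced = oversample(samples, labels)
print(weights["broken"])
print(Counter(label for _, label in balanced))
```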

Team members will want to build all their MLOps solutions on the back of strong data and pipeline management tools with iron-clad immutability, like Pachyderm, which acts as “Git for data.” They’ll want to stack that with emerging training standards like Kubeflow’s TFJob and layer it with explainability frameworks like Seldon’s Alibi, so they can interrogate the model for why it made the choices it made. They’ll want beautiful training visualization like CometML’s dashboards. They’ll need to make sure the platforms that they build are flexible and agnostic so data scientists can bring their own tools to the party. They may want to use PyTorch one day and an obscure but powerful speech recognition toolkit like Kaldi tomorrow. The infrastructure will need to handle it all with ease.

Data science projects are incredibly susceptible to the Butterfly Effect. Tiny variations in the input data, models or code can lead to radically different outcomes. If corrupt data flows into your model from a stream of telematics systems and cameras in your warehouse or your fleet of trucks, and that data then triggers a training job that breaks your model, you’ll need to know when and where that data came from, as well as the entire history of its transformation.
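
To make the idea of data lineage concrete, here is a small sketch that fingerprints data with SHA-256 and logs each transformation. It is not how Pachyderm or any particular tool works internally. The helper names, the record fields and the sample telematics row are all assumptions for illustration.

```python
import hashlib
from datetime import datetime, timezone


def fingerprint(data: bytes) -> str:
    """Content hash that identifies one exact version of the data."""
    return hashlib.sha256(data).hexdigest()


def record_step(lineage, step_name, input_bytes, output_bytes, source=None):
    """Append one transformation to the log, linking input and output hashes."""
    lineage.append({
        "step": step_name,
        "source": source,
        "recorded_at": datetime.now(timezone.utc).isoformat(),
        "input_sha256": fingerprint(input_bytes),
        "output_sha256": fingerprint(output_bytes),
    })


def trace(lineage, output_hash):
    """Walk backwards from an output hash through every step that produced it."""
    chain, current = [], output_hash
    for entry in reversed(lineage):
        if entry["output_sha256"] == current:
            chain.append(entry)
            current = entry["input_sha256"]
    return list(reversed(chain))


raw = b"truck_17,2021-10-30,speed=NaN\n"  # made-up corrupt row
cleaned = raw.replace(b"NaN", b"0")

lineage = []
record_step(lineage, "ingest", raw, raw, source="truck_17 telematics feed")
record_step(lineage, "clean", raw, cleaned)

history = trace(lineage, fingerprint(cleaned))
print([entry["step"] for entry in history])
```

Given the hash of the data that broke a training job, `trace` answers the two questions in the paragraph above: where the data came from and what happened to it along the way.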

Even the changes we don’t think about can come back to haunt us. Differences in pseudo-random number generators between libraries, changes in default switches between software dot releases, and tiny tweaks to hyperparameters can — and do — break down our ability to reproduce our experiments.
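
A first line of defense is to route all randomness through one explicit seed and to store that seed with the run’s configuration. The sketch below uses a stand-in for a training run. The `run_experiment` function and the config keys are hypothetical. Pinning the seed only guarantees repeatability within the same library and interpreter versions, which is why those need recording too.

```python
import random
import sys


def run_experiment(seed):
    """Stand-in for a training run: all randomness flows from one explicit seed."""
    rng = random.Random(seed)  # private generator, not the shared global one
    data = list(range(10))
    rng.shuffle(data)
    split = (data[:8], data[8:])  # train / validation split
    init_weights = [rng.uniform(-1, 1) for _ in range(3)]
    return split, init_weights


# Hypothetical run manifest: store everything needed to repeat the run.
config = {
    "seed": 42,
    "learning_rate": 0.01,
    "python_version": ".".join(map(str, sys.version_info[:3])),
}

first = run_experiment(config["seed"])
second = run_experiment(config["seed"])
print(first == second)  # True
```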

The Red Team will need to handle anything and everything that can go wrong. And when it does, it will need to draw from another key concept in software development: the unit test.

Test, Test and Test Again

It’s not enough to just test the accuracy of your models. Who cares if your Convolutional Neural Net (CNN) can detect stop signs with 99% accuracy when a few stickers or a little graffiti makes them see 45 MPH signs instead?

IBM and MIT’s ObjectNet database showed the CNNs that scored 97% or 98% on classic ImageNet competitions dropping to a miserable 55% when presented with images that had objects hidden, obscured, rotated and inverted. Just like in the real world.
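
A robustness test compares accuracy on clean inputs with accuracy on deliberately perturbed ones. The sketch below uses a deliberately brittle toy classifier and a crude “sticker” occlusion so the gap is easy to see. The model, the perturbation and the data are all invented for the example and stand in for a real CNN and a real perturbation suite.

```python
def toy_model(pixels):
    """Deliberately brittle stand-in for a real classifier."""
    return "day" if sum(pixels) / len(pixels) > 0.5 else "night"


def occlude(pixels, n=3):
    """Simulate a sticker: black out the first n pixels."""
    return [0.0] * n + pixels[n:]


def accuracy(model, examples, transform=lambda p: p):
    """Share of examples the model labels correctly after applying transform."""
    hits = sum(model(transform(pixels)) == label for pixels, label in examples)
    return hits / len(examples)


examples = [
    ([0.6] * 10, "day"),
    ([0.9] * 10, "day"),
    ([0.2] * 10, "night"),
    ([0.1] * 10, "night"),
]

clean = accuracy(toy_model, examples)
occluded = accuracy(toy_model, examples, occlude)
print(clean, occluded)  # 1.0 0.75
```

A pipeline can refuse to promote any model whose gap between the two numbers exceeds an agreed limit, no matter how good the clean score looks.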

Uber’s team would have had to deal with dozens of edge cases to fix the problems that resulted in Herzberg’s death.

The team would need more data on people doing what they do, breaking the rules and crossing the road wherever they feel like crossing. Their cars would need to get much better at seeing in the dark because Herzberg crossed a two-lane highway late at night, and the car just couldn’t see her clearly. When their visual detection models failed to understand what they were seeing, they’d need to do a much better job of dealing with “unknown objects.” Not all unknown objects are created equal.

Maybe the worst edge case was when the CNN visual recognition system passed off an identified object to another machine learning model. Once the car figured out what it was looking at, it passed those identified objects to an LSTM, or Long Short-Term Memory, system, which is great for tracking trajectory.

But there were two major problems.

The first was that unknown objects didn’t get passed off at all for tracking. At one level it made sense. You don’t want to track every plastic bag blowing in the wind and every falling leaf. But on the other hand it was a disaster because every unknown object looked the same to the car. None of them were worth tracking, even if they were a major obstruction like a fallen tree or a person.

The second problem was worse. The CNN kept flip-flopping on what it was seeing, first seeing Herzberg as an unknown object, then a car, then a bike. Every time the car re-identified her, the LSTM lost all its history and started tracking Herzberg from scratch, as if she’d suddenly popped out of thin air.

Those are edge cases with real world consequences in action. An AI Red Team must stop them before it’s too late. When it does, its AI pipeline will need an automatic unit test to make sure the model can deal with unknown objects and CNN failure states, and its LSTM will need to understand that mislabeled objects still have the same history. When your models retrain on new data or the algorithms get updated, the Red Team is there to make sure a ghost of the past doesn’t come back with a vengeance to haunt the project in the present.
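
A regression test for the relabeling failure might look like the sketch below. It is a simplified illustration, not a description of Uber’s software. The `Track` class is hypothetical and keys history by a track id, so a changing label, including “unknown,” never resets the trajectory.

```python
class Track:
    """Hypothetical tracked object: history is keyed by track id, not by label."""

    def __init__(self, track_id):
        self.track_id = track_id
        self.label = "unknown"
        self.positions = []

    def update(self, label, position):
        # A new label refines the classification. It must not erase the path so far.
        self.label = label
        self.positions.append(position)

    def velocity(self):
        """Average displacement per frame, or None until two positions exist."""
        if len(self.positions) < 2:
            return None
        (x0, y0), (x1, y1) = self.positions[0], self.positions[-1]
        frames = len(self.positions) - 1
        return ((x1 - x0) / frames, (y1 - y0) / frames)


def test_relabel_keeps_history():
    track = Track(track_id=1)
    detections = [("unknown", (0, 10)), ("vehicle", (1, 10)), ("bicycle", (2, 10))]
    for label, position in detections:
        track.update(label, position)
    assert track.label == "bicycle"
    assert len(track.positions) == 3  # history survived both relabels
    assert track.velocity() == (1.0, 0.0)  # the trajectory is still computable


test_relabel_keeps_history()
```

Once a test like this lives in the pipeline, any retrained model or refactored tracker that reintroduces the reset fails before it reaches production.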

Remember, just because you fixed a problem once doesn’t mean it won’t come back to trouble you later.

The Trust Machine

With AI making more and more decisions in our lives, how do we know we can trust those decisions? There’s only one answer: We have to build that trust into every stage of the development, creation and deployment of these powerful machines.

Whether you’re just starting out in ML or your organization is highly advanced, you’ll eventually hit a snag as your model comes up against the real world, just as NASA’s rockets do when they go from paper to the real-life hostility of heat and dust in space.

AI systems are still fragile and experimental in many ways. They’ll only get more complex over the coming decades, which will result in an increase in edge cases, quirks and mistakes. Today’s most advanced systems, like AlphaGo, are a collection of three or four algorithms, and tomorrow’s state of the art might have dozens all working in concert, each with their own error rates and problems.

AIs will be involved in hiring and firing people, evaluating people’s work, trading, making medical decisions, discovering new materials and compounds, in education and teaching, not to mention war and surveillance and emergency response systems that make life-and-death decisions every single second of every day. They already are, and that will only expand in the coming decade.

Companies will face regulatory scrutiny — and angry customers and an angry public — when algorithms screw up. With lives, money and livelihoods on the line, companies will need good answers to questions and teams in place to rapidly resolve problems, or else they’ll face serious financial and legal consequences.

If you’re “lucky” it’s just a public relations disaster. But it could be worse. A lot worse. An algorithmic mistake could mean crashing revenue and stock prices or even an ELE, an Extinction Level Event, for your company.

You can’t wait until the fire is raging to start planning.

Trying to build these teams and deploy this suite of tools on the fly will fail for 99% of the companies out there. Not everyone has the ultra-deep pockets of Google or Apple, which allows them to cobble together a response to disasters on the fly. Smaller organizations will get crushed as customers and regulators show up at their door with pitchforks demanding answers as to why they got denied a loan, or their son ended up in jail, or their wife got a 10x lower credit limit, or why a fatal accident cost them a favorite daughter.

When that happens, it’s already too late.

Start now.

Because AI is already here. And its decision-making will either revolutionize your business or cost you everything.
