Completing the Machine Learning Loop

Pachyderm’s ability to version data and run pipelines at scale is foundational to bringing machine learning (ML) to software development. In a new blog post, Completing the Machine Learning Loop, we see why treating data as a first-class citizen in ML development is necessary to enable fast and reliable iteration on AI software. The two loops model presented in the post gives us a mental framework for applying MLOps and DevOps best practices to the two moving pieces of ML development: code and data. This understanding is essential for anyone working to deploy sustainable ML-based systems into real-world environments.

“One day all software will learn…” How many times have you heard that quote at some machine learning talk?

As an ML engineer and researcher over the last decade, I’ve heard it so many times it makes my eyes roll. When you’re applying machine learning to real-world problems, you have to separate the hype from reality, and it’s hard to imagine this utopian future when I struggle daily with seemingly trivial failure points in NLP and speech recognition models.

This may come as a shock to some, but machine learning development is not done when you achieve a good score on your test set. If anything, that gives you a good starting point for a production system; development really begins as soon as the model encounters real-world data. Whether it’s data drift, performance improvements, or even black swan events, you must monitor, diagnose, refine, and improve the model to keep up with an ever-changing world.

What I have wanted to know is how we practically accomplish this. How do we get to the point where all software learns? Well, we need a reliable way of incorporating new learnings into our machine learning models, and we need to move beyond waterfall-esque model development to rapid quantitative improvements.

Capability vs. Ability

The Data Science Process

Functional diagram of a machine learning system. Model development is typically an offline process which results in a trained model or inference pipeline to be incorporated into a production analytics system. Over time, data from the production system (typically a data lake) is pulled into the model development process to improve an analytic’s quality and/or performance. (Image by author)

Software Development: Two Life Cycles Diverge in the Woods

The Software Development Life Cycle (SDLC) is a useful construct to show the journey software must continually undergo. (Image by author)
  1. Version control: managing code in versions, tracking history, and rolling back to a previous version if needed
  2. CI/CD: automated testing on code changes, removing manual overhead
  3. Agile software development: short release cycles, incorporating feedback, and emphasizing team collaboration
  4. Continuous monitoring: visibility into the performance and health of the application, with alerts for undesired conditions
  5. Infrastructure as code: automated, dependable deployments
Testing and monitoring required in traditional software systems (Image adapted from: The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction)
  1. We start with the code repository, adding unit tests to ensure that the addition of a new feature doesn’t break functionality.
  2. Code is then integration tested. That’s where different software pieces are tested together to ensure the full system functions as expected.
  3. The system is then deployed and monitored, collecting information on user experience, resource utilization, errors, alerts and notifications that inform future development.
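The three steps above can be sketched in a few lines of Python. This is only an illustration: `normalize`, `count_tokens`, and `check_health` are hypothetical names, and the 5% error-rate threshold is an arbitrary assumption rather than a recommended value.

```python
from typing import Dict, List


def normalize(text: str) -> str:
    """Hypothetical component A: lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def count_tokens(text: str) -> int:
    """Hypothetical component B: count whitespace-separated tokens."""
    return len(text.split())


def test_normalize_unit() -> None:
    # Step 1, unit test: one function checked in isolation.
    assert normalize("  Hello   World ") == "hello world"
    assert normalize("") == ""


def test_pipeline_integration() -> None:
    # Step 2, integration test: the components checked together.
    assert count_tokens(normalize("  Hello   World ")) == 2


def check_health(status_codes: List[int], max_error_rate: float = 0.05) -> Dict[str, object]:
    """Step 3, monitoring: summarize server errors and flag an alert condition."""
    errors = sum(1 for code in status_codes if code >= 500)
    error_rate = errors / len(status_codes) if status_codes else 0.0
    # The threshold is an assumed value chosen only for this example.
    return {"error_rate": error_rate, "alert": error_rate > max_error_rate}


test_normalize_unit()
test_pipeline_integration()
report = check_health([200, 200, 500, 200])
```

In a real project the two test functions would typically be collected by a test runner in CI, while something like `check_health` would run continuously against the deployed system.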

The Two Loops

The Two Loops: A model of what machine learning software development encompasses. The Code Loop is crucial to develop the ML software for model stability and efficiency, while the Data Loop is essential to improving model quality and maintaining the model’s relevance. Creating ML models requires the Code Loop and Data Loop to interact at various stages, such as model training and monitoring. (Image by author)

Data: The new source code

  • Data is bigger in size and quantity
  • Data grows and changes frequently
  • Data has many types and forms (video, audio, text, tabular, etc.)
  • Data has privacy challenges, beyond managing secrets (GDPR, CCPA, HIPAA, etc.)
  • Data can exist in various stages (raw, processed, filtered, feature-form, model artifacts, metadata, etc.)

Data Bugs

Testing and monitoring required in machine learning systems (Image adapted from: The ML Test Score: A Rubric for ML Production Readiness and Technical Debt Reduction)

“Keeping the algorithm fixed and varying the training data was really, in my view, the most efficient way to build the system.” — Andrew Ng

MLOps: DevOps for Machine Learning

MLOps Principles and Key Considerations (updated/modified by author from Summary of MLOps Principles and Best Practices)

Practical MLOps: Pachyderm for data and code+data

Pachyderm architectural and user overview (Image by Pachyderm)
  • To handle changing data, Pachyderm treats data as git-like commits to versioned data repositories. These commits are immutable: committed data objects are not just metadata references to where the data is stored, but actual versioned objects/files, so nothing is ever lost. If our training data is overwritten with a new version in a new commit, the old version is always recoverable because it is saved in a previous commit.
  • As for types and formats, data is treated as opaque and general. Any type of file (audio, video, CSV, etc.) can be placed into storage and read by pipelines, which makes the approach general rather than focused on one kind of data, the way a database focuses only on certain kinds of well-structured data.
  • As for the size and quantity of data, Pachyderm uses a distributed object storage backend which can scale to essentially any size. In practice, it is based on object storage services such as Amazon S3 or Google Cloud Storage.
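The commit behavior described in the first bullet can be illustrated with a toy, in-memory repository. This is a sketch of the general idea of immutable, content-addressed commits over opaque bytes. It is not Pachyderm's API or implementation, and `ToyDataRepo`, `Commit`, `commit`, and `read` are hypothetical names invented for this example.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class Commit:
    """An immutable snapshot mapping file paths to content hashes."""
    commit_id: str
    parent: Optional[str]
    files: Tuple[Tuple[str, str], ...]  # (path, sha256 of content)


@dataclass
class ToyDataRepo:
    """Hypothetical in-memory stand-in for a versioned data repository."""
    objects: Dict[str, bytes] = field(default_factory=dict)   # hash -> raw bytes
    commits: Dict[str, Commit] = field(default_factory=dict)  # id -> commit
    head: Optional[str] = None

    def commit(self, changes: Dict[str, bytes]) -> str:
        """Record a new commit; earlier commits and objects are never modified."""
        files = dict(self.commits[self.head].files) if self.head else {}
        for path, data in changes.items():
            digest = hashlib.sha256(data).hexdigest()
            self.objects.setdefault(digest, data)  # data is stored as opaque bytes
            files[path] = digest
        snapshot = tuple(sorted(files.items()))
        commit_id = hashlib.sha256(repr((self.head, snapshot)).encode()).hexdigest()[:12]
        self.commits[commit_id] = Commit(commit_id, self.head, snapshot)
        self.head = commit_id
        return commit_id

    def read(self, commit_id: str, path: str) -> bytes:
        """Read a file exactly as it was in the given commit."""
        digest = dict(self.commits[commit_id].files)[path]
        return self.objects[digest]


repo = ToyDataRepo()
first = repo.commit({"train.csv": b"text,label\nhello,1\n"})
second = repo.commit({"train.csv": b"text,label\nhello,0\n"})  # overwrite the file

# The overwritten version is still recoverable from the earlier commit.
assert repo.read(first, "train.csv") == b"text,label\nhello,1\n"
assert repo.read(second, "train.csv") == b"text,label\nhello,0\n"
```

Because the toy repository only ever sees bytes, the same code would hold audio, video, or CSV files alike; a real system would keep the `objects` mapping in an object store rather than in memory.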
From blog post Pachyderm and the power of GitHub Actions: MLOps meets DevOps (Image by author)

There and Back Again