Batch vs Streaming Data for Machine Learning Pipelines

In the early days of machine learning, there was a distinct line between batch processing and streaming processing. Generally, pipelines and architecture were built for one and could not interact with the other. With modern platforms and programming approaches, it’s becoming more and more difficult to maintain that distinction between the two.

Why? New approaches to data, data architecture, and the introduction of micro-batching and microservices. Knowing when your use case calls for batch or stream processing helps you plan projects, tooling, and storage, which is why it’s essential to understand the difference between the two.

This article will discuss the differences between batch and streaming data, data batch processing vs. data stream processing, and why you should include each in your processes. So, where is the line between batch and streaming, and how can you leverage this information when building your own ML pipelines?

Let’s dive into…

Data Batch Processing: What it Is & When to Use It

Batch processing is an asynchronous process: data accumulates in a storage repository until a certain condition is met, and then that data is processed through a pipeline and delivered to its endpoint. This type of data typically does not arrive in real time, and it does not need to be processed in real time either. Instead, pipelines process the accumulated data as a batch.
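The accumulate-then-process pattern can be sketched in a few lines of Python. This is a minimal illustration, not any particular platform's API: the `BatchAccumulator` class, its `batch_size` trigger, and the `process` callback are hypothetical names, and a record count is only one possible condition (a schedule or a data-volume threshold would work the same way).

```python
from typing import Callable, List


class BatchAccumulator:
    """Holds records until a size threshold is met, then processes them together."""

    def __init__(self, batch_size: int, process: Callable[[List[dict]], None]):
        self.batch_size = batch_size  # hypothetical trigger condition
        self.process = process        # hypothetical pipeline entry point
        self.pending: List[dict] = []

    def add(self, record: dict) -> None:
        # Records only accumulate here; nothing is processed on arrival.
        self.pending.append(record)
        if len(self.pending) >= self.batch_size:
            self.flush()

    def flush(self) -> None:
        # Hand the accumulated records to the pipeline as one batch.
        if self.pending:
            batch, self.pending = self.pending, []
            self.process(batch)


processed_batches: List[List[dict]] = []
accumulator = BatchAccumulator(batch_size=3, process=processed_batches.append)

for i in range(7):
    accumulator.add({"id": i})
accumulator.flush()  # process whatever is left over

batch_sizes = [len(batch) for batch in processed_batches]
```

Here seven records end up in three batches: two full batches of three, plus a final partial batch that is flushed explicitly.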

A newer approach to batch processing takes advantage of the recent microservices trend. Batch processing jobs are scheduled closer together in micro-batches, running every few minutes or seconds. This delivers more data results at a higher frequency than users could previously expect access to, which makes it especially popular in customer-facing applications.

Batch systems run at set intervals: typically every hour, every half hour, every 10 minutes, and so on. If you experience a delay when running a query in an app or a dashboard, it is often because a batch process is being run to produce your result.

So, what does this look like in action?

Picture a fraud detection model – in digital commerce, new threats can spread fast, so your detection systems need to stay up-to-date.

To stay ahead of the bad guys and protect your customers, you have a machine learning fraud detection system that checks recent transactions for suspicious activity every 30 minutes during business hours, and the results of those flags are saved in your structured data store.

Your model is updated every night when most of the day’s business is done: a batch of the last 24 hours of transactions and known fraudulent activity data is fed through the model for retraining.

With this method, you’ll have upgraded fraud detection every morning. The cycle then restarts: flagged and known fraudulent transactions are collected throughout the day, and they retrain the detection model again that night.
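To make the nightly step concrete, here is a rough sketch of how the retraining batch might be assembled. Everything in it is assumed for illustration: the `build_retraining_batch` function, the `id` and `timestamp` fields, and the `known_fraud_ids` set are hypothetical, and the actual retraining call is left out because it depends on your model and framework.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Set


def build_retraining_batch(
    transactions: List[Dict],
    known_fraud_ids: Set[int],
    now: datetime,
) -> List[Dict]:
    """Select the last 24 hours of transactions and attach fraud labels."""
    cutoff = now - timedelta(hours=24)
    batch = []
    for tx in transactions:
        # Keep only transactions inside the 24-hour window ending at `now`.
        if cutoff <= tx["timestamp"] < now:
            batch.append({**tx, "is_fraud": tx["id"] in known_fraud_ids})
    return batch


# Hypothetical data: the job runs at midnight on January 2.
run_time = datetime(2024, 1, 2, 0, 0)
transactions = [
    {"id": 1, "timestamp": datetime(2024, 1, 1, 9, 0), "amount": 25.0},
    {"id": 2, "timestamp": datetime(2024, 1, 1, 15, 30), "amount": 980.0},
    {"id": 3, "timestamp": datetime(2023, 12, 31, 22, 0), "amount": 12.5},
]
retraining_batch = build_retraining_batch(transactions, {2}, run_time)
```

Transaction 3 falls outside the window and is dropped, while transaction 2 is labeled as fraud. The resulting list is what would be handed to the retraining job.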

Streaming Data Processing: What it Is & When to Use It

If your data is processed as a constant flow as it is ingested into your data pipelines, it’s known as streaming data. In general, streaming data is highly perishable: its value drops quickly after it arrives. Processing it as it arrives is also referred to as “always on” or real-time processing.

Streaming data is usually moment-to-moment insight like financial market conditions: a firehose of data where a small slice will be highly relevant to your needs, and the rest is extraneous and doesn’t need to be saved. This data is usually delivered to a pipeline with a tool like Apache Kafka or RabbitMQ. These tools are not meant to be your long-term data store (Kafka, for example, retains messages only for a configurable retention period); they direct some of the flow your way.

When building a streaming data pipeline, your processing is highly selective: only certain data is taken from the stream and processed by your model to serve predictions, reporting, or other outputs. The ephemeral nature of this data underlines the importance of your ML stack, especially when things start to go wrong.
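A minimal sketch of that selectivity, using a plain Python generator as a stand-in for the stream. The `simulated_stream` and `select_relevant` functions and the `symbol` and `price` fields are hypothetical; in a real pipeline the events would come from a message broker consumer rather than an in-memory generator.

```python
from typing import Dict, Iterable, Iterator, Set


def simulated_stream() -> Iterator[Dict]:
    """Stand-in for a live feed; yields one event at a time."""
    yield {"symbol": "AAA", "price": 10.0}
    yield {"symbol": "BBB", "price": 55.2}
    yield {"symbol": "AAA", "price": 10.4}
    yield {"symbol": "CCC", "price": 7.9}


def select_relevant(stream: Iterable[Dict], symbols: Set[str]) -> Iterator[Dict]:
    """Pass through only the events of interest; everything else is discarded."""
    for event in stream:
        if event["symbol"] in symbols:
            yield event


# Each selected event would be sent to the model as soon as it arrives.
selected = list(select_relevant(simulated_stream(), {"AAA"}))
```

Because both functions are generators, events are handled one at a time as they arrive, and the unselected events are never stored.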

Because this data’s value is so dependent on how quickly it can be transformed and processed by your model, service outages can have a major impact on the teams relying on it. Without comprehensive lineage of your processed data and model transformations, troubleshooting data and model issues can extend outages and cost organizations time and money.

Using Both Data Processing Methods in Your ML Pipelines

With the lines between batch and streaming data blurring thanks to micro-batching and microservices, there are a variety of effective approaches to achieving practical MLOps success. For example, you may process streaming data in production while building and updating your model in near real time with high-frequency micro-batches.

Instead of waiting for your batch systems to run every week or once a night, micro-batches can provide near real-time delivery experiences by processing in short one- to five-minute batches.
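One way to picture micro-batching is as a stream cut into short time windows, with each window processed as a small batch. The sketch below assumes events arrive in timestamp order as `(seconds, payload)` pairs; the `micro_batches` function and the 60-second window are hypothetical choices for illustration.

```python
from typing import Iterable, Iterator, List, Optional, Tuple


def micro_batches(
    events: Iterable[Tuple[float, str]], window_seconds: float
) -> Iterator[List[str]]:
    """Group time-ordered events into batches that each span one window."""
    batch: List[str] = []
    window_end: Optional[float] = None
    for timestamp, payload in events:
        # Close the current batch once an event falls past the window.
        if window_end is not None and timestamp >= window_end:
            yield batch
            batch = []
            window_end = None
        # A new window starts at the first event that lands in it.
        if window_end is None:
            window_end = timestamp + window_seconds
        batch.append(payload)
    if batch:
        yield batch  # emit the final, possibly partial, batch


events = [(0, "a"), (30, "b"), (59, "c"), (60, "d"), (200, "e")]
batches = list(micro_batches(events, window_seconds=60))
```

With a one-minute window, the first three events form one batch and the later events land in their own batches. Shrinking `window_seconds` moves the behavior closer to streaming, while growing it moves it closer to a traditional batch job.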

Start Strengthening Your Machine Learning Processes with Improved Data Processing

The concept of machine learning operations is built upon the principle of continuous integration and delivery from DevOps. For machine learning in production, this means focusing on reproducibility as a core measure of quality and stability, regardless of how complex your models become over time.

As your models become more complex, you will need to train them more frequently, and you will need better model and data lineage so that errors can be traced back to their source quickly. That’s where data version control plays a part.

Our team provides a version control layer within your data storage solutions, so you can have a full picture of how your data was gathered, accessed, processed, and more. This goes beyond data storage and logs, which means you can work within your code in new, easier ways.

If you’re ready to learn more about data pipelines and how Pachyderm can help within the confines of your business, reach out to our team to book a demo and learn all about our solutions today.