Stream processing will not be replacing batch processing anytime soon. You should definitely consider batch processing when large volumes of data need to be processed, the work is repetitive, and results are not needed in real time.
When deciding whether to use batch or stream processing, it can be helpful to review the differences between the two approaches. (Also read: What is the difference between batch and stream processing?)
- Batch processing refers to the processing and analysis of large data sets at a scheduled time.
- Stream processing refers to the processing and analysis of individual data items as they flow through a system.
With batch processing, users collect data over time and schedule it for processing when computing resources are available. This approach, which uses a scheduled "batch window" to process data, is useful for processing large amounts of data when latency is not an issue.
In contrast, stream processing handles data as soon as it's produced. This approach, which is often event-driven, is useful when delays in processing are unacceptable. (Read: The Advantages of Real-Time Analytics for Business.)
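The contrast between the two approaches can be sketched in a few lines of Python. This is only a minimal illustration, not a production design: the function names and the sample readings are hypothetical, and a plain list stands in for both the collected batch and the incoming stream.

```python
from typing import Iterable, Iterator


def process_batch(records: list[float]) -> float:
    """Batch style: all records are collected first, then processed in one run."""
    return sum(records)


def process_stream(records: Iterable[float]) -> Iterator[float]:
    """Stream style: emit an updated result as soon as each record arrives."""
    total = 0.0
    for value in records:
        total += value
        yield total  # a result is available after every single record


# Hypothetical sample data, used only for illustration.
readings = [3.0, 5.0, 2.0]

batch_total = process_batch(readings)           # one result, after the whole batch
stream_totals = list(process_stream(readings))  # one result per incoming record
```

The batch function produces a single answer once everything has been collected, while the streaming generator yields a running answer after each item, which is the trade-off between throughput on large data sets and low latency described above.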
It's important to note that neither batch nor stream processing is a “one-size-fits-all” answer for a project's data processing needs, because they serve different functions. In fact, the same company will often use both. A cloud service provider, for example, may use stream processing to collect user data but batch processing to manage customer billing cycles. That's because each approach has its own benefits and drawbacks.
Some of the benefits of batch processing to keep in mind:
- Batches can be scheduled to run on a regular basis, which frees you up to do other work.
- You can schedule batches during off-hours, which can be more cost-effective than processing large amounts of data during business hours.
- In terms of scope, batch processing allows queries over most, if not all, of the data in a data set. (Because of the real-time nature of stream processing, queries typically run on the most recent data record or a short window of recent records.)
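The difference in query scope can be illustrated with a short Python sketch. It assumes a small in-memory data set and a hypothetical `SlidingAverage` helper; real batch and streaming systems are far more elaborate, so treat this as an illustration of the idea only.

```python
from collections import deque
from statistics import median


def batch_median(dataset):
    """Batch query: needs the complete data set before it can answer."""
    return median(dataset)


class SlidingAverage:
    """Stream query: only sees the most recent `size` records (hypothetical helper)."""

    def __init__(self, size):
        # A deque with maxlen silently drops the oldest record when full.
        self.window = deque(maxlen=size)

    def update(self, value):
        self.window.append(value)
        return sum(self.window) / len(self.window)


# Hypothetical sample data, used only for illustration.
dataset = [4, 8, 6, 10, 2]

full_median = batch_median(dataset)         # computed over every record

avg = SlidingAverage(size=2)
recent = [avg.update(v) for v in dataset]   # each answer covers at most 2 records
```

A median over the full data set is a natural fit for a batch job because every record has to be available first, whereas the sliding average answers immediately but only reflects the latest records.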