Streaming vs. Batch: Why Micro-Batch Is Often the Right Answer

Somewhere along the way, "streaming" became synonymous with "always-on, real-time, millisecond latency." That framing does a lot of damage. It makes people think they have to choose: either run a consumer 24 hours a day, 7 days a week, paying for compute continuously, or stick with batch and accept hourly or daily freshness. That is a false choice, and it is costing people money.

The right frame is a spectrum. At one end: true event-driven streaming, processing each message within milliseconds of arrival. At the other: traditional batch, where you accumulate data for hours or days and process it in one go. In the middle: micro-batch, where you run small batch jobs frequently — every 5 minutes, every 15, every hour. Micro-batch is often the right answer, and it is underrated.

When True Streaming Is Actually Required

True always-on streaming is justified when the latency requirement is measured in seconds or less and the business consequence of delay is real. Fraud detection is the canonical example: you need to evaluate a transaction before it clears, not 15 minutes later. Real-time dashboards for operations centers are another case. So are live inventory systems, where a buyer and a seller are looking at the same item at the same moment.

For most analytics use cases, that latency requirement does not exist. Your data science team does not need last-second data. Your business intelligence reports do not need sub-minute freshness. Your machine learning feature store update can wait five minutes. If the requirement is "fresher than daily," micro-batch usually gets you there without the operational cost of always-on infrastructure.

What Micro-Batch Actually Means

Micro-batch is a scheduled batch job that runs frequently. You set a job to fire every 15 minutes. It reads all messages from your raw zone that arrived since the last run, deserializes them, applies transformations, and writes the results. Then it stops. The compute resource — the cluster, the VM, the function — only exists while the job is running.

Apache Spark's original streaming model (DStreams) was explicitly micro-batch: it collected messages into small RDDs on a configurable interval and processed each batch as a mini-Spark job. You could set the batch interval to 30 seconds, 5 minutes, or an hour depending on your latency tolerance. The model was familiar to batch Spark users and the tradeoffs were explicit.

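# Conceptual micro-batch pattern (pre-Structured Streaming)
# Run this script on a cron: */15 * * * * python process_batch.py
# raw_store, structured_store, checkpoint_store, and deserialize are
# placeholders for your own storage clients and parsing logic.

import time

def get_unprocessed_raw_records(since_ms, until_ms):
    # Read raw records that landed after the last run, up to this run's start
    return raw_store.query(
        f"arrived_at_ms > {since_ms} AND arrived_at_ms <= {until_ms}"
    )

def process_batch(records):
    for record in records:
        parsed = deserialize(record['raw_payload'])
        # Upsert, so rerunning a window after a failure does not duplicate rows
        structured_store.upsert(parsed)

# Fix the window's upper bound before reading, so records that land while
# the job is running are picked up by the next run instead of being skipped.
run_started_ms = int(time.time() * 1000)
last_run_ms = checkpoint_store.get_last_run_timestamp()
records = get_unprocessed_raw_records(last_run_ms, run_started_ms)
process_batch(records)
checkpoint_store.update_last_run_timestamp(run_started_ms)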
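
The pattern above leaves the stores abstract. Here is a minimal, self-contained sketch of the same loop that you can run locally. Everything in it is a hypothetical stand-in: InMemoryRawStore, run_micro_batch, the dict-based checkpoint, and the "id" key used for the upsert are assumptions made for illustration, not part of any real library. It passes the clock in as an argument so that two consecutive runs can be simulated without waiting.

```python
import json


class InMemoryRawStore:
    """Hypothetical stand-in for the raw zone: a list of landed records."""

    def __init__(self):
        self.records = []

    def land(self, arrived_at_ms, payload):
        self.records.append(
            {"arrived_at_ms": arrived_at_ms, "raw_payload": json.dumps(payload)}
        )

    def query_window(self, since_ms, until_ms):
        # Half-open window (since_ms, until_ms]: no record is read twice
        return [r for r in self.records if since_ms < r["arrived_at_ms"] <= until_ms]


def run_micro_batch(raw_store, structured, checkpoint, now_ms):
    """Process one window, then advance the checkpoint to the window's end."""
    since_ms = checkpoint["last_run_ms"]
    for record in raw_store.query_window(since_ms, now_ms):
        parsed = json.loads(record["raw_payload"])
        structured[parsed["id"]] = parsed  # upsert keyed on the assumed "id" field
    checkpoint["last_run_ms"] = now_ms


raw = InMemoryRawStore()
structured, checkpoint = {}, {"last_run_ms": 0}

# First window: two records land, then the job fires
raw.land(100, {"id": "a", "qty": 1})
raw.land(200, {"id": "b", "qty": 5})
run_micro_batch(raw, structured, checkpoint, now_ms=900)

# Second window: an update to "a" and a new record "c"
raw.land(950, {"id": "a", "qty": 2})
raw.land(1000, {"id": "c", "qty": 7})
run_micro_batch(raw, structured, checkpoint, now_ms=1800)

print(structured)
```

The second run only reads the two records that landed after the first checkpoint, and the upsert replaces the earlier value for "a" rather than adding a second row.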

The Cost Math

Consider an always-on consumer cluster: 2 worker nodes running 24/7 to process a Kafka topic that generates 100,000 messages per hour. Your compute cost is 24 hours times the number of nodes times whatever the hourly rate is, every day, whether the topic has traffic or not.

Now consider a micro-batch job: it runs for 3 minutes every 15 minutes. The compute exists for 3 out of every 15 minutes, or 20% of the time. At the same hourly rate, and assuming you are billed only while the job runs, your compute cost drops by 80%. In exchange, your data is up to about 15 minutes old (closer to 18 in the worst case, once you count the job's own run time) instead of sub-second. If that freshness meets your SLA, the 80% reduction is yours with no architectural sacrifice.
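
To plug in your own numbers, the arithmetic fits in a few lines of Python. This is a rough sketch, not a pricing model: the $0.50 per node-hour rate is a made-up placeholder, and it assumes the same node count in both cases, billing only while the job runs, and no charge for cluster startup.

```python
def duty_cycle(run_minutes, interval_minutes):
    """Fraction of wall-clock time during which compute is provisioned."""
    return run_minutes / interval_minutes


def daily_cost(hourly_rate, nodes, fraction_running=1.0):
    """Compute cost per day for a cluster that runs a fraction of the time."""
    return hourly_rate * nodes * 24 * fraction_running


HOURLY_RATE = 0.50  # hypothetical price per node-hour; substitute your own
NODES = 2

always_on = daily_cost(HOURLY_RATE, NODES)
micro_batch = daily_cost(HOURLY_RATE, NODES, duty_cycle(3, 15))
savings = 1 - micro_batch / always_on

# Worst case: a record lands just after a run starts, waits a full interval,
# and is only written when the next run finishes.
worst_case_staleness_min = 15 + 3

print(f"always-on:   ${always_on:.2f}/day")
print(f"micro-batch: ${micro_batch:.2f}/day")
print(f"savings:     {savings:.0%}")
print(f"worst-case staleness: {worst_case_staleness_min} minutes")
```

The savings percentage depends only on the duty cycle, so it holds at any hourly rate as long as the other assumptions do.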

Cloud vendors love always-on consumers because they love consistent revenue. The business case for micro-batch is not complicated: pay for what you use, when you use it.

When Micro-Batch Breaks Down

Micro-batch fails when the batch cannot keep up with the arrival rate. If your topic receives 10 million messages per 15-minute window and your batch job takes 20 minutes to process them, you will fall behind. Each run starts further behind than the last. Eventually you are processing 30-minute-old data on a 15-minute schedule, which is a slow disaster.
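
You can watch that spiral develop with a small simulation. The sketch below is a simplified model under stated assumptions: processing time scales linearly with the amount of data, runs never overlap, an overrunning job is followed immediately by the next one, and each run picks up everything that arrived since the previous run started. The function name simulate_backlog is made up for this example.

```python
def simulate_backlog(interval_min, job_min_per_interval, runs):
    """Simulate serialized micro-batch runs.

    Returns a list of (start_min, window_min, duration_min) tuples, where
    window_min is how many minutes of arrivals the run has to process.
    """
    # Minutes of processing needed per minute of arriving data
    ratio = job_min_per_interval / interval_min
    history = []
    prev_start, start = 0.0, float(interval_min)
    for _ in range(runs):
        window = start - prev_start
        duration = window * ratio
        history.append((start, window, duration))
        prev_start = start
        # Next run starts on schedule, or as soon as this one ends if it overran
        start = start + max(interval_min, duration)
    return history


# The scenario from the text: 15-minute schedule, 20 minutes to process one window
falling_behind = simulate_backlog(interval_min=15, job_min_per_interval=20, runs=4)
# A healthy job: 3 minutes to process one 15-minute window
healthy = simulate_backlog(interval_min=15, job_min_per_interval=3, runs=4)

for start, window, duration in falling_behind:
    print(f"start={start:6.1f} min  window={window:5.1f} min  job={duration:5.1f} min")
```

In the falling-behind case each window is longer than the one before it, so each job takes longer too. In the healthy case the window stays at 15 minutes and the job stays at about 3.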

The fix is not always to switch to true streaming. Often it is to optimize the batch job (better serialization, more parallelism, better merge logic) or to extend the batch window (run every hour instead of every 15 minutes). A longer window gives the job more headroom when fixed per-run costs such as cluster startup eat a large share of each run; it does not help if raw processing is slower than the arrival rate. True streaming solves the latency problem; it does not solve the throughput problem.
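
One way to tell which fix applies is to split the job's run time into a fixed part and a part that grows with the data. The sketch below assumes a deliberately simple model: run time equals a fixed startup cost plus a constant number of processing minutes per minute of arriving data. The function names and the sample numbers (4 minutes of startup, 0.8 processing minutes per data minute) are illustrative assumptions, not measurements.

```python
def job_minutes(interval_min, startup_min, process_min_per_data_min):
    """Estimated run time for one window under the linear model."""
    return startup_min + interval_min * process_min_per_data_min


def keeps_up(interval_min, startup_min, process_min_per_data_min):
    """True if a run finishes before the next one is due."""
    return job_minutes(interval_min, startup_min, process_min_per_data_min) <= interval_min


def min_sustainable_interval(startup_min, process_min_per_data_min):
    """Smallest interval the job can sustain, or None if no interval works."""
    if process_min_per_data_min >= 1:
        # Processing is slower than arrival: a longer window only grows the backlog
        return None
    return startup_min / (1 - process_min_per_data_min)


# Overhead-bound job: extending the window helps
print(keeps_up(15, startup_min=4, process_min_per_data_min=0.8))  # 16 min of work per 15
print(keeps_up(60, startup_min=4, process_min_per_data_min=0.8))  # 52 min of work per 60
print(min_sustainable_interval(startup_min=4, process_min_per_data_min=0.8))

# Throughput-bound job (20 minutes per 15-minute window, no startup cost)
print(min_sustainable_interval(startup_min=0, process_min_per_data_min=20 / 15))
```

Under this model the overhead-bound job falls behind at 15 minutes but is comfortable at an hour. The throughput-bound job cannot be rescued by any schedule, so the work has to go into the job itself.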

Figure out your actual latency requirement before you commit to an architecture. Then find the simplest implementation that meets it. More often than not, micro-batch is that implementation. I am here to help if you want to think through the math for your specific case.
