Streaming vs Batch in Databricks: Making the Decision You Will Have to Live With

The question comes up on every Databricks project: should this pipeline run as a stream or as a batch? Most writing on the topic frames it as a debate in which streaming is the obviously modern choice. The reality is more useful: batch is the right answer more often than you'd think, and choosing wrong in either direction has real costs.

What the Decision Actually Comes Down To

The question isn't "which is better?" but "what does the use case actually require?"

Latency requirement: how fresh does the data need to be? Sub-minute freshness genuinely requires streaming. Hour-old data can be served by a batch job that runs hourly. Daily-refresh reporting needs nothing more than a daily batch job. Most business analytics use cases are fine with hourly or daily freshness, and streaming adds complexity without adding value for them.

State management: streaming pipelines with aggregations require checkpointing, watermarking (deciding how long to wait for late events), and careful handling of late data. Batch pipelines read a window of data, compute a result, write it, and exit cleanly. If your pipeline logic is complex, batch is significantly easier to reason about and debug.
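To make the watermarking idea concrete, here is a minimal plain-Python sketch. It is a simplified model, not Spark's implementation: Spark advances the watermark per micro-batch and applies its own rules for closing windows. The window size, lateness threshold, and sample timestamps are all hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 300      # 5-minute tumbling windows (hypothetical)
LATENESS_SECONDS = 600    # tolerate events up to 10 minutes late (hypothetical)

def windowed_counts(event_times):
    """Count events per tumbling window, dropping events older than the watermark.

    event_times: event timestamps in seconds, listed in arrival order.
    Returns (counts keyed by window start, list of dropped timestamps).
    """
    counts = defaultdict(int)
    dropped = []
    max_seen = None  # largest event time observed so far
    for ts in event_times:
        # The watermark trails the newest event time seen before this event.
        watermark = None if max_seen is None else max_seen - LATENESS_SECONDS
        if watermark is not None and ts < watermark:
            dropped.append(ts)  # older than the watermark: treated as too late
        else:
            window_start = ts - ts % WINDOW_SECONDS
            counts[window_start] += 1
        max_seen = ts if max_seen is None else max(max_seen, ts)
    return dict(counts), dropped

# Events arrive out of order; 150 and 650 show up after much newer events.
arrivals = [10, 200, 320, 1000, 150, 450, 1300, 650]
counts, dropped = windowed_counts(arrivals)
print(counts)   # {0: 2, 300: 2, 900: 1, 1200: 1}
print(dropped)  # [150, 650]
```

Note that the event at 450 is late but still counted, because it falls inside the tolerance. A batch job that reads a closed time range never has to make this call, which is a large part of why it is easier to debug.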

Cost model: a continuously running streaming cluster is up 24/7 whether your topic is active or not. A batch job that runs for 20 minutes every hour uses about a third of that compute time. For low-volume streams, batch is substantially cheaper. For high-volume streams where continuous processing is necessary to keep up, streaming is required and the cost is justified.
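The arithmetic behind that comparison can be sketched in a few lines. The hourly rate below is a made-up placeholder, not Databricks pricing, and the sketch assumes both options use the same cluster size and ignores cluster startup time.

```python
MINUTES_PER_DAY = 24 * 60

def daily_cluster_hours(run_minutes: float, interval_minutes: float) -> float:
    """Cluster-hours per day for a job that runs run_minutes every interval_minutes."""
    runs_per_day = MINUTES_PER_DAY / interval_minutes
    return runs_per_day * run_minutes / 60

streaming_hours = 24.0                       # always-on cluster
batch_hours = daily_cluster_hours(20, 60)    # 20 minutes every hour

HOURLY_RATE = 3.0  # hypothetical all-in cost per cluster-hour

print(f"streaming: {streaming_hours:.0f} h/day, cost {streaming_hours * HOURLY_RATE:.2f}")
print(f"batch:     {batch_hours:.0f} h/day, cost {batch_hours * HOURLY_RATE:.2f}")
```

With these inputs the batch job uses 8 cluster-hours a day against 24 for the always-on cluster. Plug in your own run time and interval to see where the gap closes.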

When Streaming Is the Right Answer

  • Real-time fraud detection where a 10-minute delay means an uncaught transaction
  • Customer-facing dashboards that need to reflect activity in the last few minutes
  • Kafka consumer pipelines where the message rate is high enough that a batch job would take too long to catch up if it fell behind
  • Continuous aggregations (rolling 5-minute counts, real-time session tracking)

When Batch Is the Right Answer

  • Daily, weekly, or monthly reporting where freshness is measured in hours or days
  • Complex multi-stage transformations with joins and aggregations that are hard to express as a stateful stream
  • Incremental loads from SQL Server or other transactional systems (JDBC-based ingestion is almost always batch)
  • Machine learning feature computation that runs before a training job
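The incremental-load case is worth a sketch, since it shows why JDBC-style ingestion is naturally batch: each run asks the source for rows changed since the previous run. The example below uses an in-memory SQLite database as a stand-in for SQL Server. The `orders` table, its columns, and the `load_increment` helper are hypothetical.

```python
import sqlite3

def load_increment(source, last_loaded_at):
    """Return rows modified after last_loaded_at, plus the new high-water mark."""
    rows = source.execute(
        "SELECT id, amount, updated_at FROM orders "
        "WHERE updated_at > ? ORDER BY updated_at",
        (last_loaded_at,),
    ).fetchall()
    # If nothing changed, keep the previous mark.
    new_mark = rows[-1][2] if rows else last_loaded_at
    return rows, new_mark

# In-memory stand-in for the transactional source (hypothetical schema).
source = sqlite3.connect(":memory:")
source.execute("CREATE TABLE orders (id INTEGER, amount REAL, updated_at TEXT)")
source.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2024-01-01T09:00:00"), (2, 25.5, "2024-01-01T10:30:00")],
)

# First run: everything is new.
first_batch, mark = load_increment(source, "1970-01-01T00:00:00")

# A row arrives between runs; the next run picks up only that row.
source.execute("INSERT INTO orders VALUES (3, 7.25, '2024-01-01T11:15:00')")
second_batch, mark = load_increment(source, mark)
print(len(first_batch), second_batch, mark)
```

A strict greater-than comparison can miss rows that share the stored timestamp but commit later, so a real pipeline usually needs an overlap window or a monotonically increasing change key.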

The Middle Ground: Micro-Batch With a Trigger

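# Runs the stream once and exits: the "trigger once" pattern.
# Gets you sub-hourly freshness with batch-like simplicity.
# orders_stream is assumed to be a streaming DataFrame defined earlier,
# for example one created with spark.readStream from a Kafka topic.
# Note: trigger(once=True) is deprecated in newer Spark releases in favor
# of trigger(availableNow=True); use that where your runtime supports it.
(orders_stream.writeStream
  .format("delta")
  .outputMode("append")
  .option("checkpointLocation", "/mnt/checkpoints/orders")  # tracks progress between runs
  .trigger(once=True)
  .start()
  .awaitTermination())  # block until the accumulated data is processed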

Trigger-once streaming is a useful middle ground: you get the streaming API (Kafka source, incremental processing, checkpoint-based progress tracking that gives exactly-once writes to a Delta sink) without the continuous compute cost. Run it every 15 minutes via a Databricks job. The cluster starts, processes all accumulated messages, and terminates. Freshness: roughly 15 minutes, plus the run time of the job. Cost: similar to batch.
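For the 15-minute schedule, a Databricks job takes a Quartz cron expression. The helper below is a small sketch that builds the schedule block as a Python dict. The field names follow my understanding of the Jobs API schedule settings, so check them against the current API reference before relying on them. The function name is hypothetical.

```python
import json

def every_n_minutes_schedule(minutes: int, timezone_id: str = "UTC") -> dict:
    """Build a schedule block that fires every `minutes` minutes, aligned to the hour."""
    if not 1 <= minutes <= 59 or 60 % minutes != 0:
        raise ValueError("minutes must be between 1 and 59 and divide 60 evenly")
    return {
        # Quartz field order: second minute hour day-of-month month day-of-week
        "quartz_cron_expression": f"0 0/{minutes} * * * ?",
        "timezone_id": timezone_id,
        "pause_status": "UNPAUSED",
    }

schedule = every_n_minutes_schedule(15)
print(json.dumps(schedule, indent=2))
```

The divisibility check keeps runs evenly spaced. An interval such as 25 minutes would fire at :00, :25, and :50 and then restart at the top of the hour, leaving an uneven gap.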

If you're starting a new pipeline and aren't sure which to use: start with batch, measure whether the latency is acceptable, and switch to streaming only if the business case actually requires sub-hourly freshness. The reverse migration (stream-to-batch) is almost never necessary.
