Cloud cost discipline matters more now than it did two years ago. The pandemic has pushed a lot of workloads into the cloud on short timelines, and a lot of those workloads were not designed with cost in mind. Streaming pipelines are a common area of waste — they are easy to stand up on a generous cluster, hard to right-size, and nobody questions the bill until it is significant.
Let me put specific numbers on the cost of always-on streaming consumers in Databricks, and then show you the math on the alternative.
The Always-On Cluster Cost
A minimal Databricks streaming cluster for a production Kafka consumer in 2020 needs a driver node and one or two worker nodes. Let us use a common configuration: Standard_DS3_v2 on Azure (4 vCPUs, 14 GB RAM), with spot workers where possible.
- Driver: Standard_DS3_v2 on-demand: ~$0.22/hour
- 2 workers: Standard_DS3_v2 spot: ~$0.07/hour each
- Databricks DBUs: Standard tier ~$0.20/DBU/hour, about 2 DBUs per hour across the DS3_v2 cluster: ~$0.40/hour
- Total per hour (rough): ~$0.22 + $0.14 + $0.40 = ~$0.76/hour
- Per month (720 hours): ~$547/month per streaming job
One always-on job doing all three stages costs about $547/month; three separate always-on streaming jobs cost about $1,641/month. That is before storage, networking, or the adjacent services.
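The arithmetic above can be reproduced with a few lines of Python. This is a minimal sketch that uses the rough rates quoted in this section; the function and parameter names are illustrative, and the rates should be replaced with your own region and tier pricing.

```python
HOURS_PER_MONTH = 720  # 30 days x 24 hours, as in the estimate above


def always_on_hourly_cost(driver_rate, worker_rate, num_workers, dbu_rate, dbus_per_hour):
    """Hourly cost = VM cost for driver and workers + Databricks DBU charge."""
    vm_cost = driver_rate + worker_rate * num_workers
    dbu_cost = dbu_rate * dbus_per_hour
    return vm_cost + dbu_cost


# Rough rates from the list above (USD per hour).
hourly = always_on_hourly_cost(
    driver_rate=0.22,   # on-demand driver
    worker_rate=0.07,   # spot worker, each
    num_workers=2,
    dbu_rate=0.20,      # per DBU
    dbus_per_hour=2,    # for the whole cluster
)
monthly = hourly * HOURS_PER_MONTH

print(f"hourly  ~${hourly:.2f}")
print(f"monthly ~${monthly:.0f} per always-on job")
```

Keeping the rates as named arguments makes it easy to rerun the estimate when a price or node count changes.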
The Micro-Batch Alternative
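With trigger(once=True) in Structured Streaming, the job runs, processes all available data, and stops. (trigger(availableNow=True) behaves similarly but was added in Apache Spark 3.3, well after the Spark 3.0 based 7.3.x runtime used in the job definition below.) If each run takes 4 minutes and you run every 15 minutes, the cluster exists for 4/15 ≈ 27% of the time.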
- Same cluster cost: ~$0.76/hour
- Effective utilization: 27%
- Effective hourly cost: ~$0.76 * 0.27 = ~$0.21/hour
- Per month: ~$151/month per job
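The duty-cycle estimate above generalizes to any run length and schedule interval. The sketch below is one way to write it down; the startup_min parameter is an assumption that does not appear in the figures above, added because a freshly created job cluster takes some time to start before the notebook runs. Treating those minutes as billed at the full hourly rate is a simplification.

```python
HOURS_PER_MONTH = 720


def scheduled_monthly_cost(hourly_rate, run_min, interval_min, startup_min=0.0):
    """Monthly cost of a cluster that exists only while a scheduled run is active.

    startup_min is an assumption, not part of the original estimate: minutes of
    cluster start-up per run, billed here at the full hourly rate. The default
    of 0 reproduces the simple run/interval duty cycle used in the text.
    """
    billed_min = run_min + startup_min
    if billed_min > interval_min:
        raise ValueError("run does not fit inside its schedule interval")
    duty_cycle = billed_min / interval_min
    return hourly_rate * duty_cycle * HOURS_PER_MONTH


base = scheduled_monthly_cost(0.76, run_min=4, interval_min=15)
with_startup = scheduled_monthly_cost(0.76, run_min=4, interval_min=15, startup_min=3)

print(f"no startup overhead:  ~${base:.0f}/month")
print(f"3 min startup per run: ~${with_startup:.0f}/month")
```

Because this version does not round the duty cycle to a whole percentage, its result can differ by a few dollars from a figure computed with rounded intermediate values. With the assumed 3-minute startup, the same job comes to roughly $255/month instead of roughly $146/month, so it is worth measuring startup time on your own workspace before relying on the estimate.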
For three jobs (raw ingest, deserialize, merge) running on separate schedules:
- Raw ingest (runs every 5 min, 2 min each): 40% utilization → ~$219/month
- Deserialize (runs every 15 min, 3 min each): 20% utilization → ~$110/month
- Merge (runs every 30 min, 5 min each): 17% utilization → ~$93/month
- Total: ~$422/month
Three separate jobs, three separate clusters running only when needed: $422/month. One always-on monolithic job: $547/month. And the three-job architecture still refreshes the gold layer every 30 minutes (the merge schedule above), which for most analytics workloads is indistinguishable from real-time.
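The three-job comparison can be checked the same way. This sketch takes the run lengths and intervals from the list above and the $0.76/hour cluster rate from the earlier estimate; the job keys are illustrative labels. It computes utilization without rounding, so the total lands a few dollars below the rounded per-job figures.

```python
HOURS_PER_MONTH = 720
CLUSTER_HOURLY = 0.76  # cluster cost per hour from the earlier estimate

# (job label, run minutes, schedule interval in minutes) as listed above
JOBS = [
    ("raw_ingest", 2, 5),
    ("deserialize", 3, 15),
    ("merge", 5, 30),
]


def monthly_cost(run_min, interval_min, hourly=CLUSTER_HOURLY):
    """Cost of a cluster that is up for run_min out of every interval_min."""
    return hourly * (run_min / interval_min) * HOURS_PER_MONTH


costs = {name: monthly_cost(run, interval) for name, run, interval in JOBS}
total = sum(costs.values())
always_on = CLUSTER_HOURLY * HOURS_PER_MONTH

# Sum of the individual duty cycles; the scheduled design is cheaper than
# one always-on cluster of the same size only while this stays below 1.
combined_duty = sum(run / interval for _, run, interval in JOBS)

for name, cost in costs.items():
    print(f"{name:12s} ~${cost:.0f}/month")
print(f"{'total':12s} ~${total:.0f}/month vs ~${always_on:.0f}/month always-on")
print(f"combined duty cycle: {combined_duty:.2f}")
```

The combined duty cycle is the number to watch: if runs get longer or schedules get tighter and it approaches 1, the cost advantage over a single always-on cluster disappears.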
Databricks job definition with an auto-terminating cluster (the schedule runs every 5 minutes):

    {
      "name": "sensor-raw-ingest",
      "tasks": [
        {
          "task_key": "raw_ingest",
          "new_cluster": {
            "spark_version": "7.3.x-scala2.12",
            "node_type_id": "Standard_DS3_v2",
            "num_workers": 2,
            "azure_attributes": {
              "availability": "SPOT_WITH_FALLBACK_AZURE",
              "spot_bid_max_price": -1
            }
          },
          "notebook_task": {
            "notebook_path": "/Pipelines/sensor-raw-ingest"
          }
        }
      ],
      "schedule": {
        "quartz_cron_expression": "0 */5 * * * ?",
        "timezone_id": "UTC"
      }
    }
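Since the three jobs differ mainly in name, notebook, and schedule, the definitions can be generated from one template. The sketch below mirrors the structure of the definition above. The deserialize and merge job names, task keys, and notebook paths are hypothetical placeholders, and the helper only builds the JSON; it does not submit anything to a workspace.

```python
import json


def job_definition(name, task_key, notebook_path, every_minutes):
    """Build a job spec shaped like the definition above for a given schedule."""
    if not 1 <= every_minutes <= 59:
        raise ValueError("every_minutes must be between 1 and 59")
    return {
        "name": name,
        "tasks": [{
            "task_key": task_key,
            "new_cluster": {
                "spark_version": "7.3.x-scala2.12",
                "node_type_id": "Standard_DS3_v2",
                "num_workers": 2,
                "azure_attributes": {
                    "availability": "SPOT_WITH_FALLBACK_AZURE",
                    "spot_bid_max_price": -1,
                },
            },
            "notebook_task": {"notebook_path": notebook_path},
        }],
        "schedule": {
            # Quartz fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": f"0 */{every_minutes} * * * ?",
            "timezone_id": "UTC",
        },
    }


jobs = [
    job_definition("sensor-raw-ingest", "raw_ingest", "/Pipelines/sensor-raw-ingest", 5),
    # Hypothetical names and paths for the two downstream stages.
    job_definition("sensor-deserialize", "deserialize", "/Pipelines/sensor-deserialize", 15),
    job_definition("sensor-merge", "merge", "/Pipelines/sensor-merge", 30),
]

print(json.dumps(jobs[0], indent=2))
```

Generating the specs from Python also sidesteps a common pitfall: JSON itself has no comment syntax, so notes such as the schedule description belong in the generating code rather than in the JSON payload.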
The Spot Instance Caveat
Spot instances can be preempted. For an always-on consumer, a spot preemption means your pipeline stops until the instance is replaced (usually within minutes, but not guaranteed). For a micro-batch job that runs for 4 minutes and stops, a spot preemption during the 4-minute window means that run fails and the next scheduled run picks up where the checkpoint left off. The impact of a single preemption is bounded.
This is another reason the micro-batch model is more resilient than always-on at equivalent cost: the blast radius of a spot preemption is one job run, not an ongoing pipeline outage.
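The recovery behavior described above depends on the streaming query writing to a checkpoint location. The following is a minimal PySpark sketch of what the notebook behind such a job might look like, not the actual pipeline code. It assumes an existing SparkSession, a Delta sink, and caller-supplied placeholder values for the broker, topic, and paths. It uses trigger(once=True), which is available on Spark 3.0; on Spark 3.3 and later, trigger(availableNow=True) is the alternative.

```python
def run_raw_ingest(spark, bootstrap_servers, topic, checkpoint_path, output_path):
    """Process whatever Kafka data is available, then stop.

    `spark` is an existing SparkSession (Databricks notebooks provide one as
    `spark`). All other arguments are placeholders supplied by the caller.
    """
    source = (
        spark.readStream.format("kafka")
        .option("kafka.bootstrap.servers", bootstrap_servers)
        .option("subscribe", topic)
        # Only consulted on the first run, before a checkpoint exists.
        .option("startingOffsets", "earliest")
        .load()
    )
    query = (
        source.writeStream.format("delta")
        # Progress (Kafka offsets) is tracked here, so a failed or preempted
        # run is retried from the last committed batch on the next run.
        .option("checkpointLocation", checkpoint_path)
        .trigger(once=True)
        .start(output_path)
    )
    # Blocks until the available data has been processed and the query stops.
    query.awaitTermination()
    return query
```

In a notebook this would be called once per scheduled run, for example run_raw_ingest(spark, "broker:9092", "sensors", "/chk/raw", "/delta/raw"), where every argument value is a placeholder. The checkpoint path has to stay the same across runs for the resume behavior to work.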
Run the cost math for your specific instance types, DBU tier, and schedule before presenting it to stakeholders. The numbers vary, but the direction is consistent: scheduled micro-batch is cheaper than always-on for workloads with a latency tolerance measured in minutes, as long as each run (including cluster startup time) takes only a fraction of its schedule interval. I am here to help if you want to work through the calculation for your setup.