Databricks Clusters: Interactive, Job, and How to Not Run Up a Cloud Bill

The first time I got a Databricks invoice that was significantly higher than expected, I went looking for where the compute was going. The answer: an interactive cluster I'd left running overnight while prototyping. Nobody was using it. It was just there, burning DBUs. Here's how to not repeat that mistake.

Two Types of Clusters

All-Purpose (Interactive) Clusters are persistent, multi-user clusters intended for development, exploration, and collaborative notebook work. They stay running until you manually terminate them or until the auto-termination timeout fires. They support attaching multiple notebooks simultaneously. DBUs accrue the entire time the cluster is running, whether or not any code is executing.

Job Clusters are created automatically at job start time and terminated automatically when the job finishes. They support exactly one job run. Cost is limited to the job execution time. For production pipelines, these are what you want.

# In Databricks Jobs UI or Jobs API:
# Job cluster configuration looks like this. Size the cluster with either
# a fixed "num_workers" or an "autoscale" range, not both:
{
    "new_cluster": {
        "spark_version": "7.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 4
    }
}

Cluster Sizing

Every cluster has a driver node and zero or more worker nodes. Driver and workers can be different instance types — in practice, you want the driver to be modestly sized (it's coordinating work, not doing it) and the workers sized for your actual computation load.

For Databricks on Azure, the Standard_DS3_v2 (4 vCPUs, 14 GB RAM) is a solid worker default for analytical workloads. Each worker contributes its cores to the executor pool. A 4-worker cluster of DS3_v2 gives you 16 executor cores and 56 GB of raw worker RAM; the memory actually available to executors is somewhat lower, because the OS and the services running on each node take their share.

For SQL Server DBAs thinking in terms of MAXDOP: your effective parallelism is bounded by the smaller of two numbers, the partition count of your DataFrame and the number of executor cores. A 16-core cluster running a job with 200 shuffle partitions processes 16 partitions at a time and works through the rest in successive waves, so tune both together.
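To make the partitions-versus-cores arithmetic concrete, here's a small sketch. The function name is mine, and it assumes one task per core (Spark's default of one CPU per task) and tasks of roughly equal length, which real jobs rarely have.

```python
import math


def task_waves(num_partitions: int, num_workers: int, cores_per_worker: int) -> int:
    """Number of sequential task waves needed to process every partition once."""
    total_cores = num_workers * cores_per_worker
    # Each wave runs at most one task per core; a partial final wave still counts.
    return math.ceil(num_partitions / total_cores)


# 4 DS3_v2 workers x 4 cores each, against 200 shuffle partitions
print(task_waves(num_partitions=200, num_workers=4, cores_per_worker=4))  # 13
```

Twelve full waves cover 192 partitions and the thirteenth runs the last 8 on half the cores. If the partition count were a multiple of 16, that final wave wouldn't leave cores idle.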

Autoscaling

Databricks autoscaling adds and removes worker nodes based on workload. During large shuffles, it adds workers. Between jobs, it scales down. For interactive clusters with variable workloads (a team running different notebooks throughout the day), autoscaling is usually worth it.

For job clusters running consistent batch workloads, a fixed worker count is often better: you don't want scale-up latency in the middle of a production pipeline, and predictable sizing makes cost forecasting easier.
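In a cluster spec, the two sizing modes show up as either a fixed "num_workers" value or an "autoscale" range. Here's a sketch of a helper that produces one or the other. The helper name and its validation rules are my own, not part of any Databricks SDK.

```python
from typing import Optional


def sizing_block(fixed_workers: Optional[int] = None,
                 min_workers: Optional[int] = None,
                 max_workers: Optional[int] = None) -> dict:
    """Return a fixed-size or an autoscaling fragment for a cluster spec."""
    if fixed_workers is not None:
        if min_workers is not None or max_workers is not None:
            raise ValueError("choose a fixed size or an autoscale range, not both")
        return {"num_workers": fixed_workers}
    if min_workers is None or max_workers is None:
        raise ValueError("autoscale needs both min_workers and max_workers")
    if min_workers > max_workers:
        raise ValueError("min_workers must not exceed max_workers")
    return {"autoscale": {"min_workers": min_workers, "max_workers": max_workers}}


# Fixed size for a predictable batch job, a range for a shared dev cluster.
batch_job = {"node_type_id": "Standard_DS3_v2", **sizing_block(fixed_workers=4)}
team_dev = {"node_type_id": "Standard_DS3_v2", **sizing_block(min_workers=2, max_workers=8)}
print(batch_job)
print(team_dev)
```

Keeping the choice in one function makes it harder to end up with a spec that sets both by accident.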

Auto-Termination

Every interactive cluster should have auto-termination configured. The default is 120 minutes of inactivity. I set mine to 60. If you're done working for the day and forget to terminate the cluster, auto-termination saves you from an overnight bill.

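# Via Databricks CLI
# Note: clusters edit replaces the whole cluster spec, so send the existing
# settings along with the new timeout (values here match the earlier example).
databricks clusters edit --json '{
    "cluster_id": "your-cluster-id",
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    "autotermination_minutes": 60
}'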
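If you have more than a handful of clusters, it's worth auditing them rather than trusting memory. Here's a minimal sketch that flags clusters with auto-termination switched off. It assumes the input is a list of dicts shaped like the entries in the Clusters API list response ("cluster_name", "cluster_source", "autotermination_minutes"), and that a missing or zero timeout means the cluster never auto-terminates. The sample data is made up.

```python
from typing import Dict, List


def clusters_missing_autotermination(clusters: List[Dict]) -> List[str]:
    """Return names of non-job clusters that will never auto-terminate."""
    flagged = []
    for cluster in clusters:
        if cluster.get("cluster_source") == "JOB":
            continue  # job clusters shut down with their run
        # Assumption: a missing or zero value means auto-termination is off.
        if cluster.get("autotermination_minutes", 0) == 0:
            flagged.append(cluster.get("cluster_name", "<unnamed>"))
    return flagged


# Made-up sample data, not real API output.
sample = [
    {"cluster_name": "team-dev", "cluster_source": "UI", "autotermination_minutes": 60},
    {"cluster_name": "scratch", "cluster_source": "UI", "autotermination_minutes": 0},
    {"cluster_name": "nightly-etl-run", "cluster_source": "JOB"},
]
print(clusters_missing_autotermination(sample))  # ['scratch']
```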

Spot Instances

Azure Spot VMs (and AWS Spot instances) are surplus capacity sold at discounts of up to 90%, with the caveat that the provider can reclaim them on short notice (roughly 30 seconds of warning on Azure, two minutes on AWS). For worker nodes in batch pipelines that can retry failed tasks, spot workers can cut compute cost dramatically. For driver nodes, use on-demand: losing the driver mid-job terminates the entire application.

Databricks handles spot interruption gracefully for workers: if a worker is reclaimed, Spark reschedules its incomplete tasks on the remaining workers and recomputes any shuffle output that lived on the lost node. The job slows down temporarily but normally doesn't fail, unless interruptions are frequent enough to exhaust task retries.
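On Azure, the driver-on-demand, workers-on-spot split is expressed through the "azure_attributes" block of the cluster spec. The sketch below builds that fragment. The field names and values are the ones I understand the Clusters API to use, so check them against the current API reference before relying on them; the helper name is my own.

```python
def spot_worker_attributes(on_demand_nodes: int = 1) -> dict:
    """Build an azure_attributes fragment: driver on-demand, workers on spot."""
    if on_demand_nodes < 1:
        # With zero on-demand nodes the driver itself could land on spot capacity.
        raise ValueError("keep at least one on-demand node for the driver")
    return {
        "azure_attributes": {
            # The first N nodes (driver first) are placed on on-demand VMs.
            "first_on_demand": on_demand_nodes,
            # Use spot for the rest, falling back to on-demand if spot is unavailable.
            "availability": "SPOT_WITH_FALLBACK_AZURE",
            # -1: never evict on price, only when Azure needs the capacity back.
            "spot_bid_max_price": -1,
        }
    }


new_cluster = {
    "spark_version": "7.3.x-scala2.12",
    "node_type_id": "Standard_DS3_v2",
    "num_workers": 4,
    **spot_worker_attributes(),
}
print(new_cluster["azure_attributes"])
```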

DBU Pricing: What You're Actually Paying For

Databricks charges in Databricks Units (DBUs) per instance-hour. DBU rates vary by cluster type:

  • All-Purpose Compute (interactive): higher DBU rate, accrues while cluster is running
  • Jobs Compute: lower DBU rate, accrues only during job execution
  • SQL Compute (SQL warehouses): different rate again, idle cost when warehouse is on but not executing

The actual cost per node is (DBUs per hour × price per DBU + VM price per hour) × hours, summed across the driver and every worker. The two charges add rather than multiply: the DBU charge goes to Databricks and the VM charge goes to the cloud provider. This is why a cluster left idle overnight costs real money even if it's not doing anything: the VMs are running and Databricks is counting DBUs.
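Here's a rough sketch of how the two charges combine for a cluster left on overnight. Every number in it is a placeholder I made up for illustration, not a quoted price; plug in the DBU consumption for your instance type and the rates from your own pricing pages.

```python
def hourly_cluster_cost(nodes: int,
                        dbu_per_node_hour: float,
                        price_per_dbu: float,
                        vm_price_per_hour: float) -> float:
    """Hourly cost of a cluster: Databricks DBU charge plus cloud VM charge."""
    databricks_charge = nodes * dbu_per_node_hour * price_per_dbu
    cloud_charge = nodes * vm_price_per_hour
    return databricks_charge + cloud_charge


# Placeholder inputs: 1 driver + 4 workers, all the same instance type.
per_hour = hourly_cluster_cost(
    nodes=5,
    dbu_per_node_hour=0.75,   # hypothetical DBU consumption per node
    price_per_dbu=0.55,       # hypothetical all-purpose rate, USD per DBU
    vm_price_per_hour=0.30,   # hypothetical VM price, USD per hour
)
idle_hours = 12  # left running from evening to morning
print(f"${per_hour * idle_hours:.2f} for one idle night")
```

With these made-up inputs the idle night comes to about $43, and more than half of it is the DBU charge rather than the VMs.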

The operational rule: dev and exploration on interactive clusters with auto-termination. Production pipelines on job clusters, sized and measured carefully. Right-size the cluster for each pipeline — a light transformation job doesn't need the same cluster as a heavy ML training job.
