Databricks Connect: Testing Spark Code Without Spinning Up a Cluster Session
Spark jobs have a startup cost. The cluster needs to provision, the JVMs need to start, the session needs to initialize. For a two-hour batch job, that's noise. For a 45-second unit test, that's the entire runtime budget.
Databricks Connect lets you run Spark code from your local machine against a remote Databricks cluster. Write code in your IDE, debug it locally with your usual tools, and execute it against the actual cluster without going through the notebook interface. For SQL Server DBAs building data pipelines, this changes the development loop significantly.
Setting Up Databricks Connect
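# Remove any standalone PySpark first; it conflicts with the databricks-connect package
pip uninstall pyspark
# Install the library (version must match your cluster's Databricks Runtime version)
pip install -U "databricks-connect==6.4.*"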
# Configure with your workspace details
databricks-connect configure
The configure step prompts for:
- Databricks host (e.g., https://adb-1234567890.12.azuredatabricks.net)
- Personal access token
- Cluster ID to connect to
- Org ID (Azure only)
- Port (default 15001)
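To make the five prompts concrete, here is a minimal sketch of the same settings as a Python structure with a few sanity checks. The ConnectSettings class, its field names, and the validation rules are hypothetical and for illustration only. They are not part of the databricks-connect package, and the token, cluster ID, and org ID shown are placeholders.

```python
from dataclasses import dataclass
from typing import List, Optional
from urllib.parse import urlparse


@dataclass
class ConnectSettings:
    """Hypothetical container mirroring the values the configure step asks for."""
    host: str
    token: str
    cluster_id: str
    org_id: Optional[str] = None  # only needed on Azure
    port: int = 15001             # default port from the prompt list

    def problems(self) -> List[str]:
        """Return human-readable problems; an empty list means the values look sane."""
        found = []
        parsed = urlparse(self.host)
        if parsed.scheme != "https" or not parsed.netloc:
            found.append("host should be a full https:// workspace URL")
        if not self.token:
            found.append("personal access token is empty")
        if not self.cluster_id:
            found.append("cluster ID is empty")
        if not 1 <= self.port <= 65535:
            found.append("port is outside the valid TCP range")
        return found


settings = ConnectSettings(
    host="https://adb-1234567890.12.azuredatabricks.net",
    token="<personal-access-token>",  # placeholder, never commit a real token
    cluster_id="<cluster-id>",        # placeholder
    org_id="<org-id>",                # placeholder, Azure only
)
print(settings.problems())
```

Checking values like this before running configure is optional. It mainly catches the common mistake of pasting a host without the https:// prefix.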
After configuration, any Python script in that environment that creates a SparkSession routes its Spark operations to your Databricks cluster:
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum
spark = SparkSession.builder.getOrCreate()
# This runs on your Databricks cluster, not locally
df = spark.read.format("delta").load("/mnt/myproject/silver/orders")
result = (
    df.filter(col("region") == "West")
    .groupBy("order_date")
    .agg(spark_sum("total_amount").alias("daily_revenue"))
)
result.show()
What Runs Where
Databricks Connect sends Spark operations to the cluster for execution and returns only the results to your local machine. Your local Python process handles the orchestration logic, while DataFrame transformations and aggregations run on the cluster. Calls such as .toPandas() and .collect() pull the full result set into local memory, so filter or aggregate on the cluster before using them.
This means you can:
- Set breakpoints in your IDE and step through pipeline logic
- Run unit tests against real cluster behavior without notebooks
- Use local git workflows and standard Python tooling
- Test against real data in your development environment before submitting as a job
The Version Pinning Requirement
The Databricks Connect package version must match your cluster's Databricks Runtime (DBR) version at the major and minor level. If your cluster runs DBR 6.4, you need databricks-connect==6.4.*. This is stricter than most Python library versioning. If you work across multiple workspaces with different DBR versions, use a separate virtual environment per workspace.
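The matching rule can be checked in a few lines before a test run starts. This is a minimal sketch, not part of Databricks Connect: the helper names are hypothetical, and it assumes you supply the cluster's DBR version yourself (for example from the cluster page), since the sketch does not query the workspace.

```python
def major_minor(version: str) -> tuple:
    """Extract (major, minor) from strings such as '6.4', '6.4.2' or '6.4.x-scala2.11'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def versions_match(client_version: str, dbr_version: str) -> bool:
    """True when the client package and the cluster runtime share major.minor."""
    return major_minor(client_version) == major_minor(dbr_version)


def installed_client_version() -> str:
    """Look up the installed databricks-connect version (needs Python 3.8+)."""
    from importlib.metadata import version

    return version("databricks-connect")


print(versions_match("6.4.2", "6.4"))  # True
print(versions_match("7.3.5", "6.4"))  # False
```

In a real environment you would pass installed_client_version() as the first argument and fail fast when the check returns False, rather than waiting for a confusing error from the cluster.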
# Create a venv for each DBR version you work with
python -m venv dbr64_env
source dbr64_env/bin/activate
pip install "databricks-connect==6.4.*"
Limitations to Know About
A few things that don't work through Databricks Connect, or work only under certain conditions:
- dbutils methods (file system operations, secrets, widgets): the notebook's built-in dbutils object requires a running notebook context, so code written against it fails when run locally
- Spark Structured Streaming: not currently supported in Databricks Connect
- MLflow autologging — works, but only if the cluster has it configured
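One way to keep pipeline code runnable in both places is to isolate the notebook-only dependency behind a small function. The sketch below does this for secrets. The get_secret helper and the environment-variable naming convention are hypothetical, and the only dbutils call it relies on is dbutils.secrets.get(scope, key).

```python
import os


def get_secret(scope: str, key: str, dbutils=None) -> str:
    """Read a secret from dbutils when available, else from an environment variable."""
    if dbutils is not None:
        # Notebook or job context: use the real secrets utility.
        return dbutils.secrets.get(scope=scope, key=key)
    # Local run: fall back to a variable such as MYSCOPE_DB_PASSWORD (hypothetical convention).
    env_name = f"{scope}_{key}".upper().replace("-", "_")
    value = os.environ.get(env_name)
    if value is None:
        raise KeyError(f"secret not found; set {env_name} for local runs")
    return value
```

In a notebook you would call get_secret("myscope", "db-password", dbutils). Locally you would omit the last argument and export the variable in your shell, so the rest of the pipeline never references dbutils directly.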
For pipeline code that doesn't need notebook-specific features, Databricks Connect gives you a proper engineering workflow: write locally, test against real infrastructure, commit when it works. That loop is much tighter than edit-upload-notebook-run-check-output.