Databricks Connect: Testing Spark Code Without Spinning Up a Cluster Session

Shannon Lowder

24 May 2019 — 2 min read

Spark jobs have a startup cost. The cluster needs to provision, the JVMs need to start, the session needs to initialize. For a two-hour batch job, that's noise. For a 45-second unit test, that's the entire runtime budget.

Databricks Connect lets you run Spark code from your local machine against a remote Databricks cluster. Write code in your IDE, debug it locally with your usual tools, and execute it against the actual cluster without going through the notebook interface. For SQL Server DBAs building data pipelines, this changes the development loop significantly.

Setting Up Databricks Connect

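# Remove any standalone PySpark first; it conflicts with the Databricks Connect package
pip uninstall pyspark

# Install the client (version must match your cluster's Databricks Runtime version)
pip install -U "databricks-connect==6.4.*"

# Configure with your workspace details
databricks-connect configure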

The configure step prompts for:

Databricks host (e.g., https://adb-1234567890.12.azuredatabricks.net)
Personal access token
Cluster ID to connect to
Org ID (Azure only)
Port (default 15001)
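Interactive prompts are awkward on a build agent. As I understand the documentation for this client, the same five settings can also come from environment variables. The sketch below is a small pre-flight check under that assumption: the variable names are the ones I believe the client reads (verify them against the docs for your version), and missing_connect_settings is a hypothetical helper of mine, not part of the library.

```python
import os

# Assumed environment variable names for the Databricks Connect client.
# Check them against the documentation for the version you installed.
REQUIRED_VARS = ("DATABRICKS_ADDRESS", "DATABRICKS_API_TOKEN", "DATABRICKS_CLUSTER_ID")
OPTIONAL_VARS = ("DATABRICKS_ORG_ID", "DATABRICKS_PORT")  # Org ID is Azure only


def missing_connect_settings(env=None):
    """Return the required setting names that are absent or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]
```

Calling missing_connect_settings() at the top of a test run and failing fast on a non-empty list gives a clearer error than a connection timeout a minute later.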

After configuration, any Python script that builds its session with SparkSession.builder.getOrCreate() routes its Spark operations to your Databricks cluster:

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, sum as spark_sum

spark = SparkSession.builder.getOrCreate()

# This runs on your Databricks cluster, not locally
df = spark.read.format("delta").load("/mnt/myproject/silver/orders")
result = (
    df.filter(col("region") == "West")
    .groupBy("order_date")
    .agg(spark_sum("total_amount").alias("daily_revenue"))
)
result.show()

What Runs Where

Databricks Connect executes Spark operations on the cluster and brings results back to your local machine. Your local Python process handles the orchestration logic. Large DataFrame operations happen on the cluster. .toPandas() and .collect() pull the full result into local memory, so filter, aggregate, or limit on the cluster before you call them.
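One way to keep that boundary explicit is to cap the row count on the cluster before anything crosses the network. A minimal sketch: fetch_sample is a hypothetical helper name, and the 1000-row default is an arbitrary choice of mine. It uses only DataFrame.limit() and DataFrame.toPandas(), so pandas has to be installed locally.

```python
def fetch_sample(df, max_rows=1000):
    """Return at most max_rows rows of a Spark DataFrame as a pandas DataFrame.

    limit() is part of the Spark plan and runs on the cluster, so only the
    capped result is transferred when toPandas() materializes it locally.
    """
    if max_rows <= 0:
        raise ValueError("max_rows must be positive")
    return df.limit(max_rows).toPandas()
```

Something like fetch_sample(result, 50) is handy at a breakpoint, when you want to look at real rows without pulling a whole table onto your laptop.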

Because your Python code itself runs locally, you can:

Set breakpoints in your IDE and step through pipeline logic
Run unit tests against real cluster behavior without notebooks
Use local git workflows and standard Python tooling
Test against real data in your development environment before submitting as a job

The Version Pinning Requirement

The Databricks Connect version must match your cluster's Databricks Runtime (DBR) version: the major and minor numbers have to agree. If your cluster runs DBR 6.4, you need databricks-connect==6.4.*. This is stricter than most Python library versioning. If you work across multiple workspaces with different DBR versions, use a separate virtual environment per workspace.

# Create a venv for each DBR version you work with
python -m venv dbr64_env
source dbr64_env/bin/activate
pip install "databricks-connect==6.4.*"
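If you juggle several of these environments, a quick guard can catch the wrong one being active before you wait on a confusing connection error. A minimal sketch, assuming you know the cluster's runtime string (for example from the cluster page in the workspace); the helper names are mine, and importlib.metadata needs Python 3.8 or later.

```python
from importlib import metadata


def major_minor(version):
    """Extract (major, minor) from strings such as '6.4.1' or '6.4.x-scala2.11'."""
    parts = version.split(".")
    return int(parts[0]), int(parts[1])


def client_matches_runtime(client_version, runtime_version):
    """True when the client and the cluster runtime share major.minor."""
    return major_minor(client_version) == major_minor(runtime_version)


def installed_client_version():
    """Version of the databricks-connect package in the active environment."""
    return metadata.version("databricks-connect")
```

For example, client_matches_runtime(installed_client_version(), "6.4.x-scala2.11") should be true inside dbr64_env and false in an environment built for a different runtime.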

Limitations to Know About

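A few things that don't work through Databricks Connect, or work only with caveats:

Most dbutils methods (widgets, notebook workflows): these require a running notebook context. Some file system and secrets calls may be available through pyspark.dbutils.DBUtils, depending on your client version, so check the documentation before relying on them
Spark Structured Streaming: not supported in Databricks Connect currently
MLflow autologging: works, but only if the cluster has it configured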

For pipeline code that doesn't need notebook-specific features, Databricks Connect gives you a proper engineering workflow: write locally, test against real infrastructure, commit when it works. That loop is much tighter than edit-upload-notebook-run-check-output. As always, I'm here to help.
