JDBC in Databricks: Pulling Data From SQL Server Without Leaving Your Lake

Shannon Lowder

05 Jan 2019 — 2 min read

Your source data is still in SQL Server. That's fine — it doesn't have to move all at once. Databricks can read from SQL Server over JDBC, which means you can start building Spark pipelines against live data before you've migrated a single table.

Here's what the basic connection looks like:

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=OperationsDB"

df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.CustomerOrders")
  .option("user", "svc_databricks")
  .option("password", dbutils.secrets.get("myproject", "sql-password"))
  .load())

df.show(5)

Don't hardcode credentials — that's what secret scopes are for. I'll cover those separately, but the short version: use dbutils.secrets.get() and never put a password directly in a notebook.

The Problem With Naive JDBC Reads

The above works. For a 10,000-row reference table, it's fine. For a 50-million-row fact table, it will run one JDBC connection, read all 50 million rows over a single thread, and take a very long time while the rest of your cluster sits idle.

Spark can parallelize JDBC reads, but it needs help knowing how to split the work:

df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.CustomerOrders")
  .option("user", "svc_databricks")
  .option("password", dbutils.secrets.get("myproject", "sql-password"))
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "50000000")
  .option("numPartitions", "20")
  .load())

Spark divides the range between lowerBound and upperBound into numPartitions equal chunks and issues 20 parallel SQL queries — one per partition, each as its own JDBC connection.

Three gotchas that will bite you:

Bounds are hints, not constraints. If upperBound is lower than your actual max, Spark truncates — you'll miss rows. Set it above the actual max or pull it dynamically first.
Don't over-partition. Each partition is a separate SQL Server connection. 20 parallel connections is fine; 200 will saturate your connection pool.
Skewed IDs kill the benefit. If 80% of rows have IDs under 100,000, splitting 1–50,000,000 into 20 equal slices gives you one very busy partition and 19 nearly empty ones. Use a column with even distribution.

Pushing Predicates Down to SQL Server

When you filter a DataFrame that came from JDBC, Spark pushes simple predicates down to the SQL Server query — so you're not reading 50 million rows into Spark and filtering there.

-- Spark generates: SELECT * FROM dbo.CustomerOrders WHERE order_date >= '2019-01-01'
recent_orders = df.filter("order_date >= '2019-01-01'")

Check the EXPLAIN output to verify the filter is inside the FileScan jdbc operator, not above it. If it's above the scan, you're reading everything and filtering in Spark — which defeats the purpose.

For complex queries that Spark can't push down, use the query option with the full SQL string. Note that you can't combine query with partitionColumn — if the query result is large, wrap it as a subquery in a dbtable option instead.

Writing Back to SQL Server

(result_df.write
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.ProcessedOrders")
  .option("user", "svc_databricks")
  .option("password", dbutils.secrets.get("myproject", "sql-password"))
  .mode("overwrite")
  .save())

One thing to know: each partition write is its own transaction against SQL Server. If you have 20 partitions and partition 15 fails mid-write in overwrite mode, you end up with a partially-written table and no automatic rollback. Write to a staging table first, validate, then swap or merge into the final destination.

JDBC isn't the long-term answer — you'll want critical datasets in Delta Lake eventually. But it's a solid bridge while you're migrating, and understanding its performance characteristics up front saves you from discovering them at 2am. As always, I'm here to help.

JDBC in Databricks: Pulling Data From SQL Server Without Leaving Your Lake

Shannon Lowder

The Problem With Naive JDBC Reads

Pushing Predicates Down to SQL Server

Writing Back to SQL Server

Read more

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

You Don't Need Fable. You Need a Router.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving