JDBC in Databricks: Pulling Data From SQL Server Without Leaving Your Lake
Your source data is still in SQL Server. That's fine — it doesn't have to move all at once. Databricks can read from SQL Server over JDBC, which means you can start building Spark pipelines against live data before you've migrated a single table.
Here's what the basic connection looks like:
jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=OperationsDB"
df = (spark.read
.format("jdbc")
.option("url", jdbc_url)
.option("dbtable", "dbo.CustomerOrders")
.option("user", "svc_databricks")
.option("password", dbutils.secrets.get("myproject", "sql-password"))
.load())
df.show(5)Don't hardcode credentials — that's what secret scopes are for. I'll cover those separately, but the short version: use dbutils.secrets.get() and never put a password directly in a notebook.
The Problem With Naive JDBC Reads
The above works. For a 10,000-row reference table, it's fine. For a 50-million-row fact table, it will run one JDBC connection, read all 50 million rows over a single thread, and take a very long time while the rest of your cluster sits idle.
Spark can parallelize JDBC reads, but it needs help knowing how to split the work:
df = (spark.read
.format("jdbc")
.option("url", jdbc_url)
.option("dbtable", "dbo.CustomerOrders")
.option("user", "svc_databricks")
.option("password", dbutils.secrets.get("myproject", "sql-password"))
.option("partitionColumn", "order_id")
.option("lowerBound", "1")
.option("upperBound", "50000000")
.option("numPartitions", "20")
.load())Spark divides the range between lowerBound and upperBound into numPartitions equal chunks and issues 20 parallel SQL queries — one per partition, each as its own JDBC connection.
Three gotchas that will bite you:
- Bounds are hints, not constraints. If
upperBoundis lower than your actual max, Spark truncates — you'll miss rows. Set it above the actual max or pull it dynamically first. - Don't over-partition. Each partition is a separate SQL Server connection. 20 parallel connections is fine; 200 will saturate your connection pool.
- Skewed IDs kill the benefit. If 80% of rows have IDs under 100,000, splitting 1–50,000,000 into 20 equal slices gives you one very busy partition and 19 nearly empty ones. Use a column with even distribution.
Pushing Predicates Down to SQL Server
When you filter a DataFrame that came from JDBC, Spark pushes simple predicates down to the SQL Server query — so you're not reading 50 million rows into Spark and filtering there.
-- Spark generates: SELECT * FROM dbo.CustomerOrders WHERE order_date >= '2019-01-01'
recent_orders = df.filter("order_date >= '2019-01-01'")Check the EXPLAIN output to verify the filter is inside the FileScan jdbc operator, not above it. If it's above the scan, you're reading everything and filtering in Spark — which defeats the purpose.
For complex queries that Spark can't push down, use the query option with the full SQL string. Note that you can't combine query with partitionColumn — if the query result is large, wrap it as a subquery in a dbtable option instead.
Writing Back to SQL Server
(result_df.write
.format("jdbc")
.option("url", jdbc_url)
.option("dbtable", "dbo.ProcessedOrders")
.option("user", "svc_databricks")
.option("password", dbutils.secrets.get("myproject", "sql-password"))
.mode("overwrite")
.save())One thing to know: each partition write is its own transaction against SQL Server. If you have 20 partitions and partition 15 fails mid-write in overwrite mode, you end up with a partially-written table and no automatic rollback. Write to a staging table first, validate, then swap or merge into the final destination.
JDBC isn't the long-term answer — you'll want critical datasets in Delta Lake eventually. But it's a solid bridge while you're migrating, and understanding its performance characteristics up front saves you from discovering them at 2am. As always, I'm here to help.