JDBC in Databricks: Pulling Data From SQL Server Without Leaving Your Lake

Your source data is still in SQL Server. That's fine — it doesn't have to move all at once. Databricks can read from SQL Server over JDBC, which means you can start building Spark pipelines against live data before you've migrated a single table.

Here's what the basic connection looks like:

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;databaseName=OperationsDB"

df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.CustomerOrders")
  .option("user", "svc_databricks")
  .option("password", dbutils.secrets.get("myproject", "sql-password"))
  .load())

df.show(5)

Don't hardcode credentials — that's what secret scopes are for. I'll cover those separately, but the short version: use dbutils.secrets.get() and never put a password directly in a notebook.

The Problem With Naive JDBC Reads

The above works. For a 10,000-row reference table, it's fine. For a 50-million-row fact table, it will run one JDBC connection, read all 50 million rows over a single thread, and take a very long time while the rest of your cluster sits idle.

Spark can parallelize JDBC reads, but it needs help knowing how to split the work:

df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.CustomerOrders")
  .option("user", "svc_databricks")
  .option("password", dbutils.secrets.get("myproject", "sql-password"))
  .option("partitionColumn", "order_id")
  .option("lowerBound", "1")
  .option("upperBound", "50000000")
  .option("numPartitions", "20")
  .load())

Spark divides the range between lowerBound and upperBound into numPartitions equal chunks and issues 20 parallel SQL queries — one per partition, each as its own JDBC connection.

Three gotchas that will bite you:

  • Bounds are hints, not constraints. If upperBound is lower than your actual max, Spark truncates — you'll miss rows. Set it above the actual max or pull it dynamically first.
  • Don't over-partition. Each partition is a separate SQL Server connection. 20 parallel connections is fine; 200 will saturate your connection pool.
  • Skewed IDs kill the benefit. If 80% of rows have IDs under 100,000, splitting 1–50,000,000 into 20 equal slices gives you one very busy partition and 19 nearly empty ones. Use a column with even distribution.

Pushing Predicates Down to SQL Server

When you filter a DataFrame that came from JDBC, Spark pushes simple predicates down to the SQL Server query — so you're not reading 50 million rows into Spark and filtering there.

-- Spark generates: SELECT * FROM dbo.CustomerOrders WHERE order_date >= '2019-01-01'
recent_orders = df.filter("order_date >= '2019-01-01'")

Check the EXPLAIN output to verify the filter is inside the FileScan jdbc operator, not above it. If it's above the scan, you're reading everything and filtering in Spark — which defeats the purpose.

For complex queries that Spark can't push down, use the query option with the full SQL string. Note that you can't combine query with partitionColumn — if the query result is large, wrap it as a subquery in a dbtable option instead.

Writing Back to SQL Server

(result_df.write
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.ProcessedOrders")
  .option("user", "svc_databricks")
  .option("password", dbutils.secrets.get("myproject", "sql-password"))
  .mode("overwrite")
  .save())

One thing to know: each partition write is its own transaction against SQL Server. If you have 20 partitions and partition 15 fails mid-write in overwrite mode, you end up with a partially-written table and no automatic rollback. Write to a staging table first, validate, then swap or merge into the final destination.

JDBC isn't the long-term answer — you'll want critical datasets in Delta Lake eventually. But it's a solid bridge while you're migrating, and understanding its performance characteristics up front saves you from discovering them at 2am. As always, I'm here to help.

Read more