JDBC Partitioning: Reading Large SQL Server Tables Into Spark Without Running Out of Options

Reading a SQL Server table into Spark via JDBC is straightforward until the table is large. At that point, the naive approach — one JDBC connection, all rows, read sequentially — becomes impractically slow. The fix is parallel JDBC reads using a partition column, but the standard partition column approach has a gotcha that shows up as soon as your data doesn't have a clean integer ID.

The Basic Parallel Read (and Why It Breaks)

The standard Spark JDBC partitioning divides a numeric range into equal slices:

df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.OrderEvents")
  .option("partitionColumn", "event_id")
  .option("lowerBound", "1")
  .option("upperBound", "10000000")
  .option("numPartitions", "20")
  .load())

This issues 20 queries like: WHERE event_id BETWEEN 1 AND 500000, WHERE event_id BETWEEN 500001 AND 1000000, etc. Each query runs as a separate JDBC connection in parallel.

The problem: event IDs are rarely uniformly distributed. If you have auto-incrementing IDs over a few years of activity, the recent ranges (high IDs) have far more data than the early ranges (low IDs). You end up with 19 nearly-empty partitions and 1 enormous one — which gives you one executor doing 95% of the work while 19 sit idle.

The Fix: Date-Based Partitioning

For tables with a timestamp column, date-based partitioning distributes work by time window, which is usually more even than numeric ID ranges:

df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.OrderEvents")
  .option("partitionColumn", "event_date")
  .option("lowerBound", "2019-01-01")
  .option("upperBound", "2020-01-31")
  .option("numPartitions", "25")  # roughly one partition per 2 weeks
  .load())

Each partition handles a roughly equal time window. If you have seasonal patterns (more events in Q4), the workload is still uneven — but much less so than ID-based splits on an auto-increment column.

The Advanced Fix: Custom Predicates

For cases where you need full control over the split, you can specify explicit predicate strings instead of using the built-in partition logic:

predicates = [
    "event_date BETWEEN '2019-01-01' AND '2019-03-31'",
    "event_date BETWEEN '2019-04-01' AND '2019-06-30'",
    "event_date BETWEEN '2019-07-01' AND '2019-09-30'",
    "event_date BETWEEN '2019-10-01' AND '2019-12-31'",
    "event_date BETWEEN '2020-01-01' AND '2020-01-31'",
]

df = (spark.read
  .format("jdbc")
  .option("url", jdbc_url)
  .option("dbtable", "dbo.OrderEvents")
  .option("user", svc_user)
  .option("password", dbutils.secrets.get("myproject", "sql-password"))
  .load(predicates=predicates))

Each predicate becomes one partition. If you know your data is skewed toward Q4, you can split Q4 into more predicates than Q1–Q3. Manual, but fully precise.

The fetchSize Setting

Regardless of partitioning strategy, always set fetchsize:

.option("fetchsize", "10000")

Default is 10 rows per network round trip. For a 5-million-row partition, that's 500,000 round trips. Set it to 10,000 and you're down to 500. This alone can cut JDBC extract time by an order of magnitude for wide tables. As always, I'm here to help.

Read more