Spark Partitions: The Unit of Parallelism That Controls How Fast Your Jobs Actually Run

When a Spark job runs slower than expected, the first thing I check is partition count. Not cluster size. Not memory settings. Partitions. Because if your partition count is wrong, adding more machines doesn't help — you're just adding more workers standing around with nothing to do.

What a Partition Is

A Spark DataFrame is divided into chunks called partitions. Each partition is a slice of the data that lives on one executor node during processing. Each task processes exactly one partition. The maximum number of tasks running simultaneously is limited by your total available executor cores.

The math: if you have 5 executors with 4 cores each, you have 20 executor cores. If your DataFrame has 20 partitions, all 20 tasks run in parallel — you're using the full cluster. If you have 4 partitions, only 4 tasks run and 16 cores sit idle. If you have 200 partitions, Spark processes them in 10 waves (20 at a time), which adds coordination overhead.

This is the same concept as SQL Server's MAXDOP and degree of parallelism, but at a much more granular level. In SQL Server, you control how many threads one query uses. In Spark, partitions control how much actual parallelism your data processing achieves across an entire cluster.

Where Partitions Come From

When reading files: Spark splits input files at block boundaries. The default target partition size is 128MB. A 1.28GB Parquet file becomes roughly 10 partitions on read.

df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/sales/")
print(f"Input partitions: {df.rdd.getNumPartitions()}")

After a shuffle (groupBy, join, repartition): Spark creates new partitions, defaulting to the value of spark.sql.shuffle.partitions. The default is 200. This is almost always wrong for anything but medium-large clusters.

# After a groupBy, how many partitions?
result = df.groupBy("region").agg(F.sum("revenue"))
print(f"After groupBy: {result.rdd.getNumPartitions()}")  # probably 200

200 shuffle partitions on a 4-node cluster with 16 total cores means 200 tasks processed 13 waves at a time — 12 waves of 16 plus a final wave of 8. The last wave wastes 8 cores. More importantly, if your data is 5GB and gets divided into 200 partitions, each partition is 25MB — tiny. The task scheduling overhead per partition can exceed the actual computation time.

Tuning Partition Count

The general heuristic: aim for partition sizes of 100–200MB for large datasets, and match your partition count to 2–4x your total executor cores so tasks queue efficiently without excessive overhead.

# Fix shuffle partition count at the session level
spark.conf.set("spark.sql.shuffle.partitions", "32")

# Or tune per-operation with repartition
df_repart = df.repartition(32)

# Hash partition on a key column for join/groupBy efficiency
df_repart = df.repartition(32, F.col("customer_id"))

# coalesce() reduces partitions without a full shuffle (more efficient for writes)
df_small = df.coalesce(4)

The difference between repartition() and coalesce():

  • repartition(n): full shuffle, redistributes data evenly across n partitions. Use when you need more partitions or when you want balanced partition sizes.
  • coalesce(n): merges adjacent partitions without a full shuffle. Only reduces partition count. More efficient for writing output — common pattern before writing to avoid many small output files.

The Small-File Problem

When you write a Spark DataFrame to storage, each partition becomes a file. 200 partitions = 200 files. For a 5GB result set, that's 200 files of 25MB each. Downstream readers (Databricks, Azure Synapse, Spark itself) have to open and read 200 file handles instead of 10 or 20. This overhead adds up.

# Before writing, reduce to a reasonable file count
output_df = result.coalesce(10)
output_df.write.format("delta").mode("overwrite").save("path/to/output/")

Delta Lake's OPTIMIZE command handles this after the fact — it compacts many small files into larger ones. But getting the partition count right before writing saves you a compaction step.

Reading the Partition Situation in Spark UI

In Databricks, every job has a Spark UI available via the cluster's Jobs tab. The Stages view shows partition count, task count, and shuffle read/write sizes. If you see a stage with 200 tasks where each task processes 2MB of data, your shuffle partitions are too high. If you see 4 tasks on a 20-core cluster, you have too few partitions.

The right number isn't a fixed formula — it's data-dependent and cluster-dependent. But checking the Spark UI after your first runs gives you the data to make informed tuning decisions instead of guessing.

Next: the Hive Metastore — what it is, what it does for you, and why you want your tables registered in it rather than reading from raw file paths.

Read more