Spark Performance Tuning: Five Settings That Actually Move the Needle

After a few months of running Databricks in production, you develop a list of things you check when a job is slower than expected. Most of the tuning advice you'll find talks about partitioning strategy and Z-ordering, which are important — but there are a handful of configuration settings that make an immediate difference regardless of your data layout. Here are the ones I actually change.

spark.sql.shuffle.partitions

The default is 200. This controls how many partitions are created when Spark shuffles data — during joins, GROUP BY operations, and similar. For large datasets, 200 is often fine. For small datasets (say, processing a single day's orders that fit in 10 partitions before the shuffle), a post-shuffle with 200 partitions means 190 nearly-empty tasks that spend more time starting up than doing work.

# For smaller datasets, reduce this significantly
spark.conf.set("spark.sql.shuffle.partitions", "20")

# For very large datasets or complex joins
spark.conf.set("spark.sql.shuffle.partitions", "400")

Rule of thumb: target partitions that are 100–200MB each after the shuffle. Check the Spark UI → Stage Details → Shuffle Read Size to see what you're working with.

spark.databricks.delta.optimizeWrite.enabled

spark.conf.set("spark.databricks.delta.optimizeWrite.enabled", "true")

When writing to Delta, this setting consolidates small output files before writing. Without it, a job with 200 shuffle partitions writes 200 files, many of which may be tiny. With it, Delta coalesces those into fewer, larger files automatically. This is the runtime version of running OPTIMIZE afterward — it just costs a bit of extra time during the write.

Turn it on for any pipeline that writes to Delta tables that are also read frequently. The query performance improvement on reads more than pays back the slight write overhead.

spark.databricks.delta.autoCompact.enabled

spark.conf.set("spark.databricks.delta.autoCompact.enabled", "true")

Automatic compaction runs in the background after writes to compact small files. It's less aggressive than OPTIMIZE — it doesn't do full sorting — but it reduces the small-files problem without requiring you to schedule explicit OPTIMIZE runs. A reasonable default for tables that receive many small writes (streaming tables, frequently-appended tables).

Caching Lookup Tables

# If you join against a small reference table multiple times
region_lookup = spark.table("silver.regions")
region_lookup.cache()
region_lookup.count()  # Force the cache to materialize

# ... multiple operations using region_lookup ...

region_lookup.unpersist()

Caching a 10,000-row lookup table that participates in five joins saves reading it from storage five times. For reference data that doesn't change during a pipeline run, this is almost always worth it. The .count() after .cache() forces the cache to actually materialize — otherwise it's lazy and might not help as much as you expect.

Broadcast Join Threshold

# Default is 10MB — raise if your dimension tables are larger
spark.conf.set("spark.sql.autoBroadcastJoinThreshold", str(50 * 1024 * 1024))  # 50MB

If you're joining a fact table against a dimension table that's 30MB, the default threshold of 10MB means Spark does a shuffle join (expensive). Raising the threshold to 50MB allows Spark to broadcast the dimension table to all executors and skip the shuffle entirely. Check your dimension table sizes and tune accordingly. As always, I'm here to help.

Read more