Spark 3.0 Preview: What Adaptive Query Execution Changes for SQL Pros
Spark 3.0 is in preview and the feature I'm watching most closely is Adaptive Query Execution. If it ships as documented, it addresses one of the most common performance frustrations in Spark 2.x: the query optimizer making bad choices because it doesn't know your data's actual distribution at plan time.
The Problem AQE Solves
In Spark 2.x, the query optimizer makes its decisions (join strategy, partition count, skew handling) based on statistics collected when the table was last analyzed. If you ran ANALYZE TABLE a month ago and your data has changed significantly since then, the optimizer is working from stale numbers.
The most common consequence: a shuffle operation produces 200 output partitions (the default) even though 195 of them have almost no data. Your query then runs 200 tasks when 5 would have been sufficient, wasting cluster time on empty work.
What AQE Changes
Adaptive Query Execution monitors actual shuffle output at runtime and adjusts the plan accordingly. Three specific behaviors:
Dynamic partition coalescing — after a shuffle, instead of running all 200 downstream tasks, AQE looks at the actual partition sizes and merges small partitions together. If 150 of your 200 partitions are under 1MB, AQE coalesces them into a smaller number of reasonably-sized tasks.
# With AQE enabled, this setting becomes a maximum, not a fixed value
spark.conf.set("spark.sql.shuffle.partitions", "200")
# AQE may coalesce those 200 down to 20 based on actual data sizeDynamic join strategy switching — the optimizer may plan a sort-merge join, but AQE can switch to a broadcast join at runtime if it discovers that one side of the join is actually small enough to broadcast. This means you get broadcast join performance without having to know your data size at plan time.
Automatic skew handling — if AQE detects that one partition in a join contains significantly more data than others (the classic data skew problem), it splits that partition and applies the join on the splits. The executor that would have been overwhelmed by the skewed partition now processes it in multiple smaller chunks.
How to Enable It
# Available in Spark 3.0 / Databricks Runtime 7.x
spark.conf.set("spark.sql.adaptive.enabled", "true")
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "true")
spark.conf.set("spark.sql.adaptive.skewJoin.enabled", "true")Once Databricks Runtime 7.x ships with Spark 3.0, these will likely be defaults. For now in preview, you enable them explicitly per session or per cluster.
The tuning implication: with AQE, the specific value of spark.sql.shuffle.partitions matters less. You're setting a starting point and AQE adjusts from there. The recommendation is to set it higher than you think you need and let AQE coalesce down, rather than trying to predict the right value yourself. As always, I'm here to help.