The Spark Execution Model for SQL Pros: What Really Happens When You Run a Query
You've spent years reading execution plans. You know what a nested loop join costs. You can spot a missing index from the fat red arrow pointing the wrong direction. When something is slow in SQL Server, you click "Include Actual Execution Plan" and the answer is usually in there somewhere.
Then you run your first Databricks query and open the Spark UI. There are no operators. There are no estimated cost percentages. Instead there's something called a DAG with circles and arrows, and a column that says "Shuffle Read: 2.1 GB." You're back to being a junior developer, but in your forties.
The mental model is genuinely different, but it's not harder — it's just different vocabulary. Let me translate.
The SQL Server Model You Already Know
In SQL Server, the query optimizer takes your SQL, picks a plan (or pulls one from plan cache), and hands it to the execution engine. The engine runs operators in a pull-based pipeline: each operator pulls rows from the operator below it, transforms them, and hands them up. The execution plan shows you this tree — Hash Match here, Nested Loops there, Sort at the top before the Select.
The key thing: SQL Server is fundamentally single-machine (even with MAXDOP parallelism, you're distributing work across cores on one box). The data lives in shared memory. There's no network hop between operators.
The Spark Model
Spark runs distributed across multiple machines — workers in a cluster. The unit of work is partitioned: your table isn't one thing, it's hundreds or thousands of chunks processed independently on different executors.
When you submit a query, Spark builds a DAG (Directed Acyclic Graph) — a dependency graph of transformations. The DAG is broken into stages. A stage is a set of operations that can run without moving data between machines. When data does need to move between machines — a join, a GROUP BY, a sort — that's a shuffle, and shuffles mark stage boundaries.
Each stage is broken into tasks. One task per partition, running on one executor. If you have 400 partitions in a stage, you get up to 400 tasks running in parallel.
The hierarchy:
- Job — triggered by one action (a write, a
.show(), a.count()) - Stage — a group of tasks that don't require a shuffle between them
- Task — one partition's worth of work within a stage
- Shuffle — data movement between stages across the network; marks stage boundaries
The SQL Server Analogy That Actually Holds
Think of a shuffle like a SQL Server Sort + Hash Match + TempDB spill, all at once, but over a network. It's the expensive thing. The SQL Server optimizer minimizes nested loops for small row counts to avoid sorts — in Spark, you're minimizing shuffles for the same reason.
Stage boundaries in the Spark UI map roughly to the "heavyweight" operators in a SQL Server plan. When you see the stage count go from 2 to 8, something is shuffling a lot of data.
Reading EXPLAIN in Spark SQL
Spark SQL has an EXPLAIN command and it's more useful than it looks at first:
EXPLAIN SELECT
o.customer_id,
COUNT(*) AS order_count,
SUM(o.amount) AS total_amount
FROM orders o
JOIN customers c ON o.customer_id = c.customer_id
WHERE c.region = 'West'
GROUP BY o.customer_id;Add EXTENDED to see the full physical plan, or COST for estimated statistics. The output reads bottom-up (like SQL Server's execution plan), with the deepest scan at the bottom and the final projection at the top.
Key operators to recognize:
FileScan— reading from a table. Check "PushedFilters" — predicates pushed down to the file reader, roughly equivalent to a seek vs. a full scan.Exchange hashpartitioning— a shuffle. This is your expensive-thing marker. If you see it where you don't expect it, you have a data distribution problem.SortMergeJoin— a join that required shuffling both sides first. Expensive. Compare toBroadcastHashJoin, which sends the smaller table to all workers and avoids the shuffle entirely.HashAggregate— partial aggregation local to each partition before the shuffle, then a final pass after. Spark being smart about reducing shuffle volume for GROUP BY.
The Spark UI: Where to Actually Look
In Databricks, find the Spark UI link in the cluster panel during or after a run. Go to the Stages tab. What to look for per stage:
- Shuffle Read / Write size — if this is gigabytes for what should be a simple aggregation, you have a data skew or partition count problem.
- Task duration variance — if most tasks finish in 50ms and one takes 4 minutes, that's a skewed partition. One worker is doing most of the work. This is the distributed-system equivalent of a parallelism-killing implicit conversion in SQL Server.
- Spill — Spark writing intermediate data to disk mid-stage means executor memory is too small for the partition size. Same concept as TempDB spill, same consequence: dramatic slowdown.
What .cache() Does (and When It Matters)
In SQL Server, buffer pool caching is automatic and managed by the engine. In Spark, caching is explicit. When you call .cache() on a DataFrame or run CACHE TABLE orders in Spark SQL, you're telling Spark to materialize that data in executor memory so downstream operations don't re-read from storage.
This matters when you're referencing the same DataFrame multiple times in a session. Without caching, Spark re-reads the files from scratch for each action. For a large table you join against three times, that's three full file scans. Cache it once.
The tradeoff: cached data consumes executor memory. Cache too aggressively on a small cluster and you starve the executors doing actual computation. Unpersist explicitly when you're done:
orders_df.cache()
# ... multiple operations referencing orders_df ...
orders_df.unpersist()The Mental Model Shift
In SQL Server: optimize for memory (buffer pool hits), minimize sort operations, eliminate unnecessary scans. In Spark: optimize for shuffles (minimize data movement between machines), balance partition sizes, watch for skew.
The analogy holds more than it breaks. The vocabulary is different but the instincts transfer. When something is slow in Databricks, the answer is almost always in the Stages tab — look at shuffle size, look at task duration variance, look for spill. It's execution plan reading with different terms.
Once you can navigate the Spark UI confidently, the rest of the performance conversation — partitioning strategy, Z-ordering, cluster sizing — starts to make a lot more sense. We'll dig into each of those over the coming months.
As always, I'm here to help.