New Spark developers often ask why their code runs so fast, until suddenly it doesn't. They'll chain 10 transformations — filter, select, join, groupBy, withColumn — and each one completes immediately. Then they call .show() and wait 45 seconds. What's happening?
Lazy evaluation. None of those 10 transformations actually ran until the action fired.
Transformations vs. Actions
Every Spark operation falls into one of two categories:
Transformations are lazy — they define what you want to do but don't execute immediately. They return a new DataFrame that represents the logical result of the operation. No data moves. No computation happens. Spark just records the operation in the execution plan.
Transformations include: select, filter, where, groupBy, agg, join, withColumn, drop, orderBy, union, repartition, coalesce, and more.
Actions are eager — they trigger the actual computation across the cluster. Spark takes all accumulated transformations, builds an optimized physical plan, and executes it. Actions return a result to the driver (a count, a list of rows, a write confirmation).
Actions include: show(), count(), collect(), first(), take(n), write.save(), write.saveAsTable(), foreach().
The DAG: What Spark Builds Before Running
df = spark.read.parquet("abfss://container@storage.dfs.core.windows.net/events/") # read (lazy)
filtered = df.filter(F.col("event_type") == "purchase") # lazy
selected = filtered.select("user_id", "amount", "event_date") # lazy
grouped = selected.groupBy("user_id").agg(F.sum("amount")) # lazy
result = grouped.orderBy(F.col("sum(amount)").desc()) # lazy
result.show(20) # ACTION -- execution starts here
When show() fires, Spark hasn't executed a single line of the above. It builds a DAG (Directed Acyclic Graph) from all those lazy transformations, passes it through the Catalyst optimizer, which may reorder operations (push filters earlier, combine selects, choose join strategies), then executes the optimized plan.
Why This Matters: Catalyst Optimization
The reason lazy evaluation exists is optimization. If Spark executed each transformation immediately, it couldn't see the full pipeline and couldn't make global optimization decisions. By accumulating all transformations before executing, the optimizer can:
- Predicate pushdown: Move filters as early as possible — ideally into the file reader, so Spark doesn't even load rows that will be filtered out
- Column pruning: Only read columns that are actually used downstream — critical for Parquet performance
- Join reordering: If one side of a join is small enough for broadcast, Spark can choose that strategy
- Operation fusion: Combine adjacent operations that can run in a single pass over the data
This is the same class of optimization that SQL Server's query optimizer does when you submit a T-SQL query. You don't write execution order — you describe what you want, and the engine figures out the best execution strategy.
Inspecting the Plan
# See the logical and physical plan Spark will execute
result.explain()
# Verbose version with all plan stages
result.explain(True)
# In Databricks, you can also see this in the Spark UI
# under the SQL tab for the executed query
The output shows you the physical plan from bottom to top: the scan at the bottom (reading files), filters pushed into the scan, then transformations, then the final output. If you see a filter that should have been pushed down but appears higher in the plan, that's a signal something is blocking the pushdown (usually a UDF or a computation on the filtered column).
The Debugging Wrinkle
Lazy evaluation changes where exceptions surface. If you write a transformation with a type error — calling a string function on an integer column, for example — the error doesn't appear when you write the transformation. It appears when the action executes. The stack trace points to the action line, not the transformation line.
For complex pipelines, this can be disorienting. Your error is on line 47 but the exception fires at line 82 where you called .write(). Reading the exception message carefully (usually mentions the column name and expected vs actual type) points you to the actual source.
For debugging, a .count() or .show(1) after a specific transformation is a useful way to trigger execution at a specific stage and confirm that stage is working before continuing. In production code, remove those intermediate triggers — each one is an extra full pass over the data.