Delta Lake Schema Evolution in Practice: What Downstream Queries See When You Add a Column

Shannon Lowder

28 Feb 2020 — 2 min read

Column additions to Delta Lake tables are straightforward. You enable schema evolution, write a DataFrame with the new column, and Delta handles the rest. The part that's less documented is what happens to the systems reading that table — and understanding that is what keeps a schema change from becoming a downstream incident.

What Delta Does When You Add a Column

When you write with mergeSchema set to true, Delta adds the new column to the table's schema as a nullable column, and every existing row reads as NULL for it. The NULLs are logical only: Delta doesn't rewrite the old Parquet files. Readers simply return NULL for the new column whenever a file that predates the change doesn't contain it.

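from pyspark.sql.functions import lit

# orders_df is an existing DataFrame of orders; add the new column with a default of 0.0
df_with_discount = orders_df.withColumn("discount_amount", lit(0.0).cast("decimal(18,4)"))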

df_with_discount.write \
  .format("delta") \
  .mode("append") \
  .option("mergeSchema", "true") \
  .saveAsTable("silver.orders")

From this point on, SELECT * FROM silver.orders returns discount_amount in the result. For rows written after the schema change, it has the actual value. For rows written before, it's NULL.
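To make the "logical NULL" idea concrete, here is a toy model in plain Python. It is not how Delta is implemented; the lists of dicts stand in for Parquet files, and read_table is a hypothetical helper that mirrors the read-time behavior described above.

```python
# Toy model only: plain Python standing in for Parquet files and a table schema.
# The file contents and column names besides discount_amount are made up.

old_file = [{"order_id": 1, "amount": 40.0}, {"order_id": 2, "amount": 15.5}]
new_file = [{"order_id": 3, "amount": 22.0, "discount_amount": 0.0}]

# Table schema after the schema-changing write.
table_schema = ["order_id", "amount", "discount_amount"]


def read_table(files, schema):
    """Project every row onto the current schema, filling absent columns with None."""
    return [
        {column: row.get(column) for column in schema}
        for data_file in files
        for row in data_file
    ]


rows = read_table([old_file, new_file], table_schema)
historical_nulls = sum(1 for row in rows if row["discount_amount"] is None)
```

The old "file" is never modified. The missing column only appears, as None, when the rows are read against the newer schema.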

What Downstream Queries See

Three categories of downstream consumers, and how each is affected:

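SELECT * queries: they now get the new column. With mergeSchema, Delta appends the new column to the end of the schema, so existing positions don't move, but consuming code that does df.toPandas() and then indexes from the end (e.g., row[-1]), unpacks a fixed number of values, or checks the column count will break. If a column is ever added at a specific position instead (an ALTER TABLE ADD COLUMNS statement that uses FIRST or AFTER), every subsequent position shifts as well. This is why column-position access is always wrong and column-name access is always right.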
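A small sketch of why position-based access is fragile, using plain tuples as stand-ins for rows returned by SELECT *. The column names other than discount_amount, the values, and both helper functions are hypothetical.

```python
# Hypothetical rows as a consumer might receive them from SELECT *.
columns_before = ["order_id", "customer_id", "amount"]
columns_after = columns_before + ["discount_amount"]  # appended by the schema change

row_before = (1001, "C-17", 40.0)
row_after = (1001, "C-17", 40.0, 0.0)


def amount_by_position(row):
    # Fragile: assumes "amount" is always the last column.
    return row[-1]


def amount_by_name(row, columns):
    # Robust: looks the column up by name, wherever it sits.
    return dict(zip(columns, row))["amount"]
```

After the change, amount_by_position(row_after) silently returns the discount instead of the amount, while amount_by_name(row_after, columns_after) still returns the amount. The silent part is what makes this dangerous: nothing errors, the numbers are just wrong.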

Aggregations on the new column: SUM, AVG, and similar aggregates skip NULLs automatically. AVG(discount_amount) averages only the non-NULL rows, so as long as every historical row is NULL, the result reflects only the rows written since the schema change. Depending on your use case, this might be correct or might need explicit NULL handling:

-- Treats historical NULLs as zero for the average
SELECT AVG(COALESCE(discount_amount, 0)) AS avg_discount
FROM silver.orders
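To see the difference between the two averages with actual numbers, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for the query engine. SQLite is an assumption made for portability; its AVG follows the same skip-NULLs rule. The table contents are made up.

```python
import sqlite3

# SQLite stands in for the query engine here; the table and values are made up.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (order_id INTEGER, discount_amount REAL)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(1, None), (2, None), (3, 4.0), (4, 8.0)],  # two historical rows, two new rows
)

# First value skips NULLs (averages 2 rows); second treats NULLs as 0 (averages 4 rows).
avg_skipping_nulls, avg_nulls_as_zero = conn.execute(
    "SELECT AVG(discount_amount), AVG(COALESCE(discount_amount, 0)) FROM orders"
).fetchone()
conn.close()
```

With these four rows, the plain AVG is 6.0 and the COALESCE version is 3.0. Neither is wrong on its own; they answer different questions, and the consumer needs to know which one they are asking.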

NOT NULL assertions in downstream code: if any downstream pipeline or data quality check asserts that discount_amount IS NOT NULL, it will now fail for all historical rows. Check your Great Expectations suites or similar data quality definitions before deploying a schema-changing write to production.
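One way to keep a NOT NULL check useful is to scope it to rows written after the change. The sketch below is plain Python, not a Great Expectations API; the order_date column, the cutover date, and the null_violations helper are all assumptions for illustration.

```python
from datetime import date

# Hypothetical cutover: the day the schema-changing write first ran.
CUTOVER = date(2020, 2, 28)


def null_violations(rows, column, cutover=CUTOVER):
    """Return rows written on or after the cutover where `column` is still None."""
    return [
        row for row in rows
        if row["order_date"] >= cutover and row[column] is None
    ]


rows = [
    {"order_id": 1, "order_date": date(2020, 1, 15), "discount_amount": None},  # historical
    {"order_id": 2, "order_date": date(2020, 3, 1), "discount_amount": 0.0},
    {"order_id": 3, "order_date": date(2020, 3, 2), "discount_amount": None},  # real problem
]
violations = null_violations(rows, "discount_amount")
```

Here only order 3 is flagged. The historical NULL on order 1 is expected and passes, so the check keeps catching genuine problems instead of failing on every old row.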

The Safe Pattern

Before adding a column to a production Delta table:

1. Check DESCRIBE HISTORY to understand who has written to this table recently; those teams may be reading it too
2. Search for downstream notebooks and jobs that reference this table (workspace search)
3. Communicate the change, including that historical rows will be NULL for the new column
4. Update downstream data quality checks to handle the nullable column
5. Then run the schema-changing write
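A preflight check can make the "communicate first" step harder to skip. This is a minimal sketch with a hypothetical helper, columns_to_be_added; in Spark, the two lists could come from the table's and the incoming DataFrame's columns attributes. The case-insensitive comparison is an assumption about how you want name matching to behave.

```python
def columns_to_be_added(table_columns, incoming_columns):
    """Return incoming columns the table does not have yet, in incoming order."""
    # Assumption: column names are compared case-insensitively.
    existing = {name.lower() for name in table_columns}
    return [name for name in incoming_columns if name.lower() not in existing]


# Made-up column lists for illustration.
table_columns = ["order_id", "customer_id", "amount"]
incoming_columns = ["order_id", "customer_id", "amount", "discount_amount"]

added = columns_to_be_added(table_columns, incoming_columns)
if added:
    print(f"Schema change pending, notify downstream owners: {added}")
```

A job could refuse to write with mergeSchema enabled until someone has acknowledged the list in added. That turns the coordination step into something the pipeline enforces rather than something people remember.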

Schema evolution is a feature, not a license to change schemas without coordination. The Delta part is easy. The coordination part is the actual work. As always, I'm here to help.
