Delta Lake Write Modes: Append, Overwrite, Merge, and When to Use Each

When you write to a Delta table, you have four modes to choose from. Most people discover them by trial and error — they try overwrite, wonder why their data disappeared, then slowly piece together when each mode is appropriate. This post is the thing I wish I'd read before that happened to me.

The Four Write Modes

# Append: add rows to the table, never touch existing rows
df.write.format("delta").mode("append").saveAsTable("orders")

# Overwrite: replace all existing rows with the new data
df.write.format("delta").mode("overwrite").saveAsTable("orders")

# Error (default): fail if the table already exists
df.write.format("delta").mode("error").saveAsTable("orders")

# Ignore: do nothing if the table already exists
df.write.format("delta").mode("ignore").saveAsTable("orders")

Or using Spark SQL:

-- Append
INSERT INTO orders SELECT * FROM staging_orders

-- Overwrite entire table
INSERT OVERWRITE orders SELECT * FROM staging_orders

Append: For Incremental Loads

Append adds new rows without touching what's already there. It's the right choice for:

  • Event logs and immutable transaction tables
  • Incremental loads where you're adding new periods of data
  • Streaming writes where data arrives continuously

The catch: append doesn't enforce uniqueness. If you run the same pipeline twice, you get duplicate rows. Make sure your upstream process guarantees idempotency or handle deduplication downstream.

Overwrite: For Full Refreshes

Overwrite replaces the entire table contents. It's the right choice for:

  • Dimension tables that are fully rebuilt on each run
  • Aggregated summary tables recalculated from scratch
  • Any table where you want to guarantee the current run's data is authoritative

Overwrite in Delta behaves differently from overwrite in plain Parquet. Plain Parquet overwrite deletes all existing files and writes new ones — if the write fails partway through, the table is empty or partially written. Delta overwrite is atomic: it creates new data files and swaps them in via the transaction log in a single commit. The old version is still accessible via time travel until VACUUM removes it.

Overwrite With Schema

If your new DataFrame has a different schema than the existing table, a plain overwrite will fail by default. To allow schema changes during an overwrite:

df.write.format("delta") \
  .mode("overwrite") \
  .option("overwriteSchema", "true") \
  .saveAsTable("orders")

This replaces both the data and the schema. Use it deliberately — downstream queries that depended on the old schema will break.

Merge: For Upserts

For the pattern SQL Server DBAs use most — update existing rows, insert new ones — Delta has MERGE:

from delta.tables import DeltaTable

target = DeltaTable.forName(spark, "orders")

target.alias("t").merge(
    staging_df.alias("s"),
    "t.order_id = s.order_id"
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

MERGE is covered in more depth in its own post, but the short version: it's MERGE in T-SQL with slightly different syntax, and the join predicate should match on a column that's part of your partition scheme if you want it to perform well.

Choosing the Right Mode

The decision is usually straightforward:

  • New rows only → append
  • Full rebuild from source → overwrite
  • Update existing + insert new → MERGE
  • Table should only exist once → error (default, use for first-run initialization)

The one that catches people is using overwrite when they meant append. If you're running an incremental load and you accidentally use overwrite, you just replaced your historical data with today's data. Delta's time travel means you can recover from that — but it's better not to have to. As always, I'm here to help.

Read more