Checkpoint Files and Idempotency: Getting Streaming Restarts Right

Checkpoint files are the most misunderstood part of a Structured Streaming pipeline. They are not a cache. They are not a nice-to-have. They are the mechanism that makes exactly-once delivery possible. Delete them and you will either duplicate your data, gap your data, or both, depending on where your job was in its processing cycle when the files disappeared.

This post is about what checkpoints actually do, why idempotent writes matter, and how to build a pipeline that survives restarts gracefully regardless of how they happen.

What Checkpoint Files Contain

A Spark Structured Streaming checkpoint directory contains several subdirectories:

  • offsets/ — the source offset range (Kafka offsets, or file positions for Auto Loader) that each micro-batch will process, written before the batch runs. Indexed by batch ID.
  • commits/ — confirmation that the write for each batch ID completed successfully.
  • metadata — the streaming query's unique ID, which ties the checkpoint to one specific query.
  • state/ — for stateful operations (aggregations, joins). Contains the in-memory state materialized to disk.

On restart, Spark reads offsets/ to find the latest planned batch and commits/ to find the latest batch whose write was confirmed. If batch N is in offsets/ but not in commits/, that batch did not complete. Spark re-runs batch N over exactly the offset range recorded for it (starting where batch N-1 ended) and retries the write.
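The comparison between offsets/ and commits/ can be reproduced by hand when debugging a restart. The following is a minimal sketch, not Spark's own recovery code: it assumes the checkpoint is reachable as a local or mounted filesystem path and that batch files are named by their integer batch ID. The helper names are illustrative.

```python
from pathlib import Path


def batch_ids(directory: Path) -> set[int]:
    """Return the batch IDs recorded in one checkpoint subdirectory."""
    if not directory.is_dir():
        return set()
    # Batch files are named by their integer batch ID; skip anything else
    # (for example hidden .crc checksum files).
    return {int(p.name) for p in directory.iterdir() if p.name.isdigit()}


def uncommitted_batches(checkpoint_dir: str) -> list[int]:
    """Batch IDs that have an offsets entry but no commits entry."""
    root = Path(checkpoint_dir)
    planned = batch_ids(root / "offsets")
    committed = batch_ids(root / "commits")
    return sorted(planned - committed)
```

Calling uncommitted_batches("/mnt/checkpoints/my-job") on a stopped job would list the batches Spark is expected to re-run on the next start. An empty list means the last planned batch was also committed.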

The Exactly-Once Contract

Structured Streaming provides exactly-once end-to-end guarantees only when two conditions hold: the source must be replayable (Kafka and file sources are), and the output sink must support idempotent writes. Delta Lake does this via its transaction log: writing the same batch ID from the same query twice is detected and ignored (the second write is a no-op). Plain Parquet, JDBC, or a custom foreach sink do not have this guarantee unless you build it yourself.

# Delta: idempotent by default
(df.writeStream
    .format("delta")
    .option("checkpointLocation", "/mnt/checkpoints/my-job")
    .start("/mnt/datalake/raw/sensor_readings"))

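from pyspark.sql.functions import lit

# JDBC: NOT idempotent by default
# If the write partially completes and retries, you get duplicates
# Must handle this yourself with a transaction or upsert logic
def write_to_jdbc_idempotently(batch_df, batch_id):
    # Use batch_id as an idempotency key.
    # First delete any rows already written for this batch_id (left behind by
    # a failed attempt). Spark's JDBC writer cannot issue a DELETE, so that
    # step has to run through your own database connection; it is not shown here.
    # Then insert. jdbc_url is assumed to be defined elsewhere.
    (batch_df.withColumn("batch_id", lit(batch_id))
        .write
        .mode("append")
        .jdbc(jdbc_url, table="sensor_readings_staged"))

# Attach with: df.writeStream.foreachBatch(write_to_jdbc_idempotently)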

If you are writing to anything other than Delta, audit your sink for idempotency before going to production. "It worked in testing" is not sufficient — testing rarely exercises the restart path, which is exactly when idempotency is required.
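One way to exercise the restart path in a test is to write the same batch twice and check that the row count does not change. The sketch below uses Python's built-in sqlite3 as a stand-in for the real JDBC target, so it runs without a cluster. The sensor_id and reading columns are hypothetical. The delete and the insert share one transaction, which is the property the real sink would need as well.

```python
import sqlite3


def write_batch_idempotently(conn, batch_id, rows):
    """Replace whatever this batch_id wrote before with the current rows."""
    # `with conn` runs both statements in one transaction: commit on success,
    # rollback on exception, so a failure never leaves a half-written batch.
    with conn:
        conn.execute(
            "DELETE FROM sensor_readings_staged WHERE batch_id = ?", (batch_id,)
        )
        conn.executemany(
            "INSERT INTO sensor_readings_staged (batch_id, sensor_id, reading) "
            "VALUES (?, ?, ?)",
            [(batch_id, sensor_id, reading) for sensor_id, reading in rows],
        )


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE sensor_readings_staged "
    "(batch_id INTEGER, sensor_id TEXT, reading REAL)"
)

batch = [("s1", 20.5), ("s2", 21.0)]
write_batch_idempotently(conn, 7, batch)
write_batch_idempotently(conn, 7, batch)  # simulated retry after a restart

row_count = conn.execute(
    "SELECT COUNT(*) FROM sensor_readings_staged"
).fetchone()[0]
```

If the sink were append-only, the retry would leave four rows instead of two. A test like this catches that before production does.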

Safe Checkpoint Management

Several operations that seem reasonable will break your pipeline:

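Deleting the checkpoint to "reset" a job. If you delete the checkpoint and re-read from the beginning (startingOffsets=earliest), you will write all previously processed records again. Delta's idempotency does not save you here: it is keyed on the query ID and batch ID stored in the checkpoint, and a fresh checkpoint means a new query ID. If your output sink is a Delta table and you did not clear the table, you have duplicates.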

Sharing a checkpoint between two jobs. Two jobs with the same checkpoint location will corrupt each other's offset tracking. Each streaming query needs its own dedicated checkpoint directory.

Moving the checkpoint directory. The checkpoint contains the job's query ID embedded in metadata. Moving the directory to a new path and pointing the job at it sometimes works, but the streaming query can detect the path change and refuse to start. Always use the same path a job was originally configured with.

# Safe approach: checkpoints in a structured, job-named location
CHECKPOINT_BASE = "/mnt/checkpoints"

streaming_jobs = {
    "sensor-raw-ingest":     f"{CHECKPOINT_BASE}/sensor-raw-ingest",
    "sensor-deserialize":    f"{CHECKPOINT_BASE}/sensor-deserialize",
    "sensor-merge-gold":     f"{CHECKPOINT_BASE}/sensor-merge-gold"
}

# Each job gets a unique, stable, never-shared checkpoint path
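A registry like this can also be checked mechanically at deploy time, so a copy-pasted path is caught before two queries ever share a checkpoint. The following is a small sketch of such a check. The function name is illustrative, and the registry is repeated so the example runs on its own.

```python
from collections import defaultdict


def find_shared_checkpoints(jobs: dict[str, str]) -> dict[str, list[str]]:
    """Map each checkpoint path used by more than one job to those job names."""
    by_path = defaultdict(list)
    for job_name, path in jobs.items():
        # Strip trailing slashes so "/a/b" and "/a/b/" count as the same path.
        by_path[path.rstrip("/")].append(job_name)
    return {path: names for path, names in by_path.items() if len(names) > 1}


CHECKPOINT_BASE = "/mnt/checkpoints"
streaming_jobs = {
    "sensor-raw-ingest": f"{CHECKPOINT_BASE}/sensor-raw-ingest",
    "sensor-deserialize": f"{CHECKPOINT_BASE}/sensor-deserialize",
    "sensor-merge-gold": f"{CHECKPOINT_BASE}/sensor-merge-gold",
}

conflicts = find_shared_checkpoints(streaming_jobs)
if conflicts:
    raise ValueError(f"Checkpoint paths shared between jobs: {conflicts}")
```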

When You Need to Truly Reset

Sometimes you genuinely need to reprocess from the beginning — you discovered a bug in Stage 1 that landed bad data, and you need to rebuild the raw zone from scratch. The safe sequence:

  1. Stop the streaming job
  2. Clear or truncate the output table (or write to a new table path)
  3. Delete the checkpoint directory
  4. Set startingOffsets=earliest in the job configuration
  5. Restart — the job will reprocess from the beginning

Do not skip step 2. Deleting the checkpoint without clearing the output is how you get duplicate data in your raw zone, which then propagates through deserialize and merge. Clear the output first, then reset the checkpoint.
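The ordering can be enforced in a small helper, so the checkpoint cannot be removed while output data is still present. This is a sketch under stated assumptions: both locations are local or mounted filesystem paths, the streaming job has already been stopped by the caller, and the function name is hypothetical. On cloud object storage you would swap in your platform's file utilities.

```python
import shutil
from pathlib import Path


def reset_checkpoint(checkpoint_dir: str, output_dir: str) -> None:
    """Delete a checkpoint only once the output location is empty or absent."""
    output = Path(output_dir)
    if output.exists() and any(output.iterdir()):
        raise RuntimeError(
            f"{output_dir} still contains data; clear it before deleting the checkpoint"
        )
    # ignore_errors=True makes the call safe if the checkpoint is already gone.
    shutil.rmtree(checkpoint_dir, ignore_errors=True)
```

With this guard in place, skipping step 2 raises an error instead of silently producing duplicates on the next run.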
