Kafka Retention, Replay, and Recovery: What Happens When Your Consumer Falls Behind

Kafka's retention model is one of the properties that separates it from a traditional message queue, and it is also one of the properties that most often surprises people the first time their consumer goes down for a few days.

Here is the scenario: your raw landing consumer crashes on a Saturday night. Nobody notices until Monday morning. The Kafka topic has a 7-day retention window, so the messages are still there. You restart the consumer, it picks up from where it left off, and by Monday afternoon you have caught up. Crisis averted.

Now imagine the same scenario, but your retention window is 48 hours. You had a long weekend. By the time you come back Tuesday morning the consumer has been down for roughly 60 hours, so the oldest 12 or so hours of messages are already gone. That is the version where you file an incident report.

How Retention Works

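Kafka retention is configured at the topic level and can be set by time (retention.ms), by size (retention.bytes, which applies per partition), or both. The broker-level defaults, which apply to any topic without its own override, are 7 days with no size limit; the 1 GB figure often quoted next to it is the default segment size, not a retention cap. If you set both limits, whichever is reached first triggers deletion. When a limit is hit, Kafka deletes the oldest log segments: not individual messages, but whole segments (1 GB chunks of the partition log by default). A segment only becomes eligible for time-based deletion once its newest message is older than the retention period, so when retention kicks in you lose time windows, not individual records.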
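To make the segment-level behavior concrete, here is a minimal sketch of the time-based eligibility check. It is a toy model, not Kafka's implementation: the `Segment` class, the `expired_segments` function, and the sample timestamps are all hypothetical, and it ignores size-based retention and the segment currently being written.

```python
from dataclasses import dataclass

HOUR_MS = 60 * 60 * 1000


@dataclass
class Segment:
    base_offset: int          # first offset stored in this segment
    newest_timestamp_ms: int  # timestamp of the newest record in the segment


def expired_segments(segments, now_ms, retention_ms):
    """Return the segments whose newest record is older than retention_ms."""
    return [s for s in segments if now_ms - s.newest_timestamp_ms > retention_ms]


# Hypothetical partition: four closed segments, checked at hour 240 (day 10).
segments = [
    Segment(base_offset=0, newest_timestamp_ms=24 * HOUR_MS),
    Segment(base_offset=1000, newest_timestamp_ms=60 * HOUR_MS),
    Segment(base_offset=2000, newest_timestamp_ms=96 * HOUR_MS),
    Segment(base_offset=3000, newest_timestamp_ms=216 * HOUR_MS),
]
doomed = expired_segments(segments, now_ms=240 * HOUR_MS, retention_ms=7 * 24 * HOUR_MS)
print([s.base_offset for s in doomed])  # [0, 1000]
```

In this example the segment starting at offset 2000 survives even though some of its records are more than 7 days old, because its newest record is still inside the window. Deletion happens a whole segment at a time.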

You can change retention settings on a live topic without downtime:

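# Extend retention to 14 days on a running topic
# (topic configs are altered with kafka-configs.sh; kafka-topics.sh rejects --alter --config when used with --bootstrap-server)
bin/kafka-configs.sh --bootstrap-server kafka-broker-01:9092   --alter --entity-type topics --entity-name order_placed   --add-config retention.ms=1209600000   # 14 days in milliseconds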

# Check current retention settings
bin/kafka-topics.sh --bootstrap-server kafka-broker-01:9092   --describe --topic order_placed

For critical ingestion topics where the raw zone is your safety net, set retention to at least 2-3x your longest expected consumer downtime. If your on-call team is guaranteed to respond within 48 hours, set retention to at least 7 days. Give yourself room.
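The sizing rule above is simple arithmetic, sketched here as a small helper. The function name and the choice to round up to whole days are assumptions made for illustration; the safety factor of 3 is the upper end of the 2-3x guideline.

```python
import math

MS_PER_DAY = 24 * 60 * 60 * 1000


def recommended_retention_ms(max_downtime_hours, safety_factor=3.0):
    """Multiply the longest expected downtime by a safety factor,
    round up to whole days, and return the result in milliseconds."""
    days = math.ceil(max_downtime_hours * safety_factor / 24)
    return days * MS_PER_DAY


print(recommended_retention_ms(48))  # 518400000 (6 days)
```

For a 48-hour response guarantee this gives a floor of 6 days, which the 7-day recommendation clears with a day to spare. The value returned is what you would pass as retention.ms.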

Log Compaction: The Alternative to Time-Based Retention

Time-based retention keeps every message until its time expires. Log compaction is different: it retains at least the most recent message for each key indefinitely and discards older messages for that key in the background. This turns your Kafka topic into something more like a key-value store. You can always read the current state of any keyed entity, but you cannot rely on reading the full history.

Log compaction is useful for topics that represent the current state of something (user profile updates, configuration changes) rather than a history of events (sensor readings, transactions). For data pipeline ingestion, time-based retention is almost always what you want — you need the full history, not just the latest value.
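A minimal sketch of what compaction does to a partition, assuming records are given as (offset, key, value) tuples in offset order. The `compact` function and the sample data are hypothetical, and the real cleaner runs incrementally in the background rather than in one pass.

```python
def compact(log):
    """Keep only the highest-offset record for each key.

    Surviving records keep their original offsets, so the compacted
    log has gaps where superseded records used to be.
    """
    latest = {}
    for offset, key, value in log:
        latest[key] = (offset, key, value)  # later records replace earlier ones
    return sorted(latest.values(), key=lambda record: record[0])


# Hypothetical user-profile topic: user-1 is updated twice.
log = [
    (0, "user-1", "alice@old.example"),
    (1, "user-2", "bob@example"),
    (2, "user-1", "alice@new.example"),
]
print(compact(log))
# [(1, 'user-2', 'bob@example'), (2, 'user-1', 'alice@new.example')]
```

The record at offset 0 is gone for good, which is exactly why compaction is a poor fit for an ingestion topic that needs to be replayed in full.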

Consumer Lag: The Warning Metric

Consumer lag is the number of messages between where a consumer currently is (its committed offset) and where the end of the partition is (the latest offset). Lag of zero means the consumer is caught up. Lag of 100,000 means it has 100,000 messages to process before it reaches the current end of the log.

Lag is the primary health metric for a streaming pipeline. Sustained rising lag means your consumer is not keeping up. You want to alert on lag before it becomes a retention problem: if lag grows until the oldest unconsumed messages are older than the retention window, Kafka deletes them before your consumer ever reads them, and that data is lost.

# Check consumer group lag from the command line
bin/kafka-consumer-groups.sh   --bootstrap-server kafka-broker-01:9092   --group order_raw_landing   --describe

# Output shows: GROUP, TOPIC, PARTITION, CURRENT-OFFSET, LOG-END-OFFSET, LAG
# LAG = LOG-END-OFFSET - CURRENT-OFFSET
# Any partition with LAG > 0 and growing needs attention
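The same arithmetic can be sketched in a few lines for use in an alerting check. Everything here is illustrative: the function names are hypothetical, offsets are passed in as plain dictionaries keyed by partition number, and how you collect them (CLI output, a client library, a metrics exporter) is left open.

```python
def partition_lag(committed, log_end):
    """Lag per partition: log-end offset minus committed offset."""
    return {p: log_end[p] - committed[p] for p in committed if p in log_end}


def growing(previous_lag, current_lag):
    """Partitions whose lag increased between two samples."""
    return sorted(p for p, lag in current_lag.items() if lag > previous_lag.get(p, 0))


def retention_headroom_ms(oldest_unconsumed_ts_ms, now_ms, retention_ms):
    """Time left before the oldest unconsumed message ages past retention."""
    return retention_ms - (now_ms - oldest_unconsumed_ts_ms)


DAY_MS = 24 * 60 * 60 * 1000

# Hypothetical samples: partition 0 is caught up, partition 1 is falling behind.
earlier = partition_lag({0: 1500, 1: 900}, {0: 1500, 1: 40900})
later = partition_lag({0: 1600, 1: 900}, {0: 1600, 1: 100900})
print(later)                    # {0: 0, 1: 100000}
print(growing(earlier, later))  # [1]

# Oldest unconsumed message is 5 days old on a topic with 7-day retention.
print(retention_headroom_ms(0, 5 * DAY_MS, 7 * DAY_MS) // DAY_MS)  # 2
```

Alerting on shrinking headroom rather than on raw lag ties the alarm directly to the failure you care about, which is data aging out before it is read.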

Replay: The Recovery Procedure

Replay is the operation of re-reading messages from an earlier offset. You need it when your downstream write had a bug, when your raw zone had a corruption, or when you are rebuilding a projection from scratch. The procedure is straightforward: reset the consumer group offset to the desired starting point and restart the consumer.

# Reset consumer group to the beginning of the topic
bin/kafka-consumer-groups.sh   --bootstrap-server kafka-broker-01:9092   --group order_raw_landing   --topic order_placed   --reset-offsets --to-earliest   --execute

# Or reset to a specific time (e.g., Saturday midnight)
bin/kafka-consumer-groups.sh   --bootstrap-server kafka-broker-01:9092   --group order_raw_landing   --topic order_placed   --reset-offsets --to-datetime 2014-05-10T00:00:00.000   --execute

Important: the consumer group must be stopped (no active members) before you reset its offsets. The tool only resets offsets for an inactive group and reports an error if consumers are still connected, so shut the consumers down first, run the reset, and then restart them.
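For intuition about what a reset to a point in time resolves to, here is a rough sketch. It assumes two things that you should verify against your Kafka version: that a datetime without a zone suffix is read as UTC, and that the reset lands on the earliest offset whose timestamp is at or after the requested time. The function names and sample records are hypothetical.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def to_epoch_ms(text):
    """Parse 'YYYY-MM-DDTHH:MM:SS.sss' as UTC (an assumption) into epoch milliseconds."""
    parsed = datetime.strptime(text, "%Y-%m-%dT%H:%M:%S.%f").replace(tzinfo=timezone.utc)
    return (parsed - EPOCH) // timedelta(milliseconds=1)


def first_offset_at_or_after(records, target_ms):
    """records: (offset, timestamp_ms) pairs in offset order.
    Return the first offset whose timestamp is >= target_ms, or None."""
    for offset, timestamp_ms in records:
        if timestamp_ms >= target_ms:
            return offset
    return None


target = to_epoch_ms("2014-05-10T00:00:00.000")
print(target)  # 1399680000000

# Hypothetical partition contents around the target time.
records = [(41, target - 5000), (42, target - 10), (43, target + 250), (44, target + 900)]
print(first_offset_at_or_after(records, target))  # 43
```

If the time you care about is in local time, convert it to UTC (or include an explicit offset) before passing it to the reset command, otherwise the replay may start hours away from where you intended.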

Also worth noting: if you replay from a position before messages that have already been written to your raw zone, you will write duplicates. Your raw zone should handle this gracefully, either by deduplicating on the message coordinates (topic + partition + offset is a natural primary key) or by idempotently overwriting existing records.
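A minimal sketch of the idempotent-overwrite approach, using an in-memory dictionary as a stand-in for the raw zone. The `RawZone` class is hypothetical; in a real store the same idea is usually expressed as a primary key or an upsert on (topic, partition, offset).

```python
class RawZone:
    """Toy raw zone keyed by (topic, partition, offset)."""

    def __init__(self):
        self._rows = {}

    def write(self, topic, partition, offset, payload):
        """Store the payload under its Kafka coordinates.

        Returns True if the record is new, False if a replayed record
        overwrote an existing row with the same coordinates.
        """
        key = (topic, partition, offset)
        is_new = key not in self._rows
        self._rows[key] = payload
        return is_new

    def __len__(self):
        return len(self._rows)


zone = RawZone()
batch = [("order_placed", 0, 41, "order-a"), ("order_placed", 0, 42, "order-b")]

first_pass = [zone.write(*record) for record in batch]
replayed = [zone.write(*record) for record in batch]  # same records after an offset reset

print(first_pass, replayed, len(zone))  # [True, True] [False, False] 2
```

Replaying the batch leaves the row count unchanged, so a reset to an earlier offset is safe to run as many times as the recovery needs.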

Retention, lag monitoring, and replay are the operational backbone of a Kafka-based ingestion pipeline. Get comfortable with these tools before you go to production. I am here to help if you want to walk through a recovery scenario on your specific setup.

Shannon Lowder
