Backpressure in Spark Streaming: Reading Kafka at the Rate Your Pipeline Can Handle
Spark Structured Streaming can read from Kafka indefinitely — theoretically. The practical limit is how fast the downstream pipeline can process incoming messages relative to how fast the upstream producers are sending them. When those rates diverge, you have a backpressure problem: more data arriving than the consumer can handle, queues growing, lag increasing, eventually pipeline failure.
Structured Streaming has mechanisms for this. Understanding them is the difference between a streaming pipeline that degrades gracefully under load and one that falls over.
What Backpressure Looks Like
The signal is consumer group lag — the difference between the latest offset on the Kafka topic and the consumer's current offset. A lag that's stable or decreasing is fine. A lag that's consistently growing means your pipeline is not keeping up. Left unaddressed, it will grow until the pipeline is minutes or hours behind real time, which usually defeats the purpose of streaming in the first place.
Rate Limiting with maxOffsetsPerTrigger
from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()
# Read from Kafka with a rate cap
orders_stream = (
spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "kafka-broker:9092")
.option("subscribe", "orders.raw")
.option("startingOffsets", "latest")
.option("maxOffsetsPerTrigger", 50000) # Process at most 50K records per micro-batch
.load()
)
maxOffsetsPerTrigger is a ceiling, not a floor. If only 1,000 records arrived since the last micro-batch, you process 1,000. If 200,000 arrived, you process 50,000 and the rest wait for the next trigger. This keeps individual micro-batch processing time predictable and prevents the executor from being overwhelmed by a sudden burst.
Choosing the Right Cap
The cap should be set based on how long you can tolerate a single micro-batch taking. If your micro-batch processing time starts exceeding your trigger interval (the time between batches), lag will accumulate. The math:
# Rough sizing guide
# Target: micro-batch completes in < trigger_interval
# Measure how long a micro-batch takes at different record counts
# (run with small cap first, observe timing in Spark UI)
# If 10K records takes ~30 seconds and trigger interval is 30s:
# - 10K is about right as a cap
# - Higher cap risks batches taking longer than the interval
# - Lower cap means you're not using full cluster capacity
# For variable load (quiet periods + traffic spikes):
# - Let the cap absorb the spike, not the cluster
spark.readStream.option("maxOffsetsPerTrigger", 10000)
The Trigger Interval
from pyspark.sql.streaming import DataStreamWriter
# Default: trigger as fast as possible (no sleep between batches)
# This maximizes throughput but also maximizes resource consumption
# Fixed interval trigger: wait N seconds between batch starts
query = (
orders_stream
.writeStream
.format("delta")
.option("checkpointLocation", "/mnt/checkpoints/orders")
.trigger(processingTime="30 seconds") # Batch every 30s
.start()
)
# Trigger once — run a single batch and stop
# Useful for scheduled "micro-batch" workloads
query = (
orders_stream
.writeStream
.format("delta")
.option("checkpointLocation", "/mnt/checkpoints/orders")
.trigger(once=True)
.start()
.awaitTermination()
)
Monitoring Lag
The streaming query's progress object exposes the offset information you need to monitor lag. Log it, alert on it, track it over time:
import json
while query.isActive:
progress = query.lastProgress
if progress:
sources = progress.get('sources', [])
for source in sources:
end_offset = source.get('endOffset', {})
start_offset = source.get('startOffset', {})
# endOffset - startOffset = records processed this batch
num_records = progress.get('numInputRows', 0)
batch_duration = progress.get('batchDuration', 0)
print(f"Batch {progress['batchId']}: {num_records} records in {batch_duration}ms")
import time
time.sleep(30)
The combination of maxOffsetsPerTrigger and a fixed trigger interval gives you a pipeline that processes data at a predictable rate, doesn't blow up during traffic spikes, and falls behind gracefully when load exceeds capacity — then catches up when the spike subsides. That's the behavior you want in production. As always, I'm here to help.