Kafka at Scale: The Hidden Cost of the Monolithic Consumer

Consumer group lag is the canary in the coal mine for a Kafka pipeline under load. When lag starts rising — slowly, then faster — it usually means one of two things: the source volume increased, or the consumer is doing too much per message to keep up. The first is a capacity problem. The second is an architecture problem. They require different fixes, and confusing one for the other is expensive.

Most teams experiencing rising lag add resources: a bigger cluster, more executors, a higher parallelism setting. Sometimes this works. Sometimes you get a temporary improvement before lag starts rising again. Sometimes it makes no difference at all. When more resources do not help, you have an architecture problem disguised as a capacity problem.

Where the CPU Goes in a Monolithic Pipeline

A pipeline that does raw ingest, deserialization, and merge in a single pass is spending CPU on at least three classes of work simultaneously:

Deserialization CPU: parsing Avro or JSON, validating field types, applying business rules. This scales mainly with message count rather than total bytes, because every message pays the per-record parsing overhead. One million small messages cost one million deserializations.
Shuffle CPU: for MERGE operations, Delta needs to match incoming records against existing records. This requires a join, which requires a shuffle. Shuffle CPU scales with data volume and partition skew.
Merge I/O: reading current gold table state, writing new files, updating the transaction log. This competes with other jobs that need to read the same table.

When you add more executors to a cluster running all three workloads, you are not necessarily helping the bottleneck. Suppose each message requires expensive parsing, but every batch also has to wait for the merge I/O to complete. The executors you add spend most of that wait idle, so the batch barely gets faster. You are paying for compute you are not using effectively.
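To make that concrete, here is a deliberately crude model, not a measurement. It assumes the deserialization and shuffle CPU divide evenly across executors while the merge I/O does not speed up at all. Real jobs are messier than that, and the function name and all the numbers are made up for illustration.

```python
def batch_seconds(executors, deser_cpu_s, shuffle_cpu_s, merge_io_s):
    """Toy model: CPU work divides across executors, merge I/O does not."""
    if executors < 1:
        raise ValueError("executors must be >= 1")
    return (deser_cpu_s + shuffle_cpu_s) / executors + merge_io_s


# Hypothetical micro-batch: 40 s of deserialization CPU,
# 20 s of shuffle CPU, 30 s of merge I/O.
for n in (4, 8, 16):
    t = batch_seconds(n, deser_cpu_s=40.0, shuffle_cpu_s=20.0, merge_io_s=30.0)
    print(f"{n:>2} executors: {t:.2f} s per batch")
```

Under these assumptions, going from 4 to 16 executors quadruples the compute bill and takes the batch from 45 seconds to 33.75. The 30 seconds of merge I/O is a floor that no executor count moves.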

Measuring the Problem

Before adding resources, measure where the time is actually going. Spark's UI shows per-stage timing. For a streaming query, look at the breakdown between stages:

If the Kafka read stage dominates: you have a throughput problem. More partitions or a faster ingest path helps.
If the deserialization/validation stage dominates: you have a CPU problem. Separating deserialization into its own job (where you can tune just that workload) helps.
If the merge stage dominates: you have an I/O problem. Reducing merge frequency, improving partition key design, or optimizing the MERGE predicate helps.
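If you want to keep yourself honest about which case you are in, a few lines of Python will do. This is a sketch that assumes you have copied per-stage durations out of the Spark UI by hand. The stage labels, the `ADVICE` table, and the timings are all hypothetical; Spark does not name its stages this way.

```python
# Advice text paraphrases the three cases above; keys are hypothetical labels.
ADVICE = {
    "kafka_read": "throughput problem: more partitions or a faster ingest path",
    "deserialize": "CPU problem: split deserialization into its own job and tune it",
    "merge": "I/O problem: merge less often, revisit partition keys, tighten the MERGE predicate",
}


def dominant_stage(stage_seconds):
    """Return (stage, share) for the stage taking the largest share of total time."""
    total = sum(stage_seconds.values())
    if total <= 0:
        raise ValueError("stage timings must sum to a positive number")
    stage = max(stage_seconds, key=stage_seconds.get)
    return stage, stage_seconds[stage] / total


# Hypothetical timings in seconds for one micro-batch
timings = {"kafka_read": 3.0, "deserialize": 21.0, "merge": 6.0}
stage, share = dominant_stage(timings)
print(f"{stage} takes {share:.0%} of the batch -> {ADVICE[stage]}")
```

One batch is a weak sample. Run it over several batches before you decide which fix to pay for.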

# Check consumer group lag from command line
bin/kafka-consumer-groups.sh \
  --bootstrap-server kafka-broker-01:9092 \
  --group monolithic-sensor-pipeline \
  --describe

# Healthy output: LAG = 0 or small and stable
# Unhealthy output: LAG growing over successive runs
# Very unhealthy: LAG growing faster than you are consuming

# TOPIC                PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
# sensor_readings_raw  0          4823991         5102443         278452
# sensor_readings_raw  1          4800112         5098876         298764
# sensor_readings_raw  2          4811203         5099901         288698
# Lag of ~280K per partition and growing: this pipeline cannot keep up
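A single run only tells you how far behind you are, not whether it is getting worse. Here is a small sketch for comparing two runs. It assumes the five-column layout shown above (your Kafka version may print extra columns such as GROUP, in which case the field indexes need adjusting). `FIRST_RUN` is the sample above; `SECOND_RUN` is invented to stand in for a run taken 60 seconds later.

```python
def parse_lag(describe_output):
    """Map (topic, partition) -> lag, computed as LOG-END-OFFSET - CURRENT-OFFSET."""
    lags = {}
    for line in describe_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and rows with no committed offset ("-")
        if len(fields) < 4 or not all(f.isdigit() for f in fields[1:4]):
            continue
        topic = fields[0]
        partition, current, end = (int(f) for f in fields[1:4])
        lags[(topic, partition)] = end - current
    return lags


def lag_growth_per_second(earlier, later, seconds_apart):
    """Average change in total lag between two snapshots, in messages per second."""
    return (sum(later.values()) - sum(earlier.values())) / seconds_apart


FIRST_RUN = """\
TOPIC                PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
sensor_readings_raw  0          4823991         5102443         278452
sensor_readings_raw  1          4800112         5098876         298764
sensor_readings_raw  2          4811203         5099901         288698
"""

# Hypothetical second run, 60 seconds later (made-up numbers)
SECOND_RUN = """\
TOPIC                PARTITION  CURRENT-OFFSET  LOG-END-OFFSET  LAG
sensor_readings_raw  0          4853991         5138443         284452
sensor_readings_raw  1          4830112         5134876         304764
sensor_readings_raw  2          4841203         5135901         294698
"""

first, second = parse_lag(FIRST_RUN), parse_lag(SECOND_RUN)
print("total lag:", sum(first.values()))
print("growth per second:", lag_growth_per_second(first, second, 60))
```

A positive growth rate that persists across runs is the signal that matters. A large but flat lag is a backlog; a growing one is a pipeline that cannot keep up.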

The Architecture Pattern That Fixes It

Separate the workloads. The raw ingest job does one thing: read Kafka and write raw bytes to a Delta table. It is almost always fast enough to keep lag near zero because it does almost nothing per message.

The deserialize job reads from the raw Delta table and does the CPU-intensive parsing work. Its throughput can be independently tuned. If it falls behind, you scale it independently without touching the ingest job.

The merge job does the I/O-intensive gold table update on its own schedule. Its merge frequency can be tuned for the gold table's I/O budget without affecting the ingest or deserialize stages.
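A toy simulation shows what the split changes. It is a sketch under generous assumptions: every rate is a fixed number of messages per tick, and the monolith is allowed to run at the speed of its slowest step, which flatters it because a real single-pass job pays for all three steps in sequence. All names and rates here are hypothetical.

```python
def simulate(ticks, produced, ingest_rate, deserialize_rate, merge_rate, split):
    """Return (kafka_lag, raw_backlog, parsed_backlog) after `ticks` intervals.

    Rates are messages per tick. With split=False one job does every step,
    so it drains Kafka no faster than its slowest step.
    """
    kafka_lag = raw_backlog = parsed_backlog = 0
    for _ in range(ticks):
        kafka_lag += produced
        if split:
            # Ingest job: Kafka -> raw table
            moved = min(kafka_lag, ingest_rate)
            kafka_lag -= moved
            raw_backlog += moved
            # Deserialize job: raw table -> parsed records
            parsed = min(raw_backlog, deserialize_rate)
            raw_backlog -= parsed
            parsed_backlog += parsed
            # Merge job: parsed records -> gold table
            parsed_backlog -= min(parsed_backlog, merge_rate)
        else:
            kafka_lag -= min(kafka_lag, ingest_rate, deserialize_rate, merge_rate)
    return kafka_lag, raw_backlog, parsed_backlog


# Hypothetical rates: parsing (800/tick) is slower than the source (1000/tick)
rates = dict(produced=1000, ingest_rate=5000, deserialize_rate=800, merge_rate=2000)
print("monolith:", simulate(60, split=False, **rates))
print("split:   ", simulate(60, split=True, **rates))

# Scale only the deserialize job; ingest and merge are untouched
rates["deserialize_rate"] = 1200
print("split, deserialize scaled:", simulate(60, split=True, **rates))
```

Notice that splitting does not make the slow parsing disappear. With these numbers the same backlog exists either way; it moves from Kafka consumer lag into the raw table, where you can see exactly which job is behind and scale that job alone.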

Three jobs, three independent bottlenecks, three independent scaling levers. In the next posts, we will look at what this looks like in Databricks and how the total cost compares. The short version: more jobs, lower total spend, better performance. It is not intuitive until you work through the math. I am here to help.

Shannon Lowder