Pipeline Architecture for High-Volume Kafka: When One Job Is Not Enough

The first version of almost every streaming pipeline is a single job: read from Kafka, deserialize, apply business logic, write to the serving layer. This is fine at low volume. It starts breaking down around the point where a single cluster cannot keep up with the throughput — and the breakdown is not always what you expect.

The naive assumption is that the problem is parallelism: add more executors, add more cores, and the job should scale linearly. Sometimes it does. But for pipelines that do non-trivial work per message — deserialization, schema validation, merge logic, SCD2 lookups — the bottleneck is CPU utilization per record, not cluster size. Adding nodes to a CPU-bound job that is processing every record through three expensive operations in a single pass does not help as much as you would hope.

The Monolith Problem

Consider a pipeline that, for every Kafka message:

  1. Reads raw bytes from the topic
  2. Deserializes Avro or JSON to a structured record
  3. Validates the record against business rules
  4. Looks up current state from the gold layer (a Delta table)
  5. Computes the diff (what changed since the last version)
  6. Writes a MERGE to the gold table

Steps 4, 5, and 6 are expensive. Step 4 requires a Delta table read (or a broadcast join if the table is small enough to fit in memory). Step 5 is CPU-intensive. Step 6 is a shuffle-heavy operation. Run all six steps in a single streaming query and you have a job that is compute-intensive, memory-intensive, and I/O-intensive simultaneously.

Throwing more executors at this job helps — but less than you expect, because the bottleneck is not throughput, it is the per-record cost of the merge logic. The cluster is working hard on a small number of records rather than moving a large number of records quickly.

The Separation Solution

Split the monolith into three separate jobs, each doing less:

Job 1: Raw Ingest — reads from Kafka, writes raw bytes to Delta (raw zone). Does nothing else. This job is fast and cheap — it moves bytes and commits offsets. CPU utilization is low. It can run as an always-on streaming job or as a frequent micro-batch.

Job 2: Deserialize — reads from the raw zone Delta table, parses and validates records, writes structured records to the bronze layer. Failures go to a dead letter table. This job is CPU-intensive (deserialization, validation) but not I/O-intensive — it reads from Delta and writes to Delta. Runs as a micro-batch; can be independently scaled.

Job 3: Merge — reads from the bronze layer, performs the diff computation, and merges into the gold layer. This job is I/O-intensive (Delta reads, shuffles, merge writes) but is operating on clean, already-structured data. Runs as a micro-batch with an interval tuned to your latency requirement.

# Job 3: merge structured records into gold layer
from pyspark.sql.functions import col, max as spark_max
from delta.tables import DeltaTable

bronze_new = spark.readStream     .format("delta")     .option("checkpointLocation", "/mnt/checkpoints/merge-job")     .table("bronze.sensor_readings")

gold_table = DeltaTable.forName(spark, "gold.sensor_readings")

def merge_batch(batch_df, batch_id):
    # Delta MERGE: update existing, insert new
    gold_table.alias("gold").merge(
        batch_df.alias("new"),
        "gold.sensor_id = new.sensor_id AND gold.reading_date = new.reading_date"
    ).whenMatchedUpdateAll()      .whenNotMatchedInsertAll()      .execute()

bronze_new.writeStream     .foreachBatch(merge_batch)     .option("checkpointLocation", "/mnt/checkpoints/merge-job")     .trigger(processingTime="15 minutes")     .start()

Why This Works

Each job has a single bottleneck: Job 1 is network I/O bound (Kafka throughput), Job 2 is CPU bound (deserialization), Job 3 is Delta I/O bound (merge). You can tune each independently — scale up Job 2's executor count if deserialization is slow, increase the merge batch interval on Job 3 if the gold table is under contention.

Failures are isolated. A bug in the deserializer crashes Job 2, not Job 1 or Job 3. Raw messages keep landing (Job 1 keeps running). Existing gold records are not touched. Fix the deserializer, redeploy Job 2, and it reprocesses from where it left off. Job 3 picks up the new bronze records on its next run. No human intervention required beyond the code fix.

This architecture is more moving parts than a single job. It is also significantly more resilient, independently scalable, and debuggable. I will get into the cost comparison in the next post. I am here to help if you want to think through the split for your specific pipeline.

Read more