The first version of almost every streaming pipeline is a single job: read from Kafka, deserialize, apply business logic, write to the serving layer. This is fine at low volume. It starts breaking down around the point where a single cluster cannot keep up with the throughput — and the breakdown is not always what you expect.
The naive assumption is that the problem is parallelism: add more executors, add more cores, and the job should scale linearly. Sometimes it does. But for pipelines that do non-trivial work per message — deserialization, schema validation, merge logic, SCD2 lookups — the bottleneck is CPU utilization per record, not cluster size. Adding nodes to a CPU-bound job that is processing every record through three expensive operations in a single pass does not help as much as you would hope.
The Monolith Problem
Consider a pipeline that, for every Kafka message:
- Reads raw bytes from the topic
- Deserializes Avro or JSON to a structured record
- Validates the record against business rules
- Looks up current state from the gold layer (a Delta table)
- Computes the diff (what changed since the last version)
- Writes a MERGE to the gold table
Steps 4, 5, and 6 are expensive. Step 4 requires a Delta table read (or a broadcast join if the table is small enough to fit in memory). Step 5 is CPU-intensive. Step 6 is a shuffle-heavy operation. Run all six steps in a single streaming query and you have a job that is compute-intensive, memory-intensive, and I/O-intensive simultaneously.
Throwing more executors at this job helps — but less than you expect, because the bottleneck is not throughput, it is the per-record cost of the merge logic. The cluster is working hard on a small number of records rather than moving a large number of records quickly.
The Separation Solution
Split the monolith into three separate jobs, each doing less:
Job 1: Raw Ingest — reads from Kafka, writes raw bytes to Delta (raw zone). Does nothing else. This job is fast and cheap — it moves bytes and commits offsets. CPU utilization is low. It can run as an always-on streaming job or as a frequent micro-batch.
Job 2: Deserialize — reads from the raw zone Delta table, parses and validates records, writes structured records to the bronze layer. Failures go to a dead letter table. This job is CPU-intensive (deserialization, validation) but not I/O-intensive — it reads from Delta and writes to Delta. Runs as a micro-batch; can be independently scaled.
Job 3: Merge — reads from the bronze layer, performs the diff computation, and merges into the gold layer. This job is I/O-intensive (Delta reads, shuffles, merge writes) but is operating on clean, already-structured data. Runs as a micro-batch with an interval tuned to your latency requirement.
# Job 3: merge structured records into gold layer
from pyspark.sql.functions import col, max as spark_max
from delta.tables import DeltaTable
bronze_new = spark.readStream .format("delta") .option("checkpointLocation", "/mnt/checkpoints/merge-job") .table("bronze.sensor_readings")
gold_table = DeltaTable.forName(spark, "gold.sensor_readings")
def merge_batch(batch_df, batch_id):
# Delta MERGE: update existing, insert new
gold_table.alias("gold").merge(
batch_df.alias("new"),
"gold.sensor_id = new.sensor_id AND gold.reading_date = new.reading_date"
).whenMatchedUpdateAll() .whenNotMatchedInsertAll() .execute()
bronze_new.writeStream .foreachBatch(merge_batch) .option("checkpointLocation", "/mnt/checkpoints/merge-job") .trigger(processingTime="15 minutes") .start()
Why This Works
Each job has a single bottleneck: Job 1 is network I/O bound (Kafka throughput), Job 2 is CPU bound (deserialization), Job 3 is Delta I/O bound (merge). You can tune each independently — scale up Job 2's executor count if deserialization is slow, increase the merge batch interval on Job 3 if the gold table is under contention.
Failures are isolated. A bug in the deserializer crashes Job 2, not Job 1 or Job 3. Raw messages keep landing (Job 1 keeps running). Existing gold records are not touched. Fix the deserializer, redeploy Job 2, and it reprocesses from where it left off. Job 3 picks up the new bronze records on its next run. No human intervention required beyond the code fix.
This architecture is more moving parts than a single job. It is also significantly more resilient, independently scalable, and debuggable. I will get into the cost comparison in the next post. I am here to help if you want to think through the split for your specific pipeline.