Building a Cost-Aware Streaming Architecture: Lessons After a Decade of Kafka

I have been writing Kafka-based pipelines since 2012. The tooling in 2022 is dramatically better — Delta Lake, Databricks Auto Loader, Unity Catalog, Schema Registry, the Event Hubs Kafka protocol. The architecture principles, on the other hand, have changed remarkably little. The things that made pipelines fail in 2013 still make pipelines fail today. The things that made them resilient then still make them resilient now.

This post is a retrospective on what has actually mattered — the decisions that determined cost and reliability in every streaming engagement I have worked on over the past decade.

Lesson 1: Land Raw Before You Interpret

Every year I think this principle will stop being necessary to repeat. Every year I work with a team that skipped it and is suffering the consequences. The raw zone is not overhead. It is the foundation that everything else is built on: with the original bytes stored untouched, a parsing bug or an upstream schema change is a reprocessing job. Without a raw zone, you are one schema change away from a data repair project.

The cost of maintaining a raw zone in 2022 is negligible. The cloud object storage underneath a Delta Lake table costs pennies per GB per month. Auto Loader handles much of the operational complexity. There is no longer a credible argument that a raw zone is not worth the overhead. Land raw first, always.
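To make "land raw" concrete, here is a minimal sketch of what a raw-zone record might carry. The names (RawRecord, land_raw, original_payload) and the JSON-lines layout are my own illustrative assumptions, not a Delta Lake or Auto Loader API. The point is only that the payload is stored as opaque bytes next to its source coordinates, and nothing is parsed at landing time.

```python
import base64
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass(frozen=True)
class RawRecord:
    """One message exactly as received, plus its source coordinates."""
    topic: str
    partition: int
    offset: int
    timestamp_ms: int
    key_b64: Optional[str]  # base64 so the record is safe to write as JSON text
    value_b64: str


def _b64(data: bytes) -> str:
    return base64.b64encode(data).decode("ascii")


def land_raw(topic: str, partition: int, offset: int, timestamp_ms: int,
             key: Optional[bytes], value: bytes) -> RawRecord:
    """Wrap the payload without deserializing it; interpretation happens downstream."""
    return RawRecord(topic, partition, offset, timestamp_ms,
                     _b64(key) if key is not None else None, _b64(value))


def original_payload(record: RawRecord) -> bytes:
    """Recover the exact bytes that arrived, for replay or re-parsing."""
    return base64.b64decode(record.value_b64)


# A truncated JSON payload would break a parser, but it still lands intact.
truncated = b'{"order_id": 17, "amount": '
record = land_raw("orders", 0, 42, 1_650_000_000_000, None, truncated)
raw_line = json.dumps(asdict(record))  # one line in a hypothetical raw-zone file
```

Because the landing step never interprets the payload, a schema change upstream cannot make it fail. The broken message is a problem for the deserialize step, which can be fixed and rerun against the stored bytes.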

Lesson 2: Separate Your Workloads

The monolith is tempting because it is simpler to reason about on day one. One job, one cluster, one monitoring dashboard. But a streaming pipeline that does ingestion, deserialization, and merge in a single job is three workloads with different resource profiles, failure modes, and scaling characteristics crammed into one unit. None of the three can be tuned, allowed to fail, or recovered independently of the others.

Three separate jobs mean more moving parts. They are also independently scalable, independently debuggable, and independently redeployable. When the deserialize job needs a schema fix, the ingest job keeps landing data. When the merge job needs tuning, the deserialization backlog stays stable. Each job can be sized and scheduled for its specific workload, which is why the three-job architecture almost always costs less than the equivalent monolith despite running more individual jobs.
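The isolation argument can be shown with a toy model. This is a sketch under heavy assumptions: in-memory lists stand in for the topic and the tables, and the Stage class is a hypothetical name, not a Spark or Databricks construct. Only the ingest and deserialize jobs are modeled; a merge job would follow the same pattern. What matters is that each stage keeps its own checkpoint, so a failure in one does not stop the one upstream of it.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, List


@dataclass
class Stage:
    """A job that reads `source` from its own checkpoint and appends to `sink`."""
    name: str
    source: List[Any]
    sink: List[Any]
    transform: Callable[[Any], Any]
    checkpoint: int = 0  # index of the next unprocessed item in `source`

    def run_once(self) -> int:
        processed = 0
        while self.checkpoint < len(self.source):
            self.sink.append(self.transform(self.source[self.checkpoint]))
            self.checkpoint += 1  # advance only after a successful write
            processed += 1
        return processed


topic = [b'{"id": 1}', b"not json", b'{"id": 3}']
raw_zone: List[bytes] = []
parsed: List[dict] = []

ingest = Stage("ingest", topic, raw_zone, lambda value: value)  # lands bytes untouched
deserialize = Stage("deserialize", raw_zone, parsed, json.loads)

ingest.run_once()
try:
    deserialize.run_once()  # stops at the malformed second message
except json.JSONDecodeError:
    pass  # this job now needs a fix; its checkpoint stays where it failed

topic.append(b'{"id": 4}')
ingest.run_once()  # ingest is unaffected and keeps landing new data
```

After the failure, the deserialize checkpoint still points at the bad message while the raw zone holds all four messages. Once the parser is fixed, the deserialize job resumes from its own checkpoint without anything being re-ingested.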

Lesson 3: Micro-Batch Is Almost Always Enough

After ten years of streaming pipelines, I can count on one hand the number of use cases that genuinely required sub-minute latency. Most analytics, reporting, and machine learning feature pipelines are fine with 5-, 15-, or 30-minute freshness. The assumption that "streaming = always-on consumer" has cost clients enormous amounts of money for latency they never needed.

Before you commit to an always-on streaming cluster, ask the question: what is the actual latency requirement, and who is the human or system that will notice if data is 15 minutes stale instead of 15 seconds stale? If you cannot name them, micro-batch is probably your architecture.
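The cost side of that question is simple arithmetic. The sketch below compares billed cluster hours for an always-on consumer against a scheduled micro-batch. Every number in it is an illustrative assumption: the hourly rate is made up, and the 3-minute run time is assumed to include cluster startup.

```python
def daily_cluster_hours(interval_minutes: float, run_minutes: float) -> float:
    """Billed hours per day for a job that runs once per interval."""
    runs_per_day = 24 * 60 / interval_minutes
    return runs_per_day * run_minutes / 60


ALWAYS_ON_HOURS = 24.0
HOURLY_RATE = 4.00  # hypothetical cluster cost in dollars per hour

# A run every 15 minutes that takes 3 minutes end to end (assumed figures).
micro_batch_hours = daily_cluster_hours(interval_minutes=15, run_minutes=3)

always_on_cost = ALWAYS_ON_HOURS * HOURLY_RATE
micro_batch_cost = micro_batch_hours * HOURLY_RATE
savings_fraction = 1 - micro_batch_cost / always_on_cost

print(f"micro-batch: {micro_batch_hours:.1f} cluster hours/day")
print(f"always-on:   {ALWAYS_ON_HOURS:.1f} cluster hours/day")
print(f"savings:     {savings_fraction:.0%}")
```

Under these assumptions the micro-batch job is billed for roughly a fifth of the hours. The ratio shifts with run time and interval, so plug in measured numbers before drawing conclusions for a real workload.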

Lesson 4: Monitor Consumer Lag, Not Just Pipeline Success

A pipeline that "ran successfully" but took 45 minutes to process 15 minutes of data is falling behind. A pipeline that completed in 3 minutes but missed 10,000 messages because they aged out of the topic's retention window before being read is losing data. Pipeline job status (success/failure) tells you whether the job ran. Consumer lag, the gap between the latest offset in each partition and the offset the consumer has committed, tells you whether the pipeline is keeping pace. Both metrics are necessary. Lag is the one that gets set up last and matters most.
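Both failure modes above reduce to offset arithmetic per partition. The sketch below assumes the three offset maps have already been fetched from the broker by whatever client you use. The function names are hypothetical, and treating a partition with no committed offset as committed at 0 is a simplifying assumption.

```python
from typing import Dict

Offsets = Dict[int, int]  # partition -> offset


def partition_lag(end_offsets: Offsets, committed: Offsets) -> Offsets:
    """Messages written to each partition that the consumer has not yet committed."""
    return {p: max(end - committed.get(p, 0), 0) for p, end in end_offsets.items()}


def expired_unread(beginning_offsets: Offsets, committed: Offsets) -> Offsets:
    """Messages deleted by retention before the consumer reached them."""
    return {p: max(begin - committed.get(p, 0), 0)
            for p, begin in beginning_offsets.items()}


# Illustrative numbers only.
end_offsets = {0: 1500, 1: 900}        # next offset to be written
beginning_offsets = {0: 1250, 1: 100}  # earliest offset still retained
committed = {0: 1200, 1: 900}          # where the consumer group stands

lag = partition_lag(end_offsets, committed)
lost = expired_unread(beginning_offsets, committed)
total_lag = sum(lag.values())
```

Here partition 0 is 300 messages behind, and 50 of those have already been deleted by retention, so they can never be read. A job that resumed from the earliest available offset would still report success, which is why the second check deserves its own alert.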

Lesson 5: The Dead Letter Queue Is Not Optional

Every production streaming pipeline will eventually encounter a message it cannot process. The question is not whether this will happen, but whether your pipeline handles it gracefully or crashes and blocks. A dead letter queue that captures the raw message, the error, and the source position gives you the ability to investigate and replay without data loss. A pipeline without one will eventually grind to a halt on a single malformed message at 2am.
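A dead letter queue can be sketched in a few lines. This is a minimal illustration assuming messages arrive as (topic, partition, offset, value) tuples; DeadLetter and process_batch are hypothetical names, and a real pipeline would write the dead letters to a durable table or topic instead of returning a list.

```python
import json
from dataclasses import dataclass
from typing import Any, Callable, Iterable, List, Tuple

Message = Tuple[str, int, int, bytes]  # (topic, partition, offset, value)


@dataclass(frozen=True)
class DeadLetter:
    """Everything needed to investigate and replay a failed message."""
    topic: str
    partition: int
    offset: int
    raw_value: bytes  # the untouched payload
    error: str


def process_batch(messages: Iterable[Message],
                  handler: Callable[[bytes], Any]) -> Tuple[List[Any], List[DeadLetter]]:
    processed: List[Any] = []
    dead_letters: List[DeadLetter] = []
    for topic, partition, offset, value in messages:
        try:
            processed.append(handler(value))
        except Exception as exc:
            # Broad on purpose: any per-message failure is captured, not raised,
            # so one bad message cannot block the rest of the batch.
            dead_letters.append(DeadLetter(topic, partition, offset, value,
                                           f"{type(exc).__name__}: {exc}"))
    return processed, dead_letters


batch = [
    ("orders", 0, 10, b'{"id": 1}'),
    ("orders", 0, 11, b"not json"),
    ("orders", 0, 12, b'{"id": 3}'),
]
processed, dead_letters = process_batch(batch, json.loads)
```

The malformed message at offset 11 ends up in dead_letters with its raw bytes, error text, and source position, while the messages on either side are processed normally. The dead letter count is worth alerting on, since a sudden spike usually means a schema change rather than one stray message.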

The Constants

Kafka has gone from a niche LinkedIn project to the de facto standard for event streaming. Delta Lake has changed what "reliable streaming output" looks like. Cloud managed services (Event Hubs, Confluent Cloud) have made Kafka accessible to teams without infrastructure engineering capacity. The ecosystem has matured significantly.

The five lessons above are the same lessons I would have written in 2015. Better tooling makes them easier to implement, not less relevant. The pipelines that follow them work. The pipelines that skip them create incidents.

If you are starting a new streaming pipeline or rethinking an existing one, I am happy to think through the architecture with you. After a decade of this, the patterns are clear and the tradeoffs are predictable. As always, I am here to help.
