Kafka 0.9 and the New Consumer API: Why Event Streaming Belongs Under Your Data Pipeline
Kafka 0.9.0 shipped in November 2015 with a rewritten consumer API, and it changes the conversation about where event streaming fits in a data pipeline architecture. The old consumer API was a Zookeeper-coupled, thread-unsafe mess that made Kafka feel like infrastructure you tolerated rather than infrastructure you built on. The new API is clean, group-aware, and finally makes Kafka a reasonable foundation for production consumers you'd actually trust.
But this post isn't really about the API changes. It's about the design argument those changes now make possible: that batch pipelines and streaming pipelines shouldn't be two separate systems, and that Kafka is the right substrate to unify them.
The Problem with Two Separate Pipelines
The typical shop in early 2016 has two data infrastructures running in parallel. The batch pipeline runs nightly: Sqoop imports from the operational database, Spark transforms the data, Hive lands the results. The streaming pipeline, if it exists, runs Kafka consumers that write aggregates to Redis or HBase for near-real-time dashboards. These two systems share no code, share no operational model, and often don't even share a team.
When the business asks "can we get this metric updated every 15 minutes instead of every night?", the answer is "that's a different system" — which translates to a six-week project, not a configuration change. When the batch pipeline and the streaming pipeline produce different numbers for the same metric, diagnosing the divergence requires understanding both systems.
Nathan Marz called this the Lambda Architecture problem and proposed a specific solution. I think the solution is right: land events in Kafka first, derive everything else from the event log.
Kafka as the Event Backbone
With Kafka as the event backbone, your producers write events to Kafka topics. Everything downstream is a consumer — including your batch pipeline. The Spark Streaming integration (now mature enough for production use with the 0.9 consumer API) lets you write a consumer that processes events in micro-batches, with configurable latency windows from seconds to hours.
from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils
sc = SparkContext("yarn", "kafka-session-aggregator")
ssc = StreamingContext(sc, batchDuration=300) # 5-minute micro-batches
# New consumer API via direct stream — no Zookeeper dependency
kafka_params = {
"bootstrap.servers": "kafka-broker-1:9092,kafka-broker-2:9092",
"auto.offset.reset": "largest"
}
events = KafkaUtils.createDirectStream(
ssc,
topics=["user-events"],
kafkaParams=kafka_params
)
page_views = (
events
.map(lambda msg: json.loads(msg[1]))
.filter(lambda e: e.get("event_type") == "page_view")
.map(lambda e: (e["user_id"], 1))
.reduceByKey(lambda a, b: a + b)
)
page_views.foreachRDD(lambda rdd: write_to_hive(rdd, partition=current_window()))
ssc.start()
ssc.awaitTermination()The same events flowing through Kafka also feed a nightly batch job that does a full recomputation over the day's data — a higher-latency, higher-accuracy pass that corrects any micro-batch approximations.
The Replay Capability
This is the part that changes the recovery story. Kafka retains messages for a configurable retention window — typically 7 days. If your consumer has a bug and produces wrong output for two days before anyone notices, you fix the consumer and replay from the point where the bad output started. No manual data reconstruction, no "we'll just wait for tomorrow's batch to fix it."
Sqoop doesn't give you replay. A cron job reading from a database snapshot doesn't give you replay. A Kafka-first architecture does.
What This Costs
Kafka is not cheap to operate. Brokers need disk, the Zookeeper ensemble needs attention, and the consumer group offset management that the new API handles automatically is still something you need to understand when things go wrong. The 0.9 consumer API removes the worst operational friction, but it doesn't make Kafka free.
For teams with modest event volumes running entirely on scheduled batch jobs, adding Kafka is genuine overhead. For teams where latency matters, where data recovery is painful, or where the batch/streaming split is becoming a staffing and maintenance problem — this is the architecture worth building toward.
If you're evaluating Kafka as an event backbone for your pipelines, I'm happy to talk through the sizing and deployment questions. As always, I'm here to help.