Kafka 0.9 and the New Consumer API: Why Event Streaming Belongs Under Your Data Pipeline

Shannon Lowder

15 Feb 2016 — 3 min read

Kafka 0.9.0 shipped in November 2015 with a rewritten consumer API, and it changes the conversation about where event streaming fits in a data pipeline architecture. The old consumer API was a Zookeeper-coupled, thread-unsafe mess that made Kafka feel like infrastructure you tolerated rather than infrastructure you built on. The new API is clean, group-aware, and finally makes Kafka a reasonable foundation for production consumers you'd actually trust.

But this post isn't really about the API changes. It's about the design argument those changes now make possible: that batch pipelines and streaming pipelines shouldn't be two separate systems, and that Kafka is the right substrate to unify them.

The Problem with Two Separate Pipelines

The typical shop in early 2016 has two data infrastructures running in parallel. The batch pipeline runs nightly: Sqoop imports from the operational database, Spark transforms the data, Hive lands the results. The streaming pipeline, if it exists, runs Kafka consumers that write aggregates to Redis or HBase for near-real-time dashboards. These two systems share no code, share no operational model, and often don't even share a team.

When the business asks "can we get this metric updated every 15 minutes instead of every night?", the answer is "that's a different system" — which translates to a six-week project, not a configuration change. When the batch pipeline and the streaming pipeline produce different numbers for the same metric, diagnosing the divergence requires understanding both systems.

Nathan Marz called this the Lambda Architecture problem and proposed a specific solution. I think the solution is right: land events in Kafka first, derive everything else from the event log.

Kafka as the Event Backbone

With Kafka as the event backbone, your producers write events to Kafka topics. Everything downstream is a consumer — including your batch pipeline. The Spark Streaming integration (now mature enough for production use with the 0.9 consumer API) lets you write a consumer that processes events in micro-batches, with configurable latency windows from seconds to hours.

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext("yarn", "kafka-session-aggregator")
ssc = StreamingContext(sc, batchDuration=300)  # 5-minute micro-batches

# New consumer API via direct stream — no Zookeeper dependency
kafka_params = {
    "bootstrap.servers": "kafka-broker-1:9092,kafka-broker-2:9092",
    "auto.offset.reset": "largest"
}

events = KafkaUtils.createDirectStream(
    ssc,
    topics=["user-events"],
    kafkaParams=kafka_params
)

page_views = (
    events
    .map(lambda msg: json.loads(msg[1]))
    .filter(lambda e: e.get("event_type") == "page_view")
    .map(lambda e: (e["user_id"], 1))
    .reduceByKey(lambda a, b: a + b)
)

page_views.foreachRDD(lambda rdd: write_to_hive(rdd, partition=current_window()))

ssc.start()
ssc.awaitTermination()

The same events flowing through Kafka also feed a nightly batch job that does a full recomputation over the day's data — a higher-latency, higher-accuracy pass that corrects any micro-batch approximations.

The Replay Capability

This is the part that changes the recovery story. Kafka retains messages for a configurable retention window — typically 7 days. If your consumer has a bug and produces wrong output for two days before anyone notices, you fix the consumer and replay from the point where the bad output started. No manual data reconstruction, no "we'll just wait for tomorrow's batch to fix it."

Sqoop doesn't give you replay. A cron job reading from a database snapshot doesn't give you replay. A Kafka-first architecture does.

What This Costs

Kafka is not cheap to operate. Brokers need disk, the Zookeeper ensemble needs attention, and the consumer group offset management that the new API handles automatically is still something you need to understand when things go wrong. The 0.9 consumer API removes the worst operational friction, but it doesn't make Kafka free.

For teams with modest event volumes running entirely on scheduled batch jobs, adding Kafka is genuine overhead. For teams where latency matters, where data recovery is painful, or where the batch/streaming split is becoming a staffing and maintenance problem — this is the architecture worth building toward.

If you're evaluating Kafka as an event backbone for your pipelines, I'm happy to talk through the sizing and deployment questions. As always, I'm here to help.

Kafka 0.9 and the New Consumer API: Why Event Streaming Belongs Under Your Data Pipeline

Shannon Lowder

The Problem with Two Separate Pipelines

Kafka as the Event Backbone

The Replay Capability

What This Costs

Read more

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

You Don't Need Fable. You Need a Router.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving