If your streaming processing requirements are straightforward — filtering, aggregating, joining two topics — you probably do not need a separate compute cluster. Kafka Streams, introduced in Kafka 0.10, is a Java library that runs stream processing inside your application process. No YARN. No separate cluster. No Zookeeper dependency beyond what Kafka already has. You add a dependency to your build file and you are writing stream processing code.
This is a significant operational simplification for moderate use cases. The tradeoff is that Kafka Streams is JVM-only, it does not have the rich SQL interface of Spark, and its operational model (scaling by adding application instances, not by adding cluster nodes) is different enough to require a mental model shift.
The Core Abstractions
KStream is an unbounded stream of records. Every message is an independent event. Joining two KStreams means finding records with the same key within a configurable time window. Aggregating a KStream means accumulating state over a window (tumbling, hopping, or session windows).
KTable is a changelog — a stream where each record represents the current value for a key. The latest record for each key wins. Reading from a KTable gives you the current state; the underlying storage is a RocksDB instance local to the application process. A KTable can be materialized (queryable via interactive queries) or used as a lookup table for stream-table joins.
// Java: basic Kafka Streams topology
StreamsBuilder builder = new StreamsBuilder();
// Source stream: raw sensor readings topic
KStream<String, String> rawReadings = builder.stream("sensor_readings_raw");
// Filter out readings with null or empty payload
KStream<String, String> validReadings = rawReadings.filter(
(sensorId, payload) -> payload != null && !payload.isEmpty()
);
// Branch: high-priority sensors to a fast-path topic, others to standard
Map<String, KStream<String, String>> branches = validReadings.split()
.branch((id, payload) -> HIGH_PRIORITY_SENSORS.contains(id), Branched.as("high-priority"))
.defaultBranch(Branched.as("standard"));
branches.get("sensor_readings_raw-high-priority").to("sensor_readings_priority");
branches.get("sensor_readings_raw-standard").to("sensor_readings_standard");
KafkaStreams streams = new KafkaStreams(builder.build(), config);
streams.start();
Scaling Without a Cluster
Kafka Streams scales by parallelism within the partition model you already have. If your source topic has 8 partitions and you start 4 instances of your Streams application (same application.id), each instance handles 2 partitions. Add more instances and the partitions rebalance automatically. Remove an instance and its partitions are redistributed. No cluster manager involved.
The limitation: you cannot have more active instances than partitions. If your topic has 8 partitions, the 9th instance sits idle. Design your partition count with your scaling needs in mind.
Stateful Processing and Local State
Aggregations and joins require state. Kafka Streams manages this state in local RocksDB stores and backs them up to a compacted Kafka changelog topic. This means state survives application restarts — on restart, the instance replays its changelog topic to rebuild the local store before resuming consumption.
// Counting orders per customer in a 1-hour tumbling window
KTable<Windowed<String>, Long> orderCountsPerCustomer = rawReadings
.groupByKey()
.windowedBy(TimeWindows.ofSizeWithNoGrace(Duration.ofHours(1)))
.count(Materialized.as("order-counts-store"));
// Query the local store interactively (without going through a topic)
ReadOnlyWindowStore<String, Long> store = streams.store(
StoreQueryParameters.fromNameAndType("order-counts-store", QueryableStoreTypes.windowStore())
);
WindowStoreIterator<Long> iterator = store.fetch("customer-42",
Instant.now().minus(Duration.ofHours(1)), Instant.now());
while (iterator.hasNext()) {
KeyValue<Long, Long> next = iterator.next();
System.out.println("Window: " + next.key + ", Count: " + next.value);
}
When Kafka Streams Is the Right Tool
Kafka Streams is the right tool when: your processing logic can be expressed as a topology of Kafka operations (filter, map, join, aggregate), your team is comfortable with Java or Kotlin, you want operational simplicity (no separate compute cluster), and your throughput fits within the partition-based scaling model.
It is not the right tool when: you need SQL-based analytics over streams, you need to join a stream against a very large dataset that does not fit in local state, or you need the rich ecosystem of Spark (MLlib, Spark SQL, pandas integration). In those cases, Structured Streaming is a better fit — and we will get there later in this series. For now, if you are building light to medium stream processing in Java and want to skip the cluster, Kafka Streams is worth a serious look. I am here to help.