Apache Kafka is a distributed event log. That is the whole thing. Everything else — the producer API, the consumer API, the partitioning scheme, the replication — is infrastructure built to make that log fast, durable, and accessible to many readers simultaneously.
If that sounds like SQL Server's transaction log, you are not wrong. The mental model translates more directly than you would expect.
Where Kafka Came From
LinkedIn built Kafka in 2010 to solve a specific problem: they had dozens of systems that all needed to consume activity data — clicks, impressions, feed updates — and they were drowning in point-to-point integrations. Every new consumer meant a new integration. Every source change meant updating every consumer.
The solution was a durable, append-only log that producers write to and consumers read from independently. LinkedIn open-sourced it in 2011. By now it is running in enough production environments that the core concepts are worth understanding even if you are not running it yourself yet.
The Core Concepts: A SQL Developer's Translation
Topics are the highest-level organizational unit. Think of a topic as a table, except that the only write operation is append: no record is ever updated in place, and records are removed only when they age out under the retention policy. You have an order_placed topic, a user_login topic, a sensor_reading topic.
Partitions are how Kafka scales. Each topic is split into one or more partitions, and each partition is an ordered, immutable sequence of records. Partitions are how Kafka distributes load across brokers (the servers in the cluster) and across consumers. The analogy to a partitioned table in SQL Server is intentional — same idea, different implementation.
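To make the load-distribution idea concrete, here is a minimal sketch of how a keyed record might be routed to a partition. It is a toy stand-in, not Kafka's actual partitioner: the function name choose_partition, the CRC32 hash, the partition count of 3, and the customer keys are all assumptions for illustration. The only point is that the same key always lands in the same partition.

```python
import zlib

NUM_PARTITIONS = 3  # assumed partition count for this illustration


def choose_partition(key: bytes, num_partitions: int = NUM_PARTITIONS) -> int:
    """Toy partitioner: hash the key, then take it modulo the partition count.

    Real clients use their own hash function. CRC32 is used here only
    because it is deterministic and ships with the standard library.
    """
    return zlib.crc32(key) % num_partitions


# Hypothetical order events, keyed by customer id.
events = [
    (b"customer-17", "order 1001"),
    (b"customer-42", "order 1002"),
    (b"customer-17", "order 1003"),
]

# Route each event to its partition, preserving arrival order within it.
partitions = {p: [] for p in range(NUM_PARTITIONS)}
for key, value in events:
    partitions[choose_partition(key)].append(value)

for p, records in partitions.items():
    print(p, records)
```

Both customer-17 orders end up in the same partition, in the order they were sent, which is what lets a consumer of that partition see them in sequence.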
Offsets are position markers. Each record in a partition has a monotonically increasing offset. Consumers track their offset — where they have read up to — and that is how they resume after a restart. Unlike a message queue, Kafka does not delete a message when a consumer reads it. The consumer just advances its offset. This is the property that makes replay possible.
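The offset mechanics are easier to see in a toy model. The sketch below is an in-memory stand-in for a single partition, not Kafka itself: the PartitionLog class and its append and read_from methods are names I made up for the illustration.

```python
class PartitionLog:
    """Toy in-memory stand-in for a single partition."""

    def __init__(self):
        self._records = []

    def append(self, payload: bytes) -> int:
        """Append a record and return the offset it was assigned."""
        self._records.append(payload)
        return len(self._records) - 1

    def read_from(self, offset: int):
        """Return (offset, payload) pairs starting at the given offset."""
        return [(i, self._records[i]) for i in range(offset, len(self._records))]


log = PartitionLog()
for payload in (b"a", b"b", b"c"):
    log.append(payload)

committed = 0  # the consumer's stored position: the next offset to read
first_pass = log.read_from(committed)
committed = first_pass[-1][0] + 1  # advance past the last record read

log.append(b"d")

# After a restart the consumer resumes from its stored position...
after_restart = log.read_from(committed)
# ...but nothing was deleted, so a replay from offset 0 is still possible.
replay = log.read_from(0)

print(after_restart)  # [(3, b'd')]
print(len(replay))    # 4
```

Reading never mutates the log. The only state that changes on the consumer side is the single integer it carries around.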
Consumer groups are how you scale consumption. Multiple consumers in the same group share the partitions of a topic: partition 0 goes to consumer A, partition 1 goes to consumer B, and each partition is read by only one consumer in the group at a time. If you want multiple independent systems to read the same topic, put them in different consumer groups. Each group gets its own independent offset tracking and sees every event.
# Simple Python consumer using kafka-python
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'order_placed',
    bootstrap_servers=['kafka-broker-01:9092'],
    group_id='order_ingestion_pipeline',
    auto_offset_reset='earliest'  # start from beginning if no offset stored
)

# raw_store is a placeholder for your own landing-zone writer; it is not defined here.
for message in consumer:
    # message.value is raw bytes; schema is YOUR problem
    raw_bytes = message.value
    partition = message.partition
    offset = message.offset

    # Land raw first. Deserialize in a separate pass.
    raw_store.write(
        topic='order_placed',
        partition=partition,
        offset=offset,
        payload=raw_bytes
    )
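
The consumer-group rules described above can also be sketched without a broker. This is a simplified model under stated assumptions: the round-robin assign_partitions helper, the consumer names, and the audit_log_writer group are hypothetical, and real clusters pick assignments with their own configurable strategies.

```python
def assign_partitions(partitions, consumers):
    """Spread a topic's partitions across one group's consumers, round-robin."""
    assignment = {c: [] for c in consumers}
    for i, p in enumerate(partitions):
        assignment[consumers[i % len(consumers)]].append(p)
    return assignment


topic_partitions = [0, 1, 2, 3]

# Two consumers in one group split the partitions between them.
ingestion = assign_partitions(topic_partitions, ["consumer-A", "consumer-B"])

# A second group with a single consumer gets every partition to itself.
audit = assign_partitions(topic_partitions, ["audit-1"])

# Each group tracks its own offset per partition, independently of the other.
group_offsets = {
    "order_ingestion_pipeline": {p: 0 for p in topic_partitions},
    "audit_log_writer": {p: 0 for p in topic_partitions},
}
group_offsets["order_ingestion_pipeline"][0] = 120  # this group has read ahead

print(ingestion)  # {'consumer-A': [0, 2], 'consumer-B': [1, 3]}
print(audit)      # {'audit-1': [0, 1, 2, 3]}
print(group_offsets["audit_log_writer"][0])  # 0
```

One group moving ahead does nothing to the other group's position, which is why adding a new downstream system is just a matter of picking a new group id.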
Why SQL Developers Should Pay Attention
The Kafka model solves something that has been a latent pain in ETL work for years: source coupling.
In a typical batch setup, your SSIS package or stored procedure reaches directly into the source OLTP database. Your schedule is tied to the source system's availability. Your load patterns put read pressure on the source during business hours. When the source schema changes, your package breaks. This is the normal order of things, and it is annoying.
With Kafka in the middle, the source system publishes events to a topic. Your pipeline consumes from the topic. The source system does not know your pipeline exists. The pipeline does not care when the source system is handling peak user traffic. The topic is the contract between them.
This also means you can have multiple consumers — a real-time fraud detection system, a batch analytics pipeline, an audit log writer — all reading the same event stream independently, without touching each other or the source. In a queue model, you would need separate integrations from the source for each. With Kafka, you add a consumer group.
What Kafka Does Not Give You
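No query language. Messages are bytes, and Kafka has no idea what is in them. You get ordered delivery within a partition, not globally across all partitions. Default retention is time-based (7 days), not infinite: events are not stored forever unless you extend or disable the retention limit. Log compaction is a different trade. It keeps the most recent record for each key indefinitely, but it discards the older records for that key, so it is not a full history.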
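The ordering caveat is the one that tends to surprise people, so here is a small illustration. The two lists stand in for two partitions of one hypothetical topic, and the sequence numbers exist only in this sketch so the ordering can be inspected. Kafka does not attach a global sequence to records.

```python
# Each tuple is (global_send_sequence, payload).
partition_0 = [(1, "order A created"), (3, "order A paid"), (5, "order A shipped")]
partition_1 = [(2, "order B created"), (4, "order B paid")]

# A consumer that happens to drain partition 1 before partition 0.
seen = partition_1 + partition_0

sequence = [seq for seq, _ in seen]
print(sequence)  # [2, 4, 1, 3, 5]

# Within each partition the order is intact; across partitions it is not.
in_order_p0 = [s for s, _ in partition_0] == sorted(s for s, _ in partition_0)
in_order_p1 = [s for s, _ in partition_1] == sorted(s for s, _ in partition_1)
globally_ordered = sequence == sorted(sequence)
print(in_order_p0, in_order_p1, globally_ordered)  # True True False
```

If the relative order of two events matters to you, they need to end up in the same partition.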
Schema management is entirely your problem. If a producer changes the message format and nobody tells the consumers, your deserializer breaks. Silently, if your error handling is not tight. This is the argument for schema registries, and for landing raw bytes to storage before deserializing — which I covered in the previous post. The two ideas are related.
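Here is roughly what "tight" error handling could look like in the deserialization pass. This is a sketch under assumptions, not a prescribed approach: it assumes JSON payloads, and the required field names, the deserialize and process functions, and the sample payloads are all invented for the example.

```python
import json

REQUIRED_FIELDS = {"order_id", "amount"}  # assumed schema for illustration


def deserialize(raw_bytes: bytes) -> dict:
    """Decode one payload, raising ValueError on anything unexpected."""
    try:
        record = json.loads(raw_bytes.decode("utf-8"))
    except (UnicodeDecodeError, json.JSONDecodeError) as exc:
        raise ValueError(f"undecodable payload: {exc}") from exc
    if not isinstance(record, dict):
        raise ValueError("payload is not a JSON object")
    missing = REQUIRED_FIELDS - set(record)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return record


def process(payloads):
    """Split payloads into parsed records and rejects kept with a reason."""
    good, rejected = [], []
    for raw in payloads:
        try:
            good.append(deserialize(raw))
        except ValueError as exc:
            # Keep the raw bytes and the reason instead of dropping the record.
            rejected.append((raw, str(exc)))
    return good, rejected


payloads = [
    b'{"order_id": 1, "amount": 9.5}',
    b'{"id": 2, "total": 4.0}',  # producer renamed its fields
    b"\xff\xfe not json",        # not valid UTF-8
]
good, rejected = process(payloads)
print(len(good), len(rejected))  # 1 2
```

Because the raw bytes are kept alongside the rejection reason, a format change shows up as a growing reject pile rather than as missing rows nobody notices.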
Is the Ops Overhead Worth It Right Now?
In 2012, running your own Kafka cluster means running ZooKeeper alongside it, which is its own maintenance surface. That is a real cost if you do not have infrastructure support. A small team doing primarily batch ETL work probably does not need to take this on today.
But the pattern — append-only event log, independent consumer groups, replay capability — is worth understanding even if your current implementation is a SQL Server table with a CreatedAt timestamp and a watermark query. The concept scales. The SQL table does not.
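For reference, the table-plus-watermark version of the pattern looks something like this. The sketch uses Python's built-in sqlite3 as a stand-in for SQL Server so it runs anywhere. The order_events table, its columns other than CreatedAt, and the read_since helper are assumptions for illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE order_events ("
    "  EventId INTEGER PRIMARY KEY AUTOINCREMENT,"
    "  Payload TEXT NOT NULL,"
    "  CreatedAt TEXT NOT NULL)"
)
conn.executemany(
    "INSERT INTO order_events (Payload, CreatedAt) VALUES (?, ?)",
    [
        ("order 1001 placed", "2012-03-01T09:00:00"),
        ("order 1002 placed", "2012-03-01T09:05:00"),
        ("order 1003 placed", "2012-03-01T09:10:00"),
    ],
)


def read_since(conn, watermark):
    """Return rows created after the watermark, oldest first."""
    return conn.execute(
        "SELECT EventId, Payload, CreatedAt FROM order_events "
        "WHERE CreatedAt > ? ORDER BY CreatedAt, EventId",
        (watermark,),
    ).fetchall()


watermark = "2012-03-01T09:00:00"  # last CreatedAt this reader has processed
batch = read_since(conn, watermark)
if batch:
    watermark = batch[-1][2]  # advance to the newest row in the batch

print(len(batch))  # 2
print(watermark)   # 2012-03-01T09:10:00
```

The watermark plays the role of the offset and the table plays the role of the log. One known weakness of this version: with a strict greater-than comparison, a row that arrives later with the same CreatedAt as the watermark is skipped, which is a problem a per-partition offset does not have.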
Next up in this series: MQTT — the lightweight protocol running the IoT sensors that are starting to show up in client environments — and what to do with the data stream it produces. If you are already thinking about high-frequency, high-volume message ingestion, or you have questions about the consumer model, reach out. I am here to help.