The first time most SQL developers look at Kafka documentation, they hit a wall around partitions. The concept exists in SQL Server too — table partitions, partition functions, partition schemes — but Kafka uses the word to mean something slightly different, and the difference matters when you are trying to figure out why your consumer is getting messages in an unexpected order.
Let me walk through the three core concepts — topics, partitions, offsets — with analogies that map to things you already know.
Topics: The Table Analogy
A Kafka topic is the closest thing to a table. It is the named container for a category of events. You have a topic for order_placed, a topic for inventory_updated, a topic for sensor_reading.
The difference from a table: the topic is append-only, and records are kept in the order they were appended (strictly so within each partition, as covered below). There is no UPDATE, no DELETE, no WHERE clause. You append and you read. If you want the current state of an entity, you either consume all the events and build it yourself, or you use a downstream projection (a database, a Delta table) that has already done that work.
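To make "build it yourself" concrete, here is a minimal sketch of folding a stream of events into current state. The event shapes and the `apply_event` and `build_state` helpers are hypothetical, and the list stands in for messages you would normally receive from a consumer loop. It assumes events for a given order arrive in the order they were written.

```python
# Hypothetical event shapes; real topics define their own schemas.
events = [
    {"type": "order_placed", "order_id": 1001, "total": 129.99},
    {"type": "order_placed", "order_id": 1002, "total": 40.00},
    {"type": "order_fulfilled", "order_id": 1001},
]


def apply_event(state, event):
    """Fold one event into the per-order state dict and return it."""
    order = state.setdefault(event["order_id"], {"status": None, "total": None})
    if event["type"] == "order_placed":
        order["status"] = "placed"
        order["total"] = event["total"]
    elif event["type"] == "order_fulfilled":
        order["status"] = "fulfilled"
    return state


def build_state(event_stream):
    """Replay every event from the start to derive the current state."""
    state = {}
    for event in event_stream:
        apply_event(state, event)
    return state


current = build_state(events)
print(current[1001]["status"])  # fulfilled
```

This is roughly what a downstream projection does for you, with the result persisted somewhere queryable.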
Topics also have a retention policy. With default broker settings, messages are kept for 7 days. You can also configure a size limit, which is off by default; whichever limit is hit first triggers deletion. After that, the messages are gone. This is not an archive. It is a transit buffer with a configurable TTL. Plan accordingly.
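If you would rather set retention explicitly than rely on broker defaults, a sketch along these lines may help. It uses kafka-python's `KafkaAdminClient` and `NewTopic`; the `retention_configs` and `create_topic` helpers, the partition count, and the replication factor are illustrative assumptions, not recommendations.

```python
SEVEN_DAYS_MS = 7 * 24 * 60 * 60 * 1000  # 604,800,000 ms


def retention_configs(retention_ms=SEVEN_DAYS_MS, retention_bytes=-1):
    """Topic-level overrides; config values are passed as strings."""
    return {
        "retention.ms": str(retention_ms),
        # Applies per partition; -1 means no size limit.
        "retention.bytes": str(retention_bytes),
    }


def create_topic(bootstrap_servers, name, num_partitions, replication_factor):
    """Create a topic with explicit retention settings (needs a reachable broker)."""
    # Imported lazily so retention_configs() works without kafka-python installed.
    from kafka.admin import KafkaAdminClient, NewTopic

    admin = KafkaAdminClient(bootstrap_servers=bootstrap_servers)
    try:
        admin.create_topics([
            NewTopic(
                name=name,
                num_partitions=num_partitions,
                replication_factor=replication_factor,
                topic_configs=retention_configs(),
            )
        ])
    finally:
        admin.close()


# Example call; broker address and sizing values are illustrative only:
# create_topic(['kafka-broker-01:9092'], 'order_placed', 8, 3)
```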
Partitions: The Shard Analogy
Each topic is divided into one or more partitions. Think of a partition as a shard. Within a single partition, messages are strictly ordered — every message has a sequence number, and consumers always read in that sequence. Across partitions, there is no ordering guarantee.
This is the part that trips people up. If you publish order_placed for order 1001 to partition 0 and order_fulfilled for order 1001 to partition 1, a consumer reading both partitions will not necessarily see them in the order they were written. Kafka guarantees per-partition ordering, not global ordering.
The standard solution: use the entity key as the partition key. Kafka hashes the message key to determine which partition it goes to. All messages for the same key land in the same partition, so ordering is preserved for that entity.
# Producer: key by order_id to guarantee per-order ordering
from kafka import KafkaProducer
import json

producer = KafkaProducer(
    bootstrap_servers=['kafka-broker-01:9092'],
    key_serializer=lambda k: k.encode('utf-8'),
    value_serializer=lambda v: json.dumps(v).encode('utf-8')
)

producer.send(
    topic='order_placed',
    key='order-1001',  # partition key: all order-1001 events go to the same partition
    value={'order_id': 1001, 'customer_id': 42, 'total': 129.99}
)
producer.flush()
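To see why keying works, here is a simplified model of a keyed partitioner. It is a stand-in only: `zlib.crc32` is used for illustration and is not the hash Kafka's default partitioner uses, and `partition_for` is a hypothetical helper. The principle is the same: a stable hash of the key, modulo the partition count.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Stand-in partitioner: stable hash of the key, modulo partition count."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


keys = ["order-1001", "order-1002", "order-1001", "order-1003", "order-1001"]
assigned = [partition_for(k, 8) for k in keys]

# Every occurrence of the same key maps to the same partition.
same_key = {p for k, p in zip(keys, assigned) if k == "order-1001"}
print(len(same_key))  # 1

# The mapping depends on the partition count, so changing the count moves keys.
moved = [
    k for k in (f"order-{i}" for i in range(1000))
    if partition_for(k, 8) != partition_for(k, 12)
]
print(len(moved) > 0)  # True
```

The second check is the reason resizing a keyed topic deserves care: the same key can land somewhere new once the partition count changes.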
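Partition count also controls your maximum consumer parallelism. If a topic has 8 partitions and you have a consumer group with 10 consumers, 2 of those consumers will be idle because there are not enough partitions to go around. More partitions mean more parallelism, but also more overhead. Start with a count that matches your expected throughput and scale up with careful testing. Be aware that adding partitions to an existing topic changes which partition a given key hashes to, so older and newer events for the same key can end up in different partitions.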
Offsets: The Cursor Analogy
Every message in a partition has an offset — a monotonically increasing integer. Think of it as a sequence number or a rowversion. The first message in a partition is offset 0. The second is offset 1. They never repeat, and they never go backward.
Consumers do not remove messages from a partition when they read them. They track their current offset and advance it after a successful read. The committed offset is stored in a special Kafka topic called __consumer_offsets, keyed by consumer group and topic/partition. It is how Kafka knows where a consumer left off after a restart.
This offset model is what enables replay. A consumer can reset its offset to any point in the retained log — even the beginning — and re-read every message. For rebuilding a derived dataset, this is enormously useful. For recovering from a bug in your transformation logic that corrupted three hours of data, this is the thing that saves your weekend.
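As a rough sketch of the "three hours of bad data" recovery, the helper below rewinds a consumer to the first offset at or after a point in time. It relies on kafka-python's `partitions_for_topic`, `assign`, `offsets_for_times`, `seek`, and `seek_to_end`; the `hours_ago_ms` and `rewind_consumer` names are hypothetical. It assumes the consumer was created without subscribing to a topic, since manual assignment and subscription cannot be mixed, and manual assignment also means the group does not rebalance partitions for you.

```python
import time


def hours_ago_ms(hours, now_ms=None):
    """Return the epoch-millisecond timestamp for `hours` before now."""
    if now_ms is None:
        now_ms = int(time.time() * 1000)
    return now_ms - int(hours * 3600 * 1000)


def rewind_consumer(consumer, topic, hours):
    """Point every partition of `topic` at the first offset from `hours` ago."""
    # Imported lazily so hours_ago_ms() works without kafka-python installed.
    from kafka import TopicPartition

    partition_ids = consumer.partitions_for_topic(topic) or set()
    tps = [TopicPartition(topic, p) for p in sorted(partition_ids)]
    consumer.assign(tps)

    target = hours_ago_ms(hours)
    found = consumer.offsets_for_times({tp: target for tp in tps})
    for tp in tps:
        hit = found.get(tp)
        if hit is None:
            # No message at or after the target time in this partition.
            consumer.seek_to_end(tp)
        else:
            consumer.seek(tp, hit.offset)


# Example (needs a reachable broker; names are illustrative):
# consumer = KafkaConsumer(bootstrap_servers=['kafka-broker-01:9092'],
#                          group_id='order_analytics_replay',
#                          enable_auto_commit=False)
# rewind_consumer(consumer, 'order_placed', hours=3)
```

For a full rebuild, `consumer.seek_to_beginning()` on the assigned partitions does the same job without the timestamp lookup. Either way, replay only reaches as far back as the topic's retention allows.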
# Consumer: commit offsets manually, only after the downstream write succeeds
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    'order_placed',
    bootstrap_servers=['kafka-broker-01:9092'],
    group_id='order_analytics_pipeline',
    auto_offset_reset='earliest',  # if the group has no committed offset, start from the beginning
    enable_auto_commit=False       # commit offsets manually after successful write
)

for message in consumer:
    # Persist the raw bytes; raw_store stands in for your own storage client
    raw_store.write(message.partition, message.offset, message.value)
    # Only commit offset after the write succeeds
    consumer.commit()
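Note enable_auto_commit=False. Auto-commit advances the offset on a timer regardless of whether your downstream write succeeded. If your process crashes between the auto-commit and the write, that message is lost. Manual commit after a confirmed write is the safer default for data pipelines. The trade-off is at-least-once delivery: a crash after the write but before the commit means the message is read again on restart, so the write should tolerate duplicates.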
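One way to make the write safe to repeat is to key it on (partition, offset), which uniquely identifies a message within a topic. The sketch below simulates a redelivery with an in-memory store; `IdempotentStore` is a hypothetical stand-in for whatever sink you use, where the equivalent would be an upsert or a unique constraint on those two columns.

```python
class IdempotentStore:
    """Hypothetical in-memory sink keyed by (partition, offset)."""

    def __init__(self):
        self.rows = {}
        self.write_calls = 0

    def write(self, partition, offset, value):
        self.write_calls += 1
        # A repeated (partition, offset) overwrites instead of duplicating.
        self.rows[(partition, offset)] = value


# Simulated delivery: offset 1 arrives twice, as it would if the process
# crashed after the write but before the commit.
deliveries = [(0, 0, b"a"), (0, 1, b"b"), (0, 1, b"b"), (0, 2, b"c")]

store = IdempotentStore()
for partition, offset, value in deliveries:
    store.write(partition, offset, value)

print(store.write_calls, len(store.rows))  # 4 3
```

Four writes, three rows: the redelivered message costs a little extra work but does not corrupt the data.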
Consumer Groups: The Parallel Worker Analogy
A consumer group is a set of consumers that collectively read a topic. Kafka assigns each partition to exactly one consumer in the group. If you have 4 partitions and 2 consumers in the group, each consumer handles 2 partitions. Add a third consumer and the partitions rebalance — 1 consumer gets 2, the other two get 1 each.
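The arithmetic can be sketched with a simplified round-robin model. This is not one of Kafka's actual assignor implementations, and `assign_partitions` is a hypothetical helper; it only illustrates how partitions spread across group members and why surplus consumers sit idle.

```python
def assign_partitions(num_partitions, consumers):
    """Spread partitions over consumers as evenly as possible (simplified model)."""
    assignment = {c: [] for c in consumers}
    if not consumers:
        return assignment
    for p in range(num_partitions):
        assignment[consumers[p % len(consumers)]].append(p)
    return assignment


two = assign_partitions(4, ["c1", "c2"])
three = assign_partitions(4, ["c1", "c2", "c3"])
crowded = assign_partitions(8, [f"c{i}" for i in range(1, 11)])
idle = [c for c, parts in crowded.items() if not parts]

print(two)    # {'c1': [0, 2], 'c2': [1, 3]}
print(three)  # {'c1': [0, 3], 'c2': [1], 'c3': [2]}
print(idle)   # ['c9', 'c10']
```

Which consumer ends up with which partition depends on the assignor the group is configured with, but the counts work out the same way.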
Different consumer groups are completely independent. The analytics pipeline and the fraud detection system can both read order_placed without interfering — they are in separate groups with separate offset tracking. This is the decoupling win that makes Kafka worth the ops overhead.
Next up in this series: why the raw zone is not optional, and what happens to your pipeline when you skip it and the message format changes on you. Spoiler: it is not pretty. As always, I am here to help.