Message Queues Aren’t Event Streams: The Distinction That Changes How You Architect Data Pipelines

Your message queue works. Data goes in, data comes out, everyone is happy. Until the day a downstream consumer crashes mid-read, the queue redelivers the message, and now you have a duplicate in your database that accounting is going to find in three weeks and blame on the ETL team. Again.

The root problem is not the crash. It is that you are treating a message queue like an event stream — and they are not the same thing.

What a Message Queue Actually Guarantees

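Classic message queues (RabbitMQ, MSMQ, ActiveMQ) are designed around delivering each message to exactly one consumer, typically with at-least-once semantics: the broker holds a message until a consumer acknowledges it, redelivers it if no acknowledgment arrives, and deletes it once it is acknowledged. You pull a message, you acknowledge it, and it is gone. That is the model. It is great for work distribution: fan out tasks to workers, each task is handled by one worker, everybody wins.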

The fatal flaw for data pipelines: once you acknowledge, the message is gone. If your consumer acks first and then crashes before it writes to the database, you are in a fun state where the queue thinks the message was processed, and your database has no record of it. The truth lives nowhere. Flip the order and ack only after the write, and a crash between the two gets you a redelivery and a duplicate row instead.
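To make the two failure modes concrete, here is a minimal sketch using a toy in-memory queue. ToyQueue, run, and the "order-1" message are all made up for illustration; this is not how any real broker is implemented, only a model of "acked means deleted, unacked means redelivered".

```python
from collections import deque


class ToyQueue:
    """In-memory stand-in for a destructive queue: an acked message is gone."""

    def __init__(self, messages):
        self._ready = deque(messages)
        self._unacked = {}
        self._next_tag = 0

    def get(self):
        """Hand out the next message with a delivery tag, or None if empty."""
        if not self._ready:
            return None
        tag = self._next_tag
        self._next_tag += 1
        body = self._ready.popleft()
        self._unacked[tag] = body
        return tag, body

    def ack(self, tag):
        """Delete the message permanently."""
        del self._unacked[tag]

    def requeue_unacked(self):
        """Model a dead consumer connection: unacked messages are redelivered."""
        for tag in sorted(self._unacked):
            self._ready.append(self._unacked[tag])
        self._unacked.clear()


def run(ack_first):
    queue = ToyQueue(["order-1"])
    database = []

    # First attempt: the consumer crashes between its two steps.
    tag, body = queue.get()
    if ack_first:
        queue.ack(tag)          # acked, then crash before the write
    else:
        database.append(body)   # written, then crash before the ack
    queue.requeue_unacked()

    # Second attempt: a restarted consumer drains whatever is left.
    while (delivery := queue.get()) is not None:
        tag, body = delivery
        database.append(body)
        queue.ack(tag)
    return database


lost = run(ack_first=True)
duplicated = run(ack_first=False)
print(lost)        # []
print(duplicated)  # ['order-1', 'order-1']
```

Neither ordering is safe on its own. Ack-first loses the event; write-first duplicates it unless the write is idempotent.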

And there is no replay. If you need to reprocess the last 24 hours of messages because you discovered a bug in your transformation logic, you are out of luck. The queue already deleted them.

What an Event Stream Guarantees

Apache Kafka, which LinkedIn open-sourced in 2011, operates on a fundamentally different model. Instead of delivering messages to consumers and then deleting them, Kafka appends events to an immutable log called a topic. A topic is split into partitions, and ordering is guaranteed within each partition. Consumers do not acknowledge and remove events; they track their position in the log (called an offset) and read from there.

This means:

  • Multiple consumers can read the same events independently, each at their own pace
  • A consumer can restart from any retained point in the log, including the beginning
  • You can replay the retained history of a topic to rebuild a derived dataset from scratch
  • The producer does not care about consumers at all — it just appends and moves on

For data ingestion pipelines, this is a completely different set of guarantees. The stream is the record of truth. Your database is a derived view of it.
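The properties above can be sketched with a toy append-only log. ToyLog and ToyConsumer are hypothetical names standing in for a single partition and a consumer's offset tracking; real Kafka adds partitions, consumer groups, and durable offset storage on top of this idea.

```python
class ToyLog:
    """Append-only list standing in for a single topic partition."""

    def __init__(self):
        self._events = []

    def append(self, event):
        """Producer side: append and return the event's offset."""
        self._events.append(event)
        return len(self._events) - 1

    def read(self, offset, max_events=10):
        """Consumer side: read from an offset without removing anything."""
        return self._events[offset:offset + max_events]


class ToyConsumer:
    """Tracks its own position; the log knows nothing about it."""

    def __init__(self, log, offset=0):
        self.log = log
        self.offset = offset

    def poll(self, max_events=10):
        batch = self.log.read(self.offset, max_events)
        self.offset += len(batch)
        return batch

    def seek(self, offset):
        self.offset = offset


log = ToyLog()
for event in ["signup", "login", "purchase"]:
    log.append(event)

billing = ToyConsumer(log)
analytics = ToyConsumer(log)

billing_seen = billing.poll()                  # reads all three events
analytics_seen = analytics.poll(max_events=1)  # slower reader, one event so far

analytics.seek(0)                              # replay from the beginning
replayed = analytics.poll()
```

Reading never mutates the log, which is why two consumers can sit at different offsets and why a replay is just a seek.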

The Data Engineering Consequences

When you are ingesting from a message queue, your consumer is stateful in a dangerous way: the moment it acknowledges, it becomes the only record that the event ever happened. Your pipeline code has to be bulletproof, or you lose data.

When you are ingesting from a Kafka topic, you have more flexibility, but you introduce a different class of problem. The topic retains events for a configurable retention period (7 days by default, via the broker setting log.retention.hours=168), but unless you configure it to, it does not retain them forever. And more importantly: the events are serialized. The Kafka message is bytes. Your consumer has to know the format to read it.

That is the first place where pipelines go sideways.

The Raw Zone Is Not Optional

Here is a pattern I have seen broken more times than I can count: the consumer reads a message off the topic, deserializes it directly into a domain object, and writes that to the database. No intermediate storage. Clean, simple, fragile.

It is fragile because your deserialization logic is baked directly into your ingestion path. The moment the message format changes — and it will change, because producers are not monolithic — your consumer either breaks or silently misreads the data.
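Here is a small sketch of the "silently misreads" case. The message shapes and the rename from amount to amount_cents are invented for illustration; the point is only that a tolerant deserializer can keep running and write wrong values without raising anything.

```python
import json


def deserialize_v1(payload: bytes) -> dict:
    """Consumer written against the original format: id plus amount in dollars."""
    doc = json.loads(payload)
    # A "defensive" default that hides missing fields instead of failing.
    return {"id": doc["id"], "amount": doc.get("amount", 0.0)}


old_message = json.dumps({"id": 1, "amount": 19.99}).encode("utf-8")
# Hypothetical producer change: field renamed and unit switched to cents.
new_message = json.dumps({"id": 2, "amount_cents": 1999}).encode("utf-8")

before = deserialize_v1(old_message)
after = deserialize_v1(new_message)
print(before)  # {'id': 1, 'amount': 19.99}
print(after)   # {'id': 2, 'amount': 0.0}  <- no error, wrong data
```

If that second row goes straight into the database and the original bytes are not kept anywhere, the real amount is unrecoverable from the consumer's side.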

There is a better way. Land the raw bytes first. Write the serialized message to storage exactly as it arrived, with a timestamp and whatever metadata the broker gives you. Then deserialize. Then transform.

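# Conceptual: what a raw-first ingestion loop looks like.
# Assumes `consumer` yields records with topic/partition/offset/timestamp/value
# attributes (kafka-python's KafkaConsumer does) and that `raw_store` is a
# placeholder for whatever durable storage you land into.
for message in consumer:
    # Step 1: land raw, exactly as received
    raw_store.write(
        topic=message.topic,
        partition=message.partition,
        offset=message.offset,
        timestamp=message.timestamp,
        payload=message.value,  # raw bytes, untouched
    )
    # Step 2: deserialization happens in a separate job
    # Step 3: transformation happens in yet another job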

Now if your message schema changes and you did not notice until three weeks later, you have options: fix the deserializer and reprocess from the raw store, or go back to the topic (if within retention) and re-land. Without the raw store, once the retention window has passed, you have zero options. You reconstruct from downstream data and hope.
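A minimal sketch of "fix the deserializer and reprocess from the raw store", assuming an in-memory SQLite table as the raw store. The table layout, the land and reprocess helpers, and the amount_cents rename are all hypothetical; in practice the raw store is more likely object storage or a data lake table, but the shape of the idea is the same.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE raw_events (
        topic TEXT, partition_id INTEGER, log_offset INTEGER,
        ts_ms INTEGER, payload BLOB,
        PRIMARY KEY (topic, partition_id, log_offset)
    )
    """
)


def land(topic, partition, offset, timestamp, payload):
    """Write raw bytes untouched; re-landing the same offset is a no-op."""
    conn.execute(
        "INSERT OR IGNORE INTO raw_events VALUES (?, ?, ?, ?, ?)",
        (topic, partition, offset, timestamp, payload),
    )


def reprocess(deserialize):
    """Re-run a deserializer over everything in the raw store, in log order."""
    rows = conn.execute(
        "SELECT payload FROM raw_events ORDER BY topic, partition_id, log_offset"
    )
    return [deserialize(bytes(payload)) for (payload,) in rows]


def deserialize_v1(payload):
    """Original deserializer: silently defaults when 'amount' is missing."""
    doc = json.loads(payload)
    return {"id": doc["id"], "amount": doc.get("amount", 0.0)}


def deserialize_v2(payload):
    """Fixed deserializer: understands both the old and the renamed field."""
    doc = json.loads(payload)
    if "amount_cents" in doc:
        return {"id": doc["id"], "amount": doc["amount_cents"] / 100}
    return {"id": doc["id"], "amount": doc["amount"]}


land("orders", 0, 0, 1700000000000, b'{"id": 1, "amount": 19.99}')
land("orders", 0, 1, 1700000001000, b'{"id": 2, "amount_cents": 1999}')
land("orders", 0, 1, 1700000001000, b'{"id": 2, "amount_cents": 1999}')  # redelivery

wrong = reprocess(deserialize_v1)   # what the buggy pipeline produced
fixed = reprocess(deserialize_v2)   # same raw bytes, corrected logic
```

Keying the raw table on (topic, partition, offset) is a design choice worth noting: it makes landing idempotent, so a redelivered message does not create a second raw row.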

Queues Are Not Wrong — They Are Just Different

None of this is an argument to throw out your message queue. RabbitMQ and MSMQ are the right tool for task distribution, background job processing, and request/reply patterns where one consumer needs to handle one message and confirm it. That is a great model.

It is not a great model for data ingestion at scale, for multi-consumer pipelines, or for anything where replay and historical access matter. For those use cases, you want an event log, and right now Kafka is among the most mature and widely deployed options available.

The next post in this series walks through Kafka's core concepts — topics, partitions, offsets, consumer groups — using analogies that should feel familiar if you have spent time in relational databases. If you have already been burned by a queue-based ingestion pipeline, I think you will find the model refreshing. As always, I am here to help.
