Designing the Three-Stage Streaming Pipeline: Ingest, Deserialize, Transform

The three-stage streaming pipeline is not a new idea — it is the streaming equivalent of a well-understood batch ETL pattern. In batch, you stage raw files before loading them. In streaming, you land raw messages before deserializing them. The principle is the same; the implementation details differ because the data is arriving continuously instead of in scheduled drops.

I keep coming back to this pattern because I keep seeing pipelines built without it. A single job that reads from Kafka, deserializes the message, applies business logic, and writes to the serving layer — all in one pass. It works until the schema changes, or the transformation logic has a bug, or the serving layer is briefly unavailable. Then everything breaks at once and you do not know which part failed.

Stage 1: Raw Ingest

The raw ingest job has one responsibility: read messages from the Kafka topic and write them to durable storage, unchanged. No parsing. No transformation. No validation beyond "is this a valid byte sequence."

What "durable storage" looks like depends on your stack. In 2015, common choices are:

Azure Blob Storage or ADLS (especially for Event Hubs pipelines)
HDFS (for on-premises Hadoop clusters)
A staging SQL Server table with a raw_payload NVARCHAR(MAX) column
A local filesystem with a reliable backup (not recommended for anything critical)

The key properties of your raw store: it must be durable (survives machine failures), it must be queryable by arrival time (so stage 2 can find the messages it has not yet processed), and it must be append-only (so you can reprocess without risk of overwriting what you are reprocessing from).

-- Raw zone table in SQL Server (works for moderate volumes)
CREATE TABLE kafka_raw_landing (
    id              BIGINT IDENTITY(1,1) PRIMARY KEY,
    source_topic    NVARCHAR(256)   NOT NULL,
    partition_id    INT             NOT NULL,
    partition_offset BIGINT         NOT NULL,
    arrived_at      DATETIME2       NOT NULL DEFAULT SYSUTCDATETIME(),
    raw_payload     NVARCHAR(MAX)   NOT NULL,
    processing_status NVARCHAR(32) NOT NULL DEFAULT 'pending',
    UNIQUE (source_topic, partition_id, partition_offset)  -- idempotent writes
);

The UNIQUE constraint on topic/partition/offset means you can safely replay the ingest job — duplicate messages are rejected rather than double-landed.

Stage 2: Deserialize

The deserialize job reads from the raw zone, parses messages into structured records, and writes them to a structured staging area. This is where schema enforcement happens. This is where you catch format changes. This is where bad messages go to a dead letter store rather than crashing the pipeline.

-- Structured staging table (populated by the deserialize job)
CREATE TABLE sensor_readings_staged (
    raw_id          BIGINT          NOT NULL,  -- FK back to raw_landing
    sensor_id       NVARCHAR(64)    NOT NULL,
    reading_value   DECIMAL(18,4)   NOT NULL,
    reading_unit    NVARCHAR(16)    NOT NULL,
    reading_ts      DATETIME2       NOT NULL,
    deserialized_at DATETIME2       NOT NULL DEFAULT SYSUTCDATETIME()
);

CREATE TABLE deserialize_errors (
    raw_id          BIGINT          NOT NULL,
    error_message   NVARCHAR(MAX)   NOT NULL,
    failed_at       DATETIME2       NOT NULL DEFAULT SYSUTCDATETIME()
);

The deserialize job runs as a micro-batch — every N minutes, process all pending rows in kafka_raw_landing. Successes move to sensor_readings_staged. Failures move to deserialize_errors with the error message. Both outcomes update processing_status in the raw table so the record is not re-processed on the next run.

When the schema changes and the deserializer fails, you fix the deserializer, reset the affected records back to pending, and re-run. The raw data is still there. Nothing is lost.

Stage 3: Transform and Merge

The transform job reads from the structured staging area and writes to your serving layer — a fact table, a Delta table, a data mart. This is where business logic lives: slowly changing dimension handling, deduplication, derived column calculations, surrogate key lookup.

Separating this from stage 2 matters because transform logic changes more frequently than deserialization logic. If you bundle them, every transform change requires retesting the deserializer. Kept separate, each stage has a clear interface (the structured staging table) and can be independently deployed, tested, and replayed.

The Operational Win

The pattern sounds like more moving parts, and it is. But the operational properties are better:

Stage 1 fails: no data lost, messages still in Kafka (within retention window)
Stage 2 fails: no data lost, raw messages in raw zone, just unprocessed
Stage 3 fails: no data lost, structured records in staging, just un-merged
Schema changes: fix stage 2, replay from raw zone, stages 1 and 3 unchanged
Transform bug: fix stage 3, replay from structured staging, stages 1 and 2 unchanged

Compare that to the monolithic approach, where any of these failures requires you to figure out how far the processing got, reconstruct state, and decide whether to replay from Kafka (if the data is still within retention) or from a backup (if it is not).

Three stages. Clear interfaces. Independent failure domains. I have built this pattern across enough client environments to stop questioning it. I am here to help if you want to adapt it to your specific stack.

Designing the Three-Stage Streaming Pipeline: Ingest, Deserialize, Transform

Shannon Lowder

Stage 1: Raw Ingest

Stage 2: Deserialize

Stage 3: Transform and Merge

The Operational Win

Read more

The Context Problem Neither Agent Mesh Nor OpenSharing Solves

Unity AI Gateway and What a Governed Model Access Layer Actually Buys You

You Don't Need Fable. You Need a Router.

DAIS 2026: Genie One and the Context Problem Databricks Is Solving