The Raw Zone Revisited: Why We Still Serialize Messages to Storage First

I wrote about the raw zone back in 2013, when the pattern was just becoming standard practice in the teams I worked with. The technology has changed significantly — Delta Lake, Auto Loader, Schema Registry, Unity Catalog — but every time I revisit the argument, I find the core principle holds. Land raw first. Interpret later.

This post is not a repeat of the 2013 argument. It is an update: what the raw zone looks like in 2022 with better tooling, where the old reasoning has evolved, and one place where I have actually changed my position.

What Has Changed

The tooling is much better. In 2013, "land raw" meant writing bytes to files or a staging table and managing file naming, deduplication, and cleanup by hand. In 2022, Delta Lake with Auto Loader takes over most of that work. The Auto Loader checkpoint tracks which files have been processed. The Delta transaction log makes each write atomic. Time travel lets you query the raw zone as it was at any earlier version or timestamp that is still inside the table's retention window. As a result, the operational cost of maintaining a raw zone is far lower than it was.

Schema Registry reduces, but does not eliminate, the need for raw. When producers register Avro schemas and consumers look them up, schema drift is caught at deserialization time. But Schema Registry does not by itself stop producers from registering a breaking change and publishing against it: compatibility checks only reject what the configured compatibility mode covers. What it does prevent is consumers silently misreading. The consumer gets a deserialization error, not corrupted data. You still need the raw zone to recover from that error, because the messages that failed have to be kept somewhere until the consumer can read them.
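Recovery from a deserialization error depends on the landed bytes still carrying their schema reference. A minimal sketch, assuming producers use the Confluent wire format (one zero magic byte, a 4-byte big-endian schema ID, then the Avro body). The function name and the sample bytes are hypothetical:

```python
import struct
from typing import NamedTuple


class FramedMessage(NamedTuple):
    schema_id: int      # ID the producer registered in Schema Registry
    avro_body: bytes    # Avro-encoded payload, left untouched


def split_confluent_frame(raw: bytes) -> FramedMessage:
    """Split a Confluent-framed Kafka value into schema ID and Avro body.

    Assumed frame layout: 1 magic byte (0x00), 4-byte big-endian schema ID, payload.
    """
    if len(raw) < 5 or raw[0] != 0:
        raise ValueError("not a Confluent-framed message")
    (schema_id,) = struct.unpack(">I", raw[1:5])
    return FramedMessage(schema_id=schema_id, avro_body=raw[5:])


# Hypothetical value as landed in the raw zone: schema ID 42, three payload bytes.
landed = b"\x00" + struct.pack(">I", 42) + b"\x02\x06\x08"
frame = split_confluent_frame(landed)
```

Because the raw zone holds the full frame, a consumer that failed on an unexpected schema can be fixed and pointed at the same bytes again; nothing in the frame was interpreted at landing time.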

Delta constraints add a backstop at the serving layer. CHECK constraints on gold Delta tables fail any write that contains a record violating a domain rule. This is a meaningful addition. But a constraint violation at the gold layer still requires you to trace back to the source record to understand what happened. The raw zone is where that trace ends.

What Has Not Changed

The fundamental argument is unchanged: you cannot derive the raw record from any downstream representation of it. If you process a message inline and discard the original, you have committed to your interpretation of that message being correct. If you are wrong — wrong schema, wrong deserialization logic, wrong business rule — you cannot go back.
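The argument can be made concrete with a toy replay. This is a plain-Python sketch, not pipeline code: the records, the two parser functions, and the truncation bug are all invented for illustration.

```python
import json

# Hypothetical raw zone: original payloads keyed by Kafka position.
raw_zone = [
    {"topic": "sensors", "partition": 0, "offset": 0,
     "raw_payload": '{"id": "a1", "temp": "21.5"}'},
    {"topic": "sensors", "partition": 0, "offset": 1,
     "raw_payload": '{"id": "a2", "temp": "19"}'},
]


def parse_v1(payload: str) -> dict:
    """First interpretation: wrongly truncates the temperature to an integer."""
    doc = json.loads(payload)
    return {"id": doc["id"], "temp_c": int(float(doc["temp"]))}


def parse_v2(payload: str) -> dict:
    """Corrected interpretation: keeps the fractional part."""
    doc = json.loads(payload)
    return {"id": doc["id"], "temp_c": float(doc["temp"])}


def replay(parser):
    """Rebuild the downstream rows from the raw zone using the given parser."""
    return [parser(row["raw_payload"]) for row in raw_zone]


downstream = replay(parse_v1)   # the fractional part is already gone here
rebuilt = replay(parse_v2)      # only possible because the raw payload was kept
```

Nothing in `downstream` can tell you that the first reading was 21.5 rather than 21. Only the stored payload can.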

-- 2022 raw zone: Delta table with Unity Catalog governance
-- Schema captures everything the pipeline needs for replay

CREATE TABLE IF NOT EXISTS raw.sensor_readings_all_topics (
    topic           STRING      NOT NULL,
    `partition`     INT         NOT NULL,   -- backticks: PARTITION and OFFSET are SQL keywords
    `offset`        BIGINT      NOT NULL,
    kafka_ts        TIMESTAMP   NOT NULL,
    arrived_at      TIMESTAMP   NOT NULL,
    raw_payload     STRING      NOT NULL
)
USING DELTA
LOCATION '/mnt/datalake/raw/sensor_readings_all_topics'
TBLPROPERTIES (
    'delta.enableChangeDataFeed' = 'false',   -- append-only, no CDF needed
    'delta.logRetentionDuration' = 'interval 180 days',  -- keep transaction log
    'delta.deletedFileRetentionDuration' = 'interval 180 days'
);

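-- Delta Lake has no enforced UNIQUE constraint, so de-duplicate on write instead.
-- incoming_batch is a hypothetical view over the micro-batch being landed.
MERGE INTO raw.sensor_readings_all_topics AS t
USING incoming_batch AS s
ON t.topic = s.topic AND t.`partition` = s.`partition` AND t.`offset` = s.`offset`
WHEN NOT MATCHED THEN INSERT *;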
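
Whatever mechanism enforces it, the rule is the same: a given Kafka position lands at most once, and a replay of the same batch changes nothing. A plain-Python sketch of that rule, with an in-memory dict standing in for the raw table (all names here are hypothetical):

```python
from typing import Dict, Iterable, Tuple

Position = Tuple[str, int, int]  # (topic, partition, offset)


def land_batch(raw_table: Dict[Position, dict], batch: Iterable[dict]) -> int:
    """Insert only messages whose Kafka position is not already landed.

    Returns the number of newly inserted rows.
    """
    inserted = 0
    for msg in batch:
        key = (msg["topic"], msg["partition"], msg["offset"])
        if key in raw_table:
            continue  # already landed; keep the first copy unchanged
        raw_table[key] = msg
        inserted += 1
    return inserted


table: Dict[Position, dict] = {}
batch = [
    {"topic": "sensors", "partition": 0, "offset": 10, "raw_payload": "{}"},
    {"topic": "sensors", "partition": 0, "offset": 11, "raw_payload": "{}"},
]
first = land_batch(table, batch)    # lands both messages
second = land_batch(table, batch)   # replay of the same batch lands nothing
```

Keeping the first copy rather than overwriting matters for a raw zone: `arrived_at` should record when the message first landed, not when it was last replayed.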

One Position I Have Updated

In 2013 I was firm that the raw zone should store raw bytes: no parsing at all, not even UTF-8 decoding. I have softened on this for JSON payloads. Storing JSON as a string (decoded bytes, not parsed structure) is acceptable, provided the producer emits valid UTF-8, because you preserve the original representation without interpreting the structure. Storing Avro bytes as a string is not acceptable. Avro is a binary encoding, and arbitrary bytes are generally not valid UTF-8, so the original message cannot be reliably recovered from a decoded string. For Avro, store the bytes in a binary column or encode them as base64.

from pyspark.sql.functions import base64, col

# kafka_df is assumed to be a DataFrame read from Kafka; its "value" column is binary.

# For JSON payloads: string storage is fine
json_raw = kafka_df.select(
    col("value").cast("string").alias("raw_payload")  # UTF-8 decode, JSON text left unparsed
)

# For Avro payloads: keep binary or base64 encode
avro_raw = kafka_df.select(col("value").alias("raw_payload_bytes"))        # binary column
# or
avro_raw = kafka_df.select(base64(col("value")).alias("raw_payload_b64"))  # base64 string for text-friendly storage
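
The difference between the two cases is easy to check outside Spark. In this sketch, Python's own UTF-8 decoding stands in for whichever layer eventually treats the column as text, and the byte values are made up; the second payload is just arbitrary binary, not real Avro.

```python
import base64

# Hypothetical payloads: a JSON document and a binary blob standing in for Avro.
json_bytes = '{"sensor": "a1", "temp": 21.5}'.encode("utf-8")
avro_like_bytes = bytes([0x00, 0x00, 0x00, 0x00, 0x2A, 0xC3, 0x28, 0xFF])

# JSON: decoding to a string and encoding back returns the original bytes.
json_roundtrip = json_bytes.decode("utf-8").encode("utf-8")

# Binary: invalid UTF-8 sequences are replaced with U+FFFD, so bytes are lost.
lossy = avro_like_bytes.decode("utf-8", errors="replace").encode("utf-8")

# Binary via base64: text-safe and fully reversible.
b64_text = base64.b64encode(avro_like_bytes).decode("ascii")
b64_roundtrip = base64.b64decode(b64_text)
```

The JSON and base64 round trips return the original bytes; the replaced-character version does not, and no later fix can undo that.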

Unity Catalog and the Raw Zone

Unity Catalog in 2022 gives the raw zone proper access control: not just file-system ACLs, but table-level permissions with audit logging. This matters because your raw zone contains original, unmasked event data. In regulated industries, access to raw personally identifiable data should be restricted even within your data team. Unity Catalog makes that tractable. Set the raw schema to owner-only access, give the deserialize job a service principal with read access, and leave the rest of the organization working from bronze and gold, where data is appropriately masked or transformed.

The raw zone is not a data quality problem or a governance headache. It is the source of truth. Treat it accordingly. I am here to help if you are designing or rethinking your raw zone architecture in the current tooling environment.
