Schema Registry: Managing Message Formats Before They Break Your Pipeline

Here is a failure mode I have seen enough times to write a post about it. A producer team updates their message schema — adds a field, renames one, changes a type from string to integer. They tell nobody. The consumer deserialization code breaks. Or worse, it does not break — it silently reads the wrong values and produces corrupted output downstream.

Kafka does not enforce schemas. It moves bytes. What those bytes mean is entirely your problem. Schema Registry is the solution: a centralized service that stores versioned schemas and lets producers and consumers negotiate compatibility before messages are written or read.

What Schema Registry Does

Confluent Schema Registry (the dominant implementation, source-available under the Confluent Community License) runs as a separate service alongside your Kafka cluster. Before publishing, the producer's serializer registers the schema with the registry, or looks it up if it is already registered, and the registry assigns it a numeric ID. The serializer then prepends a 5-byte header to every message: a magic byte followed by the 4-byte schema ID. Consumers read the ID from the header, fetch the matching schema from the registry, and use it to deserialize correctly, even if their local schema is a different version.
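
To make the header layout concrete, here is a minimal Python sketch of how the 5-byte prefix can be written and read. It assumes the Confluent wire format (a zero "magic" byte followed by the schema ID as a 4-byte big-endian integer). The function names are illustrative and not part of any client library.

```python
import struct

MAGIC_BYTE = 0  # marker byte that starts every message in the Confluent wire format


def add_header(schema_id: int, payload: bytes) -> bytes:
    """Prefix an encoded payload with the magic byte and the 4-byte schema ID."""
    return struct.pack(">bI", MAGIC_BYTE, schema_id) + payload


def split_header(message: bytes) -> tuple[int, bytes]:
    """Return (schema_id, payload); raise ValueError if the header is malformed."""
    if len(message) < 5:
        raise ValueError("message shorter than the 5-byte header")
    magic, schema_id = struct.unpack(">bI", message[:5])
    if magic != MAGIC_BYTE:
        raise ValueError(f"unexpected magic byte: {magic}")
    return schema_id, message[5:]
```

In practice the Kafka serializer and deserializer classes do this for you. The sketch is only meant to show that the header is cheap to parse without any registry call.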

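The registry enforces compatibility rules per subject (by default one subject each for a topic's keys and values, such as sensor_readings-value). The default rule is backward compatibility: consumers using the new schema version must be able to read data written with the previous version. Under this rule you can add optional fields (fields with a default) and remove fields, but you cannot add a required field or change a field's type, apart from the promotions Avro allows such as int to long, without failing the check.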

Compatibility Modes

  • BACKWARD — new schema can read data written with old schema. Consumers can upgrade first. Default.
  • FORWARD — old schema can read data written with new schema. Producers can upgrade first.
  • FULL — both directions. Safest, most restrictive.
  • NONE — no compatibility checking. Not recommended for production.
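
The four modes can be expressed as one "can this reader decode that writer's data" check applied in different directions. The sketch below is a deliberately simplified Python illustration for flat Avro records. It ignores type promotions, unions, aliases, and nested records, all of which the real registry handles, and the function names are hypothetical.

```python
def can_read(reader: dict, writer: dict) -> bool:
    """True if `reader` can decode data written with `writer` (flat records only)."""
    writer_types = {f["name"]: f["type"] for f in writer["fields"]}
    for field in reader["fields"]:
        if field["name"] in writer_types:
            if field["type"] != writer_types[field["name"]]:
                return False  # type changed (promotions are ignored in this sketch)
        elif "default" not in field:
            return False  # reader needs a value the writer never wrote
    return True


def is_compatible(new: dict, old: dict, mode: str) -> bool:
    """Apply a compatibility mode to a proposed schema and the previous one."""
    backward = can_read(reader=new, writer=old)  # new schema reads old data
    forward = can_read(reader=old, writer=new)   # old schema reads new data
    return {"BACKWARD": backward, "FORWARD": forward,
            "FULL": backward and forward, "NONE": True}[mode]


V1 = {"type": "record", "name": "SensorReading", "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "unit", "type": "string"},
]}
# V2 drops "unit" and adds an optional "site" field with a default.
V2 = {"type": "record", "name": "SensorReading", "fields": [
    {"name": "sensor_id", "type": "string"},
    {"name": "site", "type": "string", "default": "unknown"},
]}

print(is_compatible(V2, V1, "BACKWARD"))  # True
print(is_compatible(V2, V1, "FORWARD"))   # False: old readers still require "unit"
```

The example change passes BACKWARD but fails FORWARD and FULL, which is why the choice of mode determines whether consumers or producers have to upgrade first.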

# Register a schema (Avro) via the Schema Registry REST API.
# The "schema" value is itself a JSON string, so its inner quotes must be escaped.
curl -X POST http://schema-registry:8081/subjects/sensor_readings-value/versions \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{
    "schema": "{\"type\": \"record\", \"name\": \"SensorReading\", \"fields\": [{\"name\": \"sensor_id\", \"type\": \"string\"}, {\"name\": \"value\", \"type\": \"double\"}, {\"name\": \"unit\", \"type\": \"string\"}, {\"name\": \"ts_ms\", \"type\": \"long\"}]}"
  }'
# Returns the ID assigned to the schema: {"id": <id>}

# Check compatibility of a proposed new schema before registering
curl -X POST http://schema-registry:8081/compatibility/subjects/sensor_readings-value/versions/latest \
  -H "Content-Type: application/vnd.schemaregistry.v1+json" \
  -d '{ "schema": "..." }'
# Returns: {"is_compatible": true}
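
The escaping in these request bodies is easy to get wrong by hand, because the Avro schema has to be sent as a JSON string inside a JSON object. A small Python sketch of the same compatibility call, using only the standard library, lets `json.dumps` do the escaping. The helper names and the `SENSOR_READING` constant are illustrative, and the base URL is whatever your registry listens on.

```python
import json
import urllib.request

SENSOR_READING = {
    "type": "record",
    "name": "SensorReading",
    "fields": [
        {"name": "sensor_id", "type": "string"},
        {"name": "value", "type": "double"},
        {"name": "unit", "type": "string"},
        {"name": "ts_ms", "type": "long"},
    ],
}


def schema_payload(avro_schema: dict) -> bytes:
    """Build the request body: the schema is JSON-encoded, then wrapped and encoded again."""
    return json.dumps({"schema": json.dumps(avro_schema)}).encode("utf-8")


def check_compatibility(base_url: str, subject: str, avro_schema: dict) -> bool:
    """POST a proposed schema to the compatibility endpoint for the latest version."""
    request = urllib.request.Request(
        f"{base_url}/compatibility/subjects/{subject}/versions/latest",
        data=schema_payload(avro_schema),
        headers={"Content-Type": "application/vnd.schemaregistry.v1+json"},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["is_compatible"]
```

A call such as `check_compatibility("http://schema-registry:8081", "sensor_readings-value", SENSOR_READING)` fits naturally into a CI step, so an incompatible change fails the build instead of the pipeline.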

Avro vs. JSON Schema vs. Protobuf

The registry supports multiple serialization formats. Avro is the most common for Kafka pipelines — it is compact (no field names in the encoded data, just values in schema-defined order), has strong schema evolution semantics, and has good library support. JSON Schema is easier to read and write but produces larger messages. Protobuf is efficient and widely used in gRPC ecosystems.
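
To show where Avro's compactness comes from, the sketch below hand-encodes one SensorReading record using the Avro binary rules for the three types involved (zigzag varint for long, length-prefixed UTF-8 for string, 8-byte little-endian for double) and compares it with compact JSON. It is a hand-rolled illustration with made-up sample values. A real pipeline would use an Avro library rather than these helper functions.

```python
import json
import struct


def encode_long(n: int) -> bytes:
    """Avro int/long: zigzag-encode, then write 7 bits per byte, low bits first."""
    z = (n << 1) ^ (n >> 63)
    out = bytearray()
    while z > 0x7F:
        out.append((z & 0x7F) | 0x80)  # high bit set: more bytes follow
        z >>= 7
    out.append(z)
    return bytes(out)


def encode_string(s: str) -> bytes:
    """Avro string: byte length as a long, followed by the UTF-8 bytes."""
    data = s.encode("utf-8")
    return encode_long(len(data)) + data


def encode_double(x: float) -> bytes:
    """Avro double: 8 bytes, IEEE 754, little-endian."""
    return struct.pack("<d", x)


def encode_reading(r: dict) -> bytes:
    """Write values only, in schema order: sensor_id, value, unit, ts_ms."""
    return (encode_string(r["sensor_id"]) + encode_double(r["value"])
            + encode_string(r["unit"]) + encode_long(r["ts_ms"]))


reading = {"sensor_id": "s-17", "value": 21.5, "unit": "C", "ts_ms": 1700000000000}
avro_bytes = encode_reading(reading)
json_bytes = json.dumps(reading, separators=(",", ":")).encode("utf-8")
print(len(avro_bytes), len(json_bytes))
```

For this record the Avro body is 21 bytes against 66 bytes of JSON, since the field names live in the schema rather than in every message.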

For a data engineering pipeline where the priority is storage efficiency and schema governance, Avro is usually the right default. It is what you will see in most Confluent documentation and most mature Kafka implementations.

Integrating With the Three-Stage Pipeline

Schema Registry fits naturally into Stage 2 (deserialize) of the three-stage pipeline. The raw landing job (Stage 1) stores the raw Avro bytes, including the 5-byte header with the schema ID, without touching them. Stage 2 reads those bytes, extracts the schema ID, looks up the schema from the registry, and deserializes. If the schema ID is not in the registry (which should never happen if producers are registering correctly), Stage 2 treats it as a deserialization error and routes the message to the dead letter table.
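
The Stage 2 flow described above can be sketched in Python as follows. This is an outline under assumptions, not the pipeline's actual code: `lookup_schema` stands in for a registry client (returning None for an unknown ID), `decode` stands in for an Avro decoder, and the dead-letter row layout is invented for illustration.

```python
import struct
from typing import Callable, Optional


def deserialize_stage(
    raw_messages: list[bytes],
    lookup_schema: Callable[[int], Optional[dict]],
    decode: Callable[[dict, bytes], dict],
) -> tuple[list[dict], list[dict]]:
    """Split raw landed messages into decoded rows and dead-letter rows."""
    decoded, dead_letters = [], []
    for raw in raw_messages:
        try:
            if len(raw) < 5 or raw[0] != 0:
                raise ValueError("missing or malformed 5-byte header")
            (schema_id,) = struct.unpack(">I", raw[1:5])  # 4-byte big-endian ID
            schema = lookup_schema(schema_id)
            if schema is None:
                raise LookupError(f"schema ID {schema_id} not found in registry")
            decoded.append(decode(schema, raw[5:]))
        except Exception as exc:
            # Keep the untouched bytes so the message can be replayed later.
            dead_letters.append({"raw": raw, "error": str(exc)})
    return decoded, dead_letters
```

Because failures keep the original bytes alongside the error, a message that failed on a missing schema can be reprocessed once the schema is registered.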

This means the schema registry does not need to be available at ingest time, only at deserialize time. The raw bytes carry the schema ID, so they can be decoded later as long as the registry still holds that schema. Your ingest job has no dependency on the registry.

If you are building a Kafka pipeline today and you have more than one producer or plan to evolve your message formats, set up schema registry before you go to production. Retrofitting it onto an existing pipeline is painful. Starting with it is not. I am here to help.
