The Bronze-Silver-Gold Architecture: Organizing Your Data Lake the Right Way

One of the first things I noticed when I started working seriously with data lakes is that the same raw data would get processed in slightly different ways by different teams, and the "current version" of a customer record or a product attribute was genuinely hard to determine. Three teams, three pipeline outputs, three answers to "what is this customer's region today?"

The bronze-silver-gold architecture is the most common answer to that problem in the Databricks ecosystem, and understanding it early saves you from building a layout that you'll want to redesign in six months.

Three Layers, Three Contracts

Bronze is raw data, exactly as it arrived. No transformations, no cleaning, no business logic. If you're ingesting from SQL Server, bronze is the exact byte-for-byte copy of the source tables. If you're ingesting from a Kafka topic, bronze is the raw event payload.

Why keep raw data? Because your transformation logic will be wrong in ways you can't anticipate. Having the raw data means you can always reprocess from source rather than starting over from scratch or going back to beg the upstream team for a re-extract.

Silver is cleaned, validated, and conformed data. Nulls handled. Types cast. Business keys resolved. Schema enforced. The same customer ID from three different source systems is resolved to a single canonical ID here. Silver is where your downstream teams should generally start their work.

Gold is aggregated, business-ready data. Metrics tables, fact tables, dimension tables optimized for your BI tool. Gold is typically what your end users query. It can be rebuilt from silver at any time.

What This Looks Like in Practice

-- Bronze: raw ingestion, append-only
CREATE TABLE bronze.orders_raw (
  raw_payload STRING,
  source_system STRING,
  ingested_at TIMESTAMP,
  batch_id STRING
)
USING DELTA
LOCATION '/mnt/myproject/bronze/orders'

-- Silver: cleaned and typed
CREATE TABLE silver.orders (
  order_id BIGINT NOT NULL,
  customer_id BIGINT NOT NULL,
  order_date DATE NOT NULL,
  total_amount DECIMAL(18, 4) NOT NULL,
  status STRING,
  region STRING,
  processed_at TIMESTAMP
)
USING DELTA
LOCATION '/mnt/myproject/silver/orders'

-- Gold: aggregated for reporting
CREATE TABLE gold.daily_order_summary (
  order_date DATE,
  region STRING,
  order_count BIGINT,
  total_revenue DECIMAL(18, 4),
  updated_at TIMESTAMP
)
USING DELTA
LOCATION '/mnt/myproject/gold/daily_order_summary'

Pipeline Flow

Bronze is append-only — you never update or delete bronze data. Every ingestion run adds new rows (or new files). Silver is rebuilt or incrementally updated from bronze based on business rules. Gold is rebuilt from silver on a schedule.

If you discover a bug in your silver transformation, you fix the logic and reprocess bronze into silver. Gold can then be rebuilt from the corrected silver. The bad data never reached bronze, so your lineage is clean.

The SQL Server Analogy

Bronze is your staging tables. Silver is your data warehouse layer. Gold is your data mart. The main difference: in SQL Server you probably rebuilt the staging tables on each run, which made reprocessing hard. Delta's append-only bronze with time travel means you keep the full history without the storage cost explosion, and reprocessing is fast because the raw data is already in your lake.

You don't have to call them bronze, silver, and gold. Raw/clean/curated works. Staging/conformed/aggregated works. The naming is less important than the principle: raw data stays raw, business logic lives in a separate layer, and end users always hit the most-refined layer. As always, I'm here to help.

Read more