Data Modeling Best Practices for Lakehouse Architectures: What Dimensional Modeling Gets Right

Ten years of building data models across different platforms has given me some strong opinions about what works in a lakehouse environment versus what you'll fight against. The lakehouse relaxes a lot of the old constraints — no referential integrity, no NOT NULL enforcement at the engine level, schema-on-read — and that relaxation is a feature, not a bug. But it doesn't mean modeling discipline goes out the window. It means the discipline shows up in your pipeline rather than your DDL.

Principle 1: Name Things for the Query, Not the Source

Source system column names are often terrible: CUST_NM, ORD_DT, AMT_USD, FL_ACTV. In a relational database, a view layer provides human-readable names. In a lakehouse, you have the opportunity to fix the names at the silver layer — and you should.

-- Bronze: preserve source names exactly (for traceability)
-- Silver: rename for clarity
CREATE TABLE prod_analytics.silver.customers
AS
SELECT
CUST_ID AS customer_id,
CUST_NM AS customer_name,
CUST_EMAIL AS email_address,
FL_ACTV AS is_active,
CR_DT AS created_date,
UPD_DT AS last_updated_date
FROM prod_analytics.bronze.raw_customers;

Once you rename at silver, everything downstream uses the clean names consistently. Never rename again between silver and gold.

Principle 2: Partition on What You Filter On

Partitioning in Delta Lake determines which files get scanned for a given query. If you partition by region_code but most queries filter by order_date, you're paying for full table scans. Partition on the column that appears in WHERE clauses on the most critical queries.

-- Good: most queries filter by date range
CREATE TABLE prod_analytics.gold.daily_order_summary
(
order_date DATE,
region_code STRING,
total_revenue DECIMAL(18,2),
order_count INT
)
USING DELTA
PARTITIONED BY (order_date);

-- Z-order on secondary filter columns within each partition
OPTIMIZE prod_analytics.gold.daily_order_summary
ZORDER BY (region_code);

Principle 3: Keep the Grain Clear in the Table Name

The most common modeling confusion I see: a table named orders that turns out to be one row per order line item, not one row per order. Or daily_sales that's actually at the customer-day grain, not the total-day grain. The grain belongs in the table name or comment.

-- Table names that communicate grain
CREATE TABLE prod_analytics.gold.order_line_items -- one row per line item
CREATE TABLE prod_analytics.gold.order_headers -- one row per order
CREATE TABLE prod_analytics.gold.daily_sales_by_region -- one row per date/region
CREATE TABLE prod_analytics.gold.customer_lifetime_summary -- one row per customer

Principle 4: Separate Structure from Logic

Business logic — rebate calculations, allocation rules, complex conditional aggregations — belongs in clearly named columns in transformation code, not embedded in table design. Tables store the result of the logic; they don't implement it. If someone needs to understand why a row has a specific value, they should be able to trace it to a named function or transformation step, not infer it from a table schema.

Principle 5: COMMENT Everything

In Databricks, table and column comments are queryable and appear in Unity Catalog's Catalog Explorer. A comment takes thirty seconds to write and saves the next engineer thirty minutes of reading source code to understand what a column means.

CREATE TABLE prod_analytics.gold.customer_risk_scores (
customer_id STRING COMMENT 'Internal customer identifier from the CRM system',
risk_score DECIMAL COMMENT 'Composite risk score 0.0-1.0; scores above 0.7 trigger manual review',
score_date DATE COMMENT 'Date the score was computed — not the order date',
model_version STRING COMMENT 'MLflow model version used to compute this score'
)
USING DELTA
COMMENT 'Daily customer risk scores — one row per customer per day, computed by the risk pipeline';

As always, I'm here to help.

Read more