The Case for Declarative Data Transformation

There's a tool called dbt — data build tool — that Fishtown Analytics has been developing and using with clients for the last several months. It's not widely known yet. I think it's going to matter, so I want to explain the core idea before it arrives in your inbox as a vendor pitch.

The premise is simple: write your data transformations as SELECT statements. dbt handles the rest — the DDL to create or replace the output tables, the dependency ordering between transformations, the schema tests, the documentation. You describe what the data should look like in SQL; dbt figures out how to build it.

That's the declarative shift. And it's the right model for the analytics transformation layer.

The Imperative Alternative and Its Problems

The current standard for analytics transformations is some combination of: Spark jobs that read one table, apply transformations, and write another table; Hive scripts that do the same; or stored procedures in Redshift or Snowflake that chain together via a scheduler. These are all imperative — they describe how to compute the output, step by step, with explicit CREATE TABLE and INSERT INTO statements wrapped around the logic.

The problem is maintenance. Imperative transformation code mixes the business logic (the interesting part) with the infrastructure scaffolding (CREATE TABLE IF NOT EXISTS, DROP TABLE IF EXISTS, INSERT OVERWRITE, MSCK REPAIR TABLE). When you want to understand what a transformation does, you read through the scaffolding to find the logic. When you want to change the output schema, you change the CREATE TABLE and potentially every INSERT that references it. When you add a new transformation that depends on an existing one, you manually verify the dependency order in your scheduler.

The dbt Model

A dbt model is a SQL file containing a SELECT statement. Nothing else. No DDL, no INSERT, no scaffolding.

-- models/session_events.sql
-- This is the complete dbt model. dbt generates the CREATE TABLE AS SELECT.

SELECT
    e.user_id,
    e.session_id,
    e.page_url,
    e.event_ts,
    u.signup_date,
    u.account_tier,
    DATEDIFF('day', u.signup_date, CAST(e.event_ts AS DATE)) AS days_since_signup
FROM {{ ref('raw_page_views') }} e
LEFT JOIN {{ ref('dim_users') }} u ON e.user_id = u.user_id
WHERE e.event_ts IS NOT NULL

The {{ ref('raw_page_views') }} syntax is dbt's dependency declaration. dbt parses all the ref() calls across your models, builds a DAG, and runs models in the correct dependency order. You don't maintain the dependency order in your scheduler — dbt derives it from the SQL itself.

Tests That Run Against the Data

dbt has a testing model built in. You declare expected properties of your output columns in a YAML file, and dbt generates and runs the test queries for you:

# models/schema.yml
models:
  - name: session_events
    columns:
      - name: user_id
        tests:
          - not_null
          - relationships:
              to: ref('dim_users')
              field: user_id
      - name: session_id
        tests:
          - not_null
      - name: days_since_signup
        tests:
          - not_null
          - accepted_values:
              values: [0, 1, 2, 3, 4, 5, 6, 7]  # first week only

Run dbt test and dbt executes these as SQL queries against your actual data. Referential integrity violations, unexpected NULLs, out-of-range values — all surface as test failures before the data reaches downstream consumers.

Where This Fits and Where It Doesn't

dbt is the right tool for the analytics transformation layer — the SQL-resident transformations that build marts and aggregates from a processed source layer. It is not a Spark replacement. Complex transformations involving Python UDFs, ML feature engineering, or multi-terabyte joins that benefit from Spark's distributed shuffle do not belong in dbt.

The right architecture is Spark (or Databricks) for the heavy processing and format normalization layer, dbt for the analytics transformation layer above it. Spark does what SQL can't; dbt makes the SQL layer maintainable and testable. They compose well.

If you're evaluating dbt and want to talk through where it fits in your specific stack, reach out. As always, I'm here to help.

Read more