Unity Catalog Lineage: What It Tracks and What You Still Need to Instrument Yourself

Unity Catalog tracks lineage automatically for SQL queries and DataFrames run in notebooks and jobs attached to a UC metastore. You run a query, UC records which tables were read and which were written, and that information shows up in the Catalog Explorer lineage graph. For straightforward table-to-table transformations, this works without any configuration on your part.

Here's what it actually captures, what it misses, and what you'll need to instrument yourself.

What UC Lineage Tracks Automatically

For SQL: any SELECT, INSERT, CREATE TABLE AS SELECT, or MERGE that runs in a UC-enabled workspace is captured. The lineage is at the column level — UC tracks that orders.total_amount fed daily_summary.revenue through a specific query.

For Python DataFrames: UC captures table-level lineage for DataFrame reads and writes. Column-level lineage for DataFrame operations is not captured automatically — that requires SQL or the lineage enrichment APIs.

-- This query generates column-level lineage automatically
-- UC records: orders.order_amount → daily_summary.total_revenue
CREATE TABLE prod_analytics.gold.daily_summary
AS
SELECT
order_date,
region_code,
SUM(order_amount) AS total_revenue,
COUNT(DISTINCT customer_id) AS unique_customers
FROM prod_analytics.silver.orders
GROUP BY order_date, region_code;

Viewing Lineage Programmatically

import requests

DATABRICKS_HOST = "https://your-workspace.azuredatabricks.net"
TOKEN = "dapi..."

def get_table_lineage(catalog: str, schema: str, table: str) -> dict:
resp = requests.get(
f"{DATABRICKS_HOST}/api/2.0/lineage-tracking/table-lineage",
headers={"Authorization": f"Bearer {TOKEN}"},
params={
"table_name": f"{catalog}.{schema}.{table}",
"include_entity_lineage": True
}
)
resp.raise_for_status()
return resp.json()

# Get what feeds into the daily_summary table
lineage = get_table_lineage("prod_analytics", "gold", "daily_summary")
print("Upstream tables:")
for upstream in lineage.get('upstreams', []):
print(f" {upstream['tableInfo']['name']}")

What UC Lineage Misses

The gaps are real and worth knowing before you rely on the lineage graph:

Dynamic SQL and metadata-driven pipelines — if your pipeline builds a query string at runtime and executes it via spark.sql(query_string), UC captures the result (which tables were written) but may not capture the full upstream dependencies. The lineage engine parses static SQL; it cannot trace through string concatenation.

External writes — data written to external tables or paths via JDBC, file writes, or third-party connectors outside the Databricks compute environment is not tracked.

Cross-workspace lineage — lineage tracks within a metastore. If data moves from one metastore to another (Delta Sharing, data copies between regions), the lineage graph doesn't cross that boundary.

Adding Custom Lineage for Gaps

def record_custom_lineage(
source_table: str,
target_table: str,
transformation_type: str
) -> None:
# UC doesn't have a public API for custom lineage injection yet (as of mid-2022)
# The workaround: log to a lineage tracking table in your metastore
spark.sql(f"""
INSERT INTO prod_analytics.meta.custom_lineage_log
VALUES (
'{source_table}',
'{target_table}',
'{transformation_type}',
current_timestamp(),
current_user()
)
""")

The automated lineage is genuinely useful for the standard transformation patterns. The gaps matter for the complex patterns — metadata-driven pipelines, parameterized dynamic SQL, and cross-system data flows are exactly the patterns where data engineers most need lineage, and exactly the patterns that are hardest to capture automatically. Supplement with custom logging for those cases. As always, I'm here to help.

Read more