Delta Lake in Production: The Six Things I Monitor Every Day

Shannon Lowder

14 Dec 2019 — 2 min read

When I'm asked how production Databricks pipelines are going, my honest answer is "I look at six things." Not dashboards full of metrics I don't read. Six specific things that tell me quickly whether the pipeline is healthy, whether the data is good, and whether something needs attention. Here's the list.

1. Row Count vs Yesterday

For every pipeline run, I log the output row count to MLflow. The single most useful check: is today's output count within 5% of yesterday's? That is the target; the check below only warns once the change passes 20%. A 40% drop in row count is almost always a problem: missing source data, a failed upstream job, a filter that got too aggressive. A 40% increase might also be a problem: duplicate processing, a watermark that reset.

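import mlflow

ROW_COUNT_WARN_PCT = 20  # warn when the count moves by more than this many percent

with mlflow.start_run(run_name=f"orders_daily_{processing_date}"):
    output_count = silver_df.count()
    mlflow.log_metric("output_row_count", output_count)

    # Compare to previous run. get_previous_run_metric is a project helper
    # (not shown); assumed here to return None when there is no prior run.
    prev_count = get_previous_run_metric("output_row_count")
    if prev_count:  # skip the comparison on a first run or a zero baseline
        pct_change = abs(output_count - prev_count) / prev_count * 100
        mlflow.log_metric("pct_change_from_prior_run", pct_change)

        if pct_change > ROW_COUNT_WARN_PCT:
            print(f"WARNING: Row count changed by {pct_change:.1f}% from prior run")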
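
The check above uses the absolute change, so a drop and an increase look the same in the log. Since the two usually point at different causes, here is a minimal sketch that keeps the sign. The function name, the labels, and the 20% default are illustrative assumptions, not part of the original pipeline.

```python
from typing import Optional


def classify_row_count_change(
    current: int, previous: Optional[int], warn_pct: float = 20.0
) -> str:
    """Label a row-count change as 'drop', 'increase', 'ok', or 'no_baseline'."""
    # No prior run, or a prior run with zero rows: nothing to compare against.
    if not previous:
        return "no_baseline"

    # Signed percentage change relative to the previous run.
    signed_pct = (current - previous) / previous * 100

    if signed_pct < -warn_pct:
        return "drop"      # e.g. missing source data, failed upstream job
    if signed_pct > warn_pct:
        return "increase"  # e.g. duplicate processing, reset watermark
    return "ok"
```

Logging the label next to the percentage makes it easier to tell at a glance which kind of investigation a warning calls for.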

2. Null Rates on Key Columns

Some columns should never be null. A null customer_id in the orders table is always wrong. I check null rates on a fixed set of key columns after every silver write:

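from pyspark.sql.functions import col, sum as spark_sum, count

null_check = silver_df.select(
    spark_sum(col("customer_id").isNull().cast("int")).alias("null_customer_id"),
    spark_sum(col("order_date").isNull().cast("int")).alias("null_order_date"),
    count("*").alias("total_rows"),
).collect()[0]

# An empty DataFrame would make the sums None and the division fail
if null_check.total_rows == 0:
    raise ValueError("silver_df is empty; cannot compute null rates")

# Check every key column that was counted, not just customer_id
for column in ("customer_id", "order_date"):
    null_rate = null_check[f"null_{column}"] / null_check.total_rows
    if null_rate > 0.001:  # more than 0.1% nulls is worth investigating
        raise ValueError(f"Unexpected null rate on {column}: {null_rate:.2%}")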

3. DESCRIBE DETAIL File Count

File count creep is slow and silent until it becomes a performance problem. I check it weekly on high-traffic tables:

detail = spark.sql("DESCRIBE DETAIL silver.orders").collect()[0]
if detail.numFiles:  # an empty table has no files to average
    avg_mb = (detail.sizeInBytes / detail.numFiles) / 1_048_576
    if avg_mb < 50:  # average file size under 50 MB suggests small-files accumulation
        print(f"Small files warning: {detail.numFiles} files, avg {avg_mb:.1f} MB")

4. Job Run Duration Trend

A pipeline that ran in 18 minutes last month and now takes 45 minutes didn't get slower overnight; it got slower gradually. MLflow run durations, plotted over time, show the trend before it becomes a crisis.
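
A plot is the easiest way to see this, but the same idea can be reduced to one number. The sketch below compares the median duration of the most recent runs against the median of the oldest runs in the list. It assumes you have already collected run durations in minutes, oldest first (for example from the start and end times MLflow records for each run). The function name and the 7-run window are illustrative assumptions.

```python
from statistics import median
from typing import Sequence


def duration_slowdown_ratio(durations_min: Sequence[float], window: int = 7) -> float:
    """Return recent median duration divided by baseline median duration.

    durations_min is ordered oldest first. The baseline is the first `window`
    runs and the recent sample is the last `window` runs.
    """
    if len(durations_min) < 2 * window:
        raise ValueError("need at least two full windows of run durations")

    baseline = median(durations_min[:window])
    recent = median(durations_min[-window:])
    if baseline <= 0:
        raise ValueError("baseline duration must be positive")

    # 1.0 means no change; 2.0 means recent runs take twice as long.
    return recent / baseline


# Example: three weeks of daily runs drifting from 18 to 45 minutes.
ratio = duration_slowdown_ratio([18] * 7 + [25] * 7 + [45] * 7)
print(f"Recent runs take {ratio:.1f}x the baseline")  # prints 2.5x
```

Medians are used instead of means so that one unusually slow run does not trigger or hide the signal.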

5. Streaming Consumer Lag

For Kafka-to-Delta streaming pipelines, consumer lag is the health metric. If lag is growing over time, the consumer isn't keeping up with the producer. I check this via the Kafka consumer group metrics, not in Databricks itself.
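
To make "growing over time" concrete, here is a minimal sketch of the arithmetic. It assumes you have already pulled per-partition log-end offsets and committed offsets from your Kafka tooling; how you fetch them is outside this sketch. The function names, the three-increase rule, and treating a partition with no committed offset as fully lagged are all illustrative assumptions.

```python
from typing import Mapping, Sequence


def total_lag(end_offsets: Mapping[int, int], committed: Mapping[int, int]) -> int:
    """Sum of (log-end offset - committed offset) across partitions."""
    # Assumption: a partition with no committed offset counts from offset 0.
    return sum(
        max(end - committed.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def lag_is_growing(lag_samples: Sequence[int], min_increases: int = 3) -> bool:
    """True when each of the last `min_increases` samples rose over the one before."""
    if len(lag_samples) < min_increases + 1:
        return False  # not enough history to call it a trend
    recent = lag_samples[-(min_increases + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))


# Example: two partitions, then a series of total-lag samples over time.
print(total_lag({0: 100, 1: 250}, {0: 90, 1: 200}))  # prints 60
print(lag_is_growing([60, 80, 120, 200]))            # prints True
```

Requiring several consecutive increases avoids alerting on a single burst that the consumer catches up on by the next sample.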

6. VACUUM Schedule

I verify weekly that the VACUUM job ran successfully and check that table sizes aren't growing faster than expected due to skipped VACUUM runs. Not exciting, but letting VACUUM fall behind compounds into a maintenance headache.
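
One way to automate the "did it run" half of this check is to look at the table history. The sketch below works on rows you have already fetched (for example the output of DESCRIBE HISTORY converted to dictionaries) and asks whether the last completed VACUUM is older than expected. The operation name "VACUUM END", the function names, and the 8-day allowance are assumptions; check what your own table history actually records before relying on them.

```python
from datetime import datetime, timedelta
from typing import Mapping, Optional, Sequence


def last_vacuum_time(
    history: Sequence[Mapping], operation: str = "VACUUM END"
) -> Optional[datetime]:
    """Return the timestamp of the most recent matching history entry, if any."""
    # Assumption: each row has "operation" and "timestamp" keys.
    times = [row["timestamp"] for row in history if row["operation"] == operation]
    return max(times) if times else None


def vacuum_is_overdue(
    history: Sequence[Mapping],
    now: datetime,
    max_age: timedelta = timedelta(days=8),
) -> bool:
    """True when no VACUUM completed within max_age (or none is recorded at all)."""
    last = last_vacuum_time(history)
    return last is None or now - last > max_age


# Example with hand-written history rows.
history = [
    {"operation": "WRITE", "timestamp": datetime(2019, 12, 13)},
    {"operation": "VACUUM END", "timestamp": datetime(2019, 12, 1)},
]
print(vacuum_is_overdue(history, now=datetime(2019, 12, 14)))  # prints True
```

The 8-day allowance gives a weekly job one day of slack before it is flagged.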

Six things. Most of them are under 5 minutes to check. The discipline is running them consistently, not heroically. As always, I'm here to help.
