Delta Lake in Production: The Six Things I Monitor Every Day
When I'm asked how production Databricks pipelines are going, my honest answer is "I look at six things." Not dashboards full of metrics I don't read. Six specific things that tell me quickly whether the pipeline is healthy, whether the data is good, and whether something needs attention. Here's the list.
1. Row Count vs Yesterday
For every pipeline run, I log the output row count to MLflow. The single most useful check: is today's output count within 20% of yesterday's? A 40% drop in row count is almost always a problem: missing source data, a failed upstream job, or a filter that got too aggressive. A 40% increase might also be a problem: duplicate processing, or a watermark that reset.
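A drop and an increase point at different causes, so it can help to keep the sign of the change rather than only its size. Here is a minimal sketch in plain Python. The names `signed_pct_change` and `classify_row_count` and the `tolerance_pct` default are illustrative assumptions, not part of MLflow or of the pipeline described here.

```python
from typing import Optional


def signed_pct_change(current: int, previous: Optional[int]) -> Optional[float]:
    """Signed percent change from previous to current.

    Returns None when there is no usable baseline (no prior run, or a prior
    count of zero), so callers never divide by zero.
    """
    if not previous:
        return None
    return (current - previous) * 100.0 / previous


def classify_row_count(current: int, previous: Optional[int],
                       tolerance_pct: float = 20.0) -> str:
    """Label a run as 'ok', 'drop', 'increase', or 'no_baseline'.

    tolerance_pct is a hypothetical default; tune it per table.
    """
    change = signed_pct_change(current, previous)
    if change is None:
        return "no_baseline"
    if change < -tolerance_pct:
        return "drop"
    if change > tolerance_pct:
        return "increase"
    return "ok"
```

Logging the label next to the raw count makes it easier to tell "missing source data" days apart from "duplicate processing" days when scanning run history.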
import mlflow

with mlflow.start_run(run_name=f"orders_daily_{processing_date}"):
    output_count = silver_df.count()
    mlflow.log_metric("output_row_count", output_count)

    # Compare to previous run. get_previous_run_metric is a helper defined
    # elsewhere; this assumes it returns None when there is no prior run.
    prev_count = get_previous_run_metric("output_row_count")
    if prev_count:  # skip on a first run or a zero baseline (avoids ZeroDivisionError)
        pct_change = abs(output_count - prev_count) / prev_count * 100
        mlflow.log_metric("pct_change_from_prior_run", pct_change)
        if pct_change > 20:
            print(f"WARNING: Row count changed by {pct_change:.1f}% from prior run")
2. Null Rates on Key Columns
Some columns should never be null. A null customer_id in the orders table is always wrong. I check null rates on a fixed set of key columns after every silver write:
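from pyspark.sql.functions import col, count, sum as spark_sum

null_check = silver_df.select(
    spark_sum(col("customer_id").isNull().cast("int")).alias("null_customer_id"),
    spark_sum(col("order_date").isNull().cast("int")).alias("null_order_date"),
    count("*").alias("total_rows"),
).collect()[0]

# An empty DataFrame gives total_rows == 0 and None sums, so fail early
if null_check.total_rows == 0:
    raise ValueError("silver_df is empty; cannot compute null rates")

# Check every key column, not only customer_id
for column in ("customer_id", "order_date"):
    null_rate = null_check[f"null_{column}"] / null_check.total_rows
    if null_rate > 0.001:  # more than 0.1% nulls is worth investigating
        raise ValueError(f"Unexpected null rate on {column}: {null_rate:.2%}")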
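Different columns may deserve different tolerances, and it is often more useful to see every failing column at once than to stop at the first. A minimal sketch of that idea in plain Python follows. The function name `null_rate_violations` and the per-column thresholds are illustrative assumptions; the counts would come from an aggregation like the one above.

```python
from typing import Dict, Mapping


def null_rate_violations(null_counts: Mapping[str, int],
                         total_rows: int,
                         max_rates: Mapping[str, float]) -> Dict[str, float]:
    """Return {column: observed_null_rate} for every column over its limit.

    null_counts: nulls observed per column.
    max_rates:   highest acceptable null rate per column (0.0 means no nulls).
    Raises ValueError on an empty table and KeyError when a monitored column
    has no count, so a missing measurement cannot pass silently.
    """
    if total_rows <= 0:
        raise ValueError("total_rows must be positive")
    violations: Dict[str, float] = {}
    for column, max_rate in max_rates.items():
        rate = null_counts[column] / total_rows
        if rate > max_rate:
            violations[column] = rate
    return violations


# Hypothetical usage: 5 null customer_ids in 1,000 rows, limit 0.1%
found = null_rate_violations(
    {"customer_id": 5, "order_date": 0},
    1000,
    {"customer_id": 0.001, "order_date": 0.0},
)
```

An empty result means the write is clean; a non-empty one can be logged in full before the job fails.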
3. DESCRIBE DETAIL File Count
File count creep is slow and silent until it becomes a performance problem. I check it weekly on high-traffic tables:
detail = spark.sql("DESCRIBE DETAIL silver.orders").collect()[0]
if detail.numFiles:  # an empty table has no files to average over
    avg_mb = (detail.sizeInBytes / detail.numFiles) / 1_048_576
    if avg_mb < 50:  # files under 50MB suggest small-files accumulation
        print(f"Small files warning: {detail.numFiles} files, avg {avg_mb:.1f} MB")
4. Job Run Duration Trend
A pipeline that ran in 18 minutes last month and now takes 45 minutes didn't suddenly get slower — it got slower gradually. MLflow run duration metrics, plotted over time, show the trend before it becomes a crisis.
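A plot works for eyeballing the trend, but the same idea can be reduced to a single number. The sketch below compares the median of the most recent runs against the median of the oldest runs in the history. The function name `duration_drift_ratio` and the 7-run window are illustrative assumptions, and the durations are assumed to be already extracted, for example from each run's start and end times.

```python
from statistics import median
from typing import Optional, Sequence


def duration_drift_ratio(durations_min: Sequence[float],
                         window: int = 7) -> Optional[float]:
    """Ratio of the recent median duration to the baseline median.

    durations_min is ordered oldest to newest. The first `window` runs form
    the baseline and the last `window` runs form the recent sample. Returns
    None when there are too few runs for two non-overlapping windows, or
    when the baseline median is not positive.
    """
    if window <= 0 or len(durations_min) < 2 * window:
        return None
    baseline = median(durations_min[:window])
    recent = median(durations_min[-window:])
    if baseline <= 0:
        return None
    return recent / baseline


# Hypothetical history: about 18 minutes a while ago, about 45 minutes now
history = [18, 18, 19, 18, 20, 19, 18, 25, 31, 38, 44, 45, 46, 45, 44, 47, 45]
ratio = duration_drift_ratio(history)
```

Medians are used instead of means so that one unusually slow run does not trigger the alarm on its own. A ratio that stays above some chosen level, say 1.5, for several days is the gradual slowdown described above.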
5. Streaming Consumer Lag
For Kafka-to-Delta streaming pipelines, consumer lag is the health metric. If lag is growing over time, the consumer isn't keeping up with the producer. I check this via the Kafka consumer group metrics, not in Databricks itself.
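For reference, lag per partition is the log end offset minus the consumer group's committed offset, and "growing over time" means successive readings keep rising. A minimal sketch in plain Python follows. The function names, the dict-based inputs, and the three-sample rule are illustrative assumptions; the offsets themselves would come from whatever Kafka tooling reports them.

```python
from typing import Dict, Sequence


def total_lag(end_offsets: Dict[int, int],
              committed_offsets: Dict[int, int]) -> int:
    """Sum of (log end offset - committed offset) across partitions.

    Keys are partition numbers. A partition with no committed offset is
    counted from offset 0, which is a simplifying assumption. Negative
    differences are floored at 0.
    """
    return sum(
        max(end - committed_offsets.get(partition, 0), 0)
        for partition, end in end_offsets.items()
    )


def lag_is_growing(samples: Sequence[int], min_samples: int = 3) -> bool:
    """True when the last `min_samples` lag readings are strictly increasing."""
    if min_samples < 2 or len(samples) < min_samples:
        return False
    recent = samples[-min_samples:]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

A single high reading is usually a burst on the producer side; several rising readings in a row are what suggest the consumer is not keeping up.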
6. VACUUM Schedule
I verify weekly that the VACUUM job ran successfully and check that table sizes aren't growing faster than expected due to skipped VACUUM runs. Not exciting, but letting VACUUM fall behind compounds into a maintenance headache.
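The "did VACUUM run" check can be sketched against table history. The code below assumes rows shaped like the output of `DESCRIBE HISTORY`, each exposing `operation` and `timestamp`, and assumes a completed vacuum is recorded with the operation name `"VACUUM END"`. That name is an assumption worth verifying on your runtime, so it is exposed as a parameter. The function names and the 7-day limit are also illustrative.

```python
from datetime import datetime, timedelta
from typing import Any, Iterable, Optional


def last_vacuum_time(history: Iterable[Any],
                     operation_name: str = "VACUUM END") -> Optional[datetime]:
    """Most recent timestamp among history rows matching operation_name.

    Each row must support row["operation"] and row["timestamp"]. The default
    operation_name is an assumption about how vacuums appear in history.
    """
    times = [row["timestamp"] for row in history
             if row["operation"] == operation_name]
    return max(times) if times else None


def vacuum_is_overdue(history: Iterable[Any],
                      now: datetime,
                      max_age: timedelta = timedelta(days=7)) -> bool:
    """True when no vacuum is recorded or the latest one is older than max_age."""
    last = last_vacuum_time(history)
    return last is None or (now - last) > max_age


# Hypothetical history rows, newest first
rows = [
    {"operation": "WRITE", "timestamp": datetime(2024, 3, 20, 2, 0)},
    {"operation": "VACUUM END", "timestamp": datetime(2024, 3, 10, 3, 0)},
    {"operation": "VACUUM END", "timestamp": datetime(2024, 3, 3, 3, 0)},
]
overdue = vacuum_is_overdue(rows, now=datetime(2024, 3, 20, 9, 0))
```

In a notebook the rows could come from something like `spark.sql("DESCRIBE HISTORY silver.orders").collect()`, with the table name taken from the earlier example.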
Six things. Most take under five minutes to check. The discipline is running them consistently, not heroically.