MLflow for Pipeline Engineers: Tracking Experiments Without Becoming a Data Scientist
If you're building data pipelines in Databricks and you're not a data scientist, MLflow might look like it's not for you. It's in the name: machine learning flow. But the core capability — tracking what ran, with what parameters, and what it produced — is genuinely useful for pipeline engineers. Here's the slice of MLflow that's worth knowing even if you never train a model.
What MLflow Is, From a Pipeline Perspective
MLflow has four components. Three of them are squarely ML tooling: MLflow Projects, MLflow Models (packaging and serving), and the Model Registry. The one that's useful for everyone is MLflow Tracking: a lightweight experiment logging system where you record parameters, metrics, and artifacts from a run.
For a data pipeline, this translates to: "I ran the daily order processing job. I processed 1.2 million rows. It took 18 minutes. The output table has 987,000 rows after deduplication. I used these partition bounds." That's an experiment log entry. It lives in a centralized location, it's queryable, and you can compare it to yesterday's run.
Logging a Pipeline Run
import mlflow

# Start a run; MLflow creates a unique run ID
with mlflow.start_run(run_name="daily_order_processing_2019-07-12"):
    # Log input parameters
    mlflow.log_param("processing_date", "2019-07-12")
    mlflow.log_param("source_table", "dbo.Orders")
    mlflow.log_param("partition_column", "order_date")

    # ... your pipeline logic here ...
    source_count = 1_234_567
    output_count = 987_432
    duplicates_removed = source_count - output_count
    run_time_minutes = 18.4

    # Log outcome metrics
    mlflow.log_metric("source_row_count", source_count)
    mlflow.log_metric("output_row_count", output_count)
    mlflow.log_metric("duplicates_removed", duplicates_removed)
    mlflow.log_metric("run_time_minutes", run_time_minutes)

    # Log a data quality artifact (a file, a report, a summary)
    with open("/tmp/run_summary.txt", "w") as f:
        f.write(f"Processed {output_count} rows in {run_time_minutes} minutes\n")
        f.write(f"Removed {duplicates_removed} duplicates\n")
    mlflow.log_artifact("/tmp/run_summary.txt")
Viewing Run History
Databricks has a built-in MLflow UI (Experiments in the sidebar). Every run appears there with its parameters and metrics. You can compare runs side by side to see why yesterday's job processed 10% fewer rows than last week's, or why today's run took twice as long.
# Query run history programmatically
import mlflow

client = mlflow.tracking.MlflowClient()
runs = client.search_runs(
    experiment_ids=["your-experiment-id"],
    # Params are stored as strings, so range operators such as >= are not
    # supported on them; LIKE matches every processing date in July 2019.
    filter_string="params.processing_date LIKE '2019-07-%'",
    order_by=["attributes.start_time DESC"],
)
for run in runs:
    print(run.data.params.get("processing_date"),
          run.data.metrics.get("output_row_count"),
          run.data.metrics.get("run_time_minutes"))
The One Habit That Pays Off
Wrap every production pipeline run in an MLflow run. Log at minimum: input parameters, output row count, run duration, and any data quality metrics you care about. This turns your pipeline from a black box into an observable system. When something goes wrong — and it will — you'll have a run history that tells you exactly when it started going wrong and what changed. As always, I'm here to help.
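One way to make that habit stick is a small wrapper that every job goes through. The sketch below is an illustration under assumptions, not an MLflow feature: `tracked_run` and `run_daily_orders` are hypothetical names, and the `tracker` argument exists only so the helper can be exercised with a stand-in object instead of a real tracking server. It relies on `mlflow.start_run`, `mlflow.log_params`, and `mlflow.log_metric`, and it reuses the example row counts from earlier in this post.

```python
import time
from contextlib import contextmanager


@contextmanager
def tracked_run(run_name, params, tracker=None):
    """Wrap a pipeline run in an MLflow run (hypothetical helper, not part of MLflow)."""
    if tracker is None:
        import mlflow  # imported lazily so a stand-in tracker can be passed in tests
        tracker = mlflow

    metrics = {}  # the pipeline body fills this in, e.g. metrics["output_row_count"] = n
    started = time.monotonic()
    with tracker.start_run(run_name=run_name):
        tracker.log_params(params)
        try:
            yield metrics
        finally:
            # Runs even when the body raises, so a failed run still records
            # its duration and whatever it measured before failing.
            metrics["run_time_minutes"] = (time.monotonic() - started) / 60
            for key, value in metrics.items():
                tracker.log_metric(key, float(value))


def run_daily_orders(processing_date, tracker=None):
    """Example job using the wrapper; the counts are placeholder values."""
    params = {"processing_date": processing_date, "source_table": "dbo.Orders"}
    run_name = f"daily_order_processing_{processing_date}"
    with tracked_run(run_name, params, tracker) as metrics:
        # ... your pipeline logic here ...
        metrics["source_row_count"] = 1_234_567
        metrics["output_row_count"] = 987_432
        metrics["duplicates_removed"] = (
            metrics["source_row_count"] - metrics["output_row_count"]
        )
    return metrics
```

Calling `run_daily_orders("2019-07-12")` in a notebook would log the parameters up front and the metrics on exit. Because the metrics are written in a `finally` block, a run that dies halfway still leaves its partial counts and duration in the history, which is usually the run you most want to inspect.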