MLflow for Pipeline Engineers: Tracking Experiments Without Becoming a Data Scientist

If you're building data pipelines in Databricks and you're not a data scientist, MLflow might look like it's not for you. It's in the name: machine learning flow. But the core capability — tracking what ran, with what parameters, and what it produced — is genuinely useful for pipeline engineers. Here's the slice of MLflow that's worth knowing even if you never train a model.

What MLflow Is, From a Pipeline Perspective

MLflow has four components. Three of them are squarely ML tooling (MLflow Projects, MLflow Models, and the Model Registry). The one that's useful for everyone is MLflow Tracking: a lightweight experiment logging system where you record parameters, metrics, and artifacts from a run.

For a data pipeline, this translates to: "I ran the daily order processing job. I processed 1.2 million rows. It took 18 minutes. The output table has 987,000 rows after deduplication. I used these partition bounds." That's an experiment log entry. It lives in a centralized location, it's queryable, and you can compare it to yesterday's run.

Logging a Pipeline Run

import mlflow

# Start a run — MLflow creates a unique run ID
with mlflow.start_run(run_name="daily_order_processing_2019-07-12"):

    # Log input parameters
    mlflow.log_param("processing_date", "2019-07-12")
    mlflow.log_param("source_table", "dbo.Orders")
    mlflow.log_param("partition_column", "order_date")

    # ... your pipeline logic here ...
    source_count = 1_234_567
    output_count = 987_432
    duplicates_removed = source_count - output_count
    run_time_minutes = 18.4

    # Log outcome metrics
    mlflow.log_metric("source_row_count", source_count)
    mlflow.log_metric("output_row_count", output_count)
    mlflow.log_metric("duplicates_removed", duplicates_removed)
    mlflow.log_metric("run_time_minutes", run_time_minutes)

    # Log a data quality artifact — a file, a report, a summary
    with open("/tmp/run_summary.txt", "w") as f:
        f.write(f"Processed {output_count} rows in {run_time_minutes} minutes\n")
        f.write(f"Removed {duplicates_removed} duplicates\n")
    mlflow.log_artifact("/tmp/run_summary.txt")
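The counts and duration above are hardcoded placeholders. In a real job you would measure them, and you would want the duration logged even when the pipeline fails. Here is a minimal sketch of a reusable wrapper that does this. The names `tracked_run`, `process_orders`, and the `pipeline_status` tag are hypothetical, and the `tracker` argument is an assumed testing seam that defaults to the real `mlflow` module.

```python
import time
from contextlib import contextmanager


@contextmanager
def tracked_run(run_name, params, tracker=None):
    """Wrap a pipeline body in an MLflow run and record its duration and status.

    `tracker` is a hypothetical seam for testing: any object exposing
    start_run, log_params, log_metric, and set_tag. It defaults to the
    real mlflow module, imported lazily.
    """
    if tracker is None:
        import mlflow  # imported here so the helper can be exercised without MLflow
        tracker = mlflow

    started = time.perf_counter()
    with tracker.start_run(run_name=run_name):
        tracker.log_params(params)
        try:
            yield tracker
            tracker.set_tag("pipeline_status", "succeeded")
        except Exception:
            tracker.set_tag("pipeline_status", "failed")
            raise
        finally:
            # Runs on success and on failure, so every run has a duration
            elapsed_minutes = (time.perf_counter() - started) / 60
            tracker.log_metric("run_time_minutes", elapsed_minutes)


def process_orders(processing_date, tracker=None):
    """Hypothetical pipeline entry point showing how the wrapper is used."""
    params = {"processing_date": processing_date, "source_table": "dbo.Orders"}
    run_name = f"daily_order_processing_{processing_date}"
    with tracked_run(run_name, params, tracker) as run:
        # ... your pipeline logic here ...
        source_count, output_count = 1_234_567, 987_432  # placeholders from the example above
        run.log_metric("source_row_count", source_count)
        run.log_metric("output_row_count", output_count)
        run.log_metric("duplicates_removed", source_count - output_count)
```

Calling `process_orders("2019-07-12")` with MLflow installed should produce a run much like the one logged by hand above, with the duration measured instead of typed in.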

Viewing Run History

Databricks has a built-in MLflow UI (Experiments in the sidebar). Every run appears there with its parameters and metrics. You can compare runs side by side to see why yesterday's job processed 10% fewer rows than last week's, or why today's run took twice as long.

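# Query run history programmatically
from mlflow.tracking import MlflowClient

client = MlflowClient()
runs = client.search_runs(
    experiment_ids=["your-experiment-id"],
    # Params are stored as strings, so range comparisons such as >= are not
    # supported on them; match the month with LIKE instead
    filter_string="params.processing_date LIKE '2019-07-%'",
    order_by=["attributes.start_time DESC"]
)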

for run in runs:
    print(run.data.params.get("processing_date"),
          run.data.metrics.get("output_row_count"),
          run.data.metrics.get("run_time_minutes"))
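The side-by-side comparison can also be automated. The sketch below flags any metric that moved by more than a threshold between two runs. It assumes each run's metrics arrive as a plain dictionary, which is the shape of `run.data.metrics`. The function names, the 10% threshold, and the sample numbers are illustrative and not part of MLflow.

```python
def relative_change(current, previous):
    """Return (current - previous) / previous, or None when previous is zero."""
    if previous == 0:
        return None
    return (current - previous) / previous


def flag_metric_shifts(current_metrics, previous_metrics, threshold=0.10):
    """Return {metric: relative change} for metrics that moved more than threshold."""
    shifts = {}
    for name, current in current_metrics.items():
        previous = previous_metrics.get(name)
        if previous is None:
            continue  # metric was not logged on the earlier run
        change = relative_change(current, previous)
        if change is not None and abs(change) > threshold:
            shifts[name] = change
    return shifts


# Placeholder metric dictionaries, shaped like run.data.metrics
today = {
    "source_row_count": 1_050_000.0,
    "output_row_count": 880_000.0,
    "run_time_minutes": 36.0,
}
yesterday = {
    "source_row_count": 1_000_000.0,
    "output_row_count": 1_000_000.0,
    "run_time_minutes": 18.0,
}

for name, change in flag_metric_shifts(today, yesterday).items():
    print(f"{name}: {change:+.0%}")
```

With the sample values, the output row count and the run time are flagged, while the 5% change in source rows stays under the threshold. Against real history, you could pass `runs[0].data.metrics` and `runs[1].data.metrics` from a `search_runs` result sorted newest first.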

The One Habit That Pays Off

Wrap every production pipeline run in an MLflow run. Log at minimum: input parameters, output row count, run duration, and any data quality metrics you care about. This turns your pipeline from a black box into an observable system. When something goes wrong — and it will — you'll have a run history that tells you exactly when it started going wrong and what changed. As always, I'm here to help.
