The Metadata Layer Problem: What Your Pipelines Should Know About Themselves

A mature data pipeline knows more than how to move data. It knows what data it read, what it produced, how that output relates to what it produced yesterday, and what would break if the upstream source changed its schema. Most pipelines today know almost none of this about themselves — and the cost of that ignorance shows up in debugging time, in stakeholder trust, and in the engineering hours spent reconstructing context that the pipeline should have been capturing all along.

This is the metadata layer problem. The tooling to solve it is just starting to emerge in 2017, but the patterns are clear enough to implement now without waiting for the tools to mature.

What Your Pipeline Should Know About Itself

Four categories of self-knowledge matter for a production data pipeline:

Job-level execution metadata. What ran, when it started, when it finished, how many records it processed, how many records it rejected, and whether it succeeded. This sounds obvious but most pipelines log it inconsistently — some jobs log it to stdout where it's captured and forgotten, some don't log it at all.

Data lineage. Which upstream tables or files this job read, and which downstream tables or files it produced. Lineage lets you answer "if this source table is corrupted, what downstream outputs are affected?" without tracing code by hand.

Schema fingerprinting. A hash or structured record of the schema the job read from each source at execution time. If the source schema changes between runs, the fingerprint change surfaces it. This is the early warning system for the silent schema mismatch failures described in the 2013 Hive Metastore post.

Output quality signals. Row counts, null rates, value distribution summaries for key columns. Not full statistical profiling on every run — just enough to detect that something changed significantly from the previous run.

Implementing an Audit Table

The simplest version of this requires no specialized tooling. A single audit table captures the execution metadata and quality signals for every job run:

from pyspark.sql import SparkSession
from datetime import datetime
import json

def run_pipeline_with_audit(spark: SparkSession, job_name: str, execution_date: str):
    audit = {
        "job_name": job_name,
        "execution_date": execution_date,
        "start_time": datetime.utcnow().isoformat(),
        "status": "running",
        "input_sources": [],
        "output_targets": [],
        "record_counts": {}
    }

    try:
        # Read source with schema capture
        source_df = spark.read.parquet(
            f"s3://data-lake/processed/events/event_date={execution_date}/"
        )
        audit["input_sources"].append({
            "path": f"s3://data-lake/processed/events/event_date={execution_date}/",
            "schema_fields": [f.name for f in source_df.schema.fields],
            "row_count": source_df.count()
        })

        # Transform
        result_df = transform(source_df)

        # Write output
        output_path = f"s3://data-lake/sessions/event_date={execution_date}/"
        result_df.write.mode("overwrite").parquet(output_path)

        audit["output_targets"].append({
            "path": output_path,
            "row_count": result_df.count()
        })
        audit["status"] = "success"

    except Exception as e:
        audit["status"] = "failed"
        audit["error"] = str(e)
        raise
    finally:
        audit["end_time"] = datetime.utcnow().isoformat()
        # Write audit record
        audit_df = spark.createDataFrame([audit])
        audit_df.write.mode("append") \
            .json("s3://data-lake/pipeline-audit/")

    return result_df

The Tooling Landscape in 2017

Apache Atlas (incubating) is the most fully-formed lineage and metadata tool available right now. It integrates with Hive, Kafka, and Falcon to capture lineage automatically. The setup is non-trivial — Atlas requires its own cluster and a graph database (it ships with JanusGraph) — but for organizations where data governance and compliance make lineage a hard requirement, it's the most complete option available today.

The AWS Glue Data Catalog, covered in December, handles schema management and discovery but not job-level lineage. It tells you what tables exist and what their schemas are; it doesn't tell you which job created a given table partition or what it read to produce it.

For most teams in 2017, the practical answer is the audit table pattern above: instrument your pipelines to write structured execution metadata to a persistent store, and build the lineage query layer on top of that. It's less complete than Atlas but deployable in a day and queryable with standard SQL.

If you're building a metadata capture layer for your pipelines and have patterns worth comparing, I'd like to hear about it. As always, I'm here to help.

Read more