Metadata + LLM: The 80% Problem and Why the Last 20% Is the Whole Point
The config-driven pipeline pattern I've been building on since 2015 turns out to be a near-perfect LLM prompt format. Structured, explicit, machine-readable, describes both the data shape and the transformation intent. When I started feeding pipeline configs to GPT-4 alongside source schemas and asking it to generate new configs for similar tables, the results were good enough to be genuinely useful — and wrong in a consistent enough way to reveal exactly where the boundary is.
The boundary is the 80/20 line. The LLM handles roughly 80% of a config correctly and automatically. The remaining 20% requires something the model fundamentally cannot have: the institutional knowledge that lives in the heads of the people who have run this pipeline for two years.
This post is about that 20% — what it is, why it exists, and why optimizing it away is the wrong goal.
What the 80% Looks Like
Feed GPT-4 a well-specified source schema and a description of the transformation, and it produces a correct column mapping, a sensible partition strategy, reasonable filter conditions, and a valid Spark job scaffold. For onboarding a new table that follows the same structural pattern as existing tables in your pipeline inventory, the generated config requires minimal correction.
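import json

# CONFIG_SCHEMA, EXAMPLE_CONFIG, source_schema_description, and requirements
# are assumed to be defined earlier in the script.
prompt = f"""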
You are generating a pipeline config following this schema:
{json.dumps(CONFIG_SCHEMA, indent=2)}
Here is an example of a correct config for a similar table:
{json.dumps(EXAMPLE_CONFIG, indent=2)}
Generate a config for this new table:
Source schema: {source_schema_description}
Transformation requirements: {requirements}
Output the config as valid JSON only.
"""

The output for a well-described, structurally standard table is typically 80–90% correct. Column mappings are right. Partition column inference is right. Filter conditions derived from the description are right. Data type handling follows the pattern. This is a real time saving for routine table onboarding.
What Lives in the Other 20%
Here's where the model consistently falls short, and why:
Null semantics specific to your data. The model sees a nullable column and applies generic null handling. It doesn't know that nulls in your channel_id field mean "direct traffic" in the marketing context, not "missing data" — and that the downstream attribution model depends on that distinction. This is domain knowledge that exists nowhere the model can access it.
Historical quirks in upstream sources. The vendor changed their timestamp format three times. Events before a specific date use a different schema version. One specific event type has a known bug in the source system that requires a correction at ingest. The model generates a clean config assuming a clean source. Your data is not clean.
Downstream consumer requirements. The analytics team built their dashboard assuming a specific column name. The ML team expects this field to be a double, not a long, because of how their training pipeline reads it. The model generates what makes sense for the source; it doesn't know about the downstream contracts that constrain the output.
Business rules that were never written down. The rule that orders from certain test accounts should be excluded. The rule that a specific event type is counted differently for enterprise customers. The rule that existed as a verbal agreement between two people who are no longer on the team. The model cannot generate what was never documented.
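To make the four gaps concrete, here is a rough sketch of what a reviewer's corrections might look like once they are layered onto a generated config. Everything in it is illustrative: the config keys (null_defaults, type_casts, filters), the order_value and account_type fields, and the apply_overrides helper are assumptions for the example, not part of any real pipeline. The point is that each entry comes from a person who knows the data, not from the model.

```python
from copy import deepcopy

# Hypothetical reviewer-supplied corrections, one per kind of gap described above.
# Field names and values are illustrative only.
INSTITUTIONAL_OVERRIDES = {
    "null_defaults": {"channel_id": "direct"},    # null means direct traffic, not missing
    "type_casts": {"order_value": "double"},      # downstream ML pipeline expects a double
    "extra_filters": ["account_type != 'test'"],  # unwritten rule: exclude test accounts
}


def apply_overrides(generated: dict, overrides: dict) -> dict:
    """Return a copy of the generated config with reviewer overrides layered on top."""
    config = deepcopy(generated)  # leave the model's draft untouched for comparison
    config.setdefault("null_defaults", {}).update(overrides.get("null_defaults", {}))
    config.setdefault("type_casts", {}).update(overrides.get("type_casts", {}))
    filters = config.setdefault("filters", [])
    for condition in overrides.get("extra_filters", []):
        if condition not in filters:  # avoid duplicating a filter the model already produced
            filters.append(condition)
    return config
```

Keeping the model's draft unmodified and producing a corrected copy makes it easy to diff the two and see exactly what the human added.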
Why the Review Step Is Load-Bearing
The instinct when a tool produces 80% correct output is to ask how to get it to 90%, then 95%, then 100%. For LLM pipeline generation, this is the wrong instinct. The 20% that's wrong is wrong because it requires knowledge the model cannot have — and no amount of prompt engineering closes a knowledge gap that exists outside the prompt.
The review step is not an inefficiency to automate away. It's the mechanism by which human judgment applies institutional knowledge to model-generated output. A human reviewing the generated config catches the null semantic error because they know the data. They catch the downstream consumer constraint because they were in the meeting. They catch the undocumented business rule because they remember the incident that created it.
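One way to keep the review step from quietly disappearing is to make it structural. The sketch below assumes a hypothetical ConfigDraft wrapper and deploy function (neither comes from a real framework): a generated config simply cannot be deployed until a named person has signed off and left a note on what they checked.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ConfigDraft:
    """A model-generated config that stays a draft until a person approves it."""
    config: dict
    approved_by: Optional[str] = None
    review_notes: List[str] = field(default_factory=list)

    def approve(self, reviewer: str, note: str) -> None:
        """Record who reviewed the draft and what they checked or corrected."""
        if not reviewer.strip():
            raise ValueError("approval needs a named reviewer")
        self.approved_by = reviewer
        self.review_notes.append(note)


def deploy(draft: ConfigDraft) -> dict:
    """Hypothetical deploy step: refuses any draft without a named reviewer."""
    if draft.approved_by is None:
        raise PermissionError("generated config has not been reviewed")
    return draft.config
```

The review notes double as a record of the institutional knowledge that was applied, which is exactly the part the model could not supply.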
The right framing: the model generates the first draft. You generate the final draft. The time savings are real (a first draft in thirty seconds rather than two hours), but the accountability for correctness stays with you. Trust the model with the 80%. Own the 20%.