LLM-Generated Code Still Has to Pass Your CI Pipeline

Six months of heavy Copilot and ChatGPT use has produced a clear pattern: the model generates code that passes visual inspection and fails automated checks. Not because the model is careless — because the model doesn't know your standards. It knows Python. It knows PySpark. It does not know that your team always wraps partition writes in a try/finally that cleans up the staging path, always logs row counts to the audit table, and never uses bare except clauses without at minimum re-raising.

The fix is not better prompting. You can describe your standards in the prompt and the model will respect them inconsistently, especially across a long conversation where the early instructions drift out of effective attention. The fix is the same thing that catches inconsistent human-written code: automated checks that run on every change regardless of who or what wrote it.

LLM adoption is a forcing function to take your CI pipeline seriously. If you haven't already, now is the time.

What "Automated Checks" Means for Data Pipeline Code

The full stack, for a PySpark + Airflow shop:

Linting and style. flake8 or pylint for Python style violations, black for formatting. The model often writes code that ignores your line-length conventions, mixes quote styles, and imports things in the wrong order. These take seconds to fix, but they clutter code review. Note that black does not reorder imports, so import ordering needs isort or a lint plugin. Autoformat on commit; lint on push.

# .pre-commit-config.yaml
repos:
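  - repo: https://github.com/psf/black
    rev: 23.1.0
    hooks:
      - id: black
        language_version: python3
        # Match flake8's limit below; black defaults to 88
        args: [--line-length=100]
  - repo: https://github.com/PyCQA/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        # E203 conflicts with black's slice formatting
        args: [--max-line-length=100, --extend-ignore=E203]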

Type checking. mypy catches the category of error where the model passes the wrong argument type to a function, provided your functions carry type annotations. A common case in generated code: a function that expects a DataFrame receives a Column because the model slightly misread the context. mypy catches these before they surface as runtime errors. It does not see column-level schemas, so a wrong dtype on a column still has to be caught by a test.
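As an illustration of the DataFrame-versus-Column mix-up, here is a minimal sketch. It assumes a PySpark version that ships its own type hints; the function drop_null_keys and the column name order_id are hypothetical, made up for this example.

```python
from typing import TYPE_CHECKING

if TYPE_CHECKING:
    # Only needed by the type checker, so this file imports without Spark.
    from pyspark.sql import DataFrame


def drop_null_keys(df: "DataFrame", key: str) -> "DataFrame":
    """Hypothetical helper: drop rows whose key column is null."""
    return df.dropna(subset=[key])


# The kind of call a model produces after misreading the context:
#
#     from pyspark.sql import functions as F
#     drop_null_keys(F.col("order_id"), "order_id")
#
# F.col(...) returns a Column, not a DataFrame. With the annotations above,
# mypy should report an incompatible argument type for this call instead of
# letting it fail at runtime.
```

Without the annotations on drop_null_keys, mypy has nothing to compare against, so the check is only as good as the type hints in your own helpers.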

DAG integrity tests. Airflow's DagBag import — covered in the 2016 CI/CD post — still applies. LLM-generated DAGs have the same failure modes as human-generated ones: circular dependencies, missing task references, import errors from libraries the model assumed were installed. Run the integrity suite on every push to the dags folder.

Unit tests for transformation logic. This is the non-negotiable one. If the model generated a transformation, the transformation needs a test before it goes to staging. Not because the model is probably wrong — because you need to know definitively that it's right, and visual inspection of LLM-generated business logic is not sufficient evidence.

The Specific Failure Mode That Keeps Appearing

LLM-generated error handling is reliably bad. The model wraps things in try/except blocks where the except swallows exceptions without logging or re-raising. It catches Exception when you'd want AnalysisException. It handles the exception and then continues as if nothing happened, producing a job that silently succeeds on bad data.

# What the model generates:
try:
    df = spark.read.parquet(source_path)
    result = transform(df)
    result.write.mode("overwrite").parquet(target_path)
except Exception:
    pass  # This is a disaster in a pipeline

# What your CI should enforce:
try:
    df = spark.read.parquet(source_path)
    result = transform(df)
    result.write.mode("overwrite").parquet(target_path)
except Exception:
    # logger.exception records the message plus the full traceback
    logger.exception("Pipeline failed for %s", source_path)
    raise  # Always re-raise in a pipeline stage

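A lint rule that flags bare except clauses and except blocks containing only pass catches this class of error automatically. Be aware that flake8's built-in E722 covers only the bare form (except: with no exception type). The except Exception: pass pattern shown above needs an additional check, such as pylint's broad-exception warning, Bandit's try-except-pass check, or a custom rule. Once the rule runs in CI, the problem never reaches code review.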
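To make the idea concrete, here is a minimal sketch of a custom check built on Python's standard ast module. It is not a flake8 plugin, and the name find_swallowed_exceptions is hypothetical; treat it as a starting point that a CI step could run over changed files.

```python
import ast


def find_swallowed_exceptions(source: str) -> list:
    """Return sorted (line, message) pairs for risky except handlers."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ExceptHandler):
            continue
        # `except:` with no exception type at all.
        if node.type is None:
            problems.append((node.lineno, "bare except"))
        # A handler whose entire body is `pass` swallows the error silently.
        if all(isinstance(stmt, ast.Pass) for stmt in node.body):
            problems.append((node.lineno, "except block only contains pass"))
    return sorted(problems)


if __name__ == "__main__":
    sample = "try:\n    run()\nexcept Exception:\n    pass\n"
    for line, message in find_swallowed_exceptions(sample):
        print(f"line {line}: {message}")
```

The check is deliberately narrow: it does not try to decide whether a handler logs or re-raises, only whether it is bare or does nothing at all.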

The Right Mental Model for LLM Code Contributions

Treat LLM-generated code as a PR from an external contributor who is skilled, fast, and unfamiliar with your standards. You wouldn't merge a PR from that contributor without running your full CI suite. Don't merge LLM-generated code without running it either. The contributor is faster than a human junior engineer; the review discipline is the same.

The CI pipeline you built to catch human inconsistency will catch model inconsistency equally well. It turns out the old tools are exactly the right guardrails for the new workflow.