Using LLMs to Generate Great Expectations Suites: What Works and What Doesn't

The natural question, given that I'm building LLM-powered data tooling at the moment: can you use a language model to generate a Great Expectations suite? Profile a dataset, describe the domain, and have the model produce a meaningful suite that doesn't require hours of manual expectation writing?

I've been experimenting with this. The answer is: partially. It's useful, but not in the way you might hope.

What LLMs Are Good At Here

Translating business rules into expectations. If you describe the domain in plain English, a capable LLM can suggest the right GE expectations. "An insurance claim amount must be positive, can't exceed $10 million, and claims with zero injury count shouldn't have medical expenses" maps directly to:

# Claim amount must be positive and capped at $10 million.
validator.expect_column_values_to_be_between(
    "claim_amount", min_value=0, strict_min=True, max_value=10_000_000
)
# Among claims with no injuries, medical expenses must be zero.
validator.expect_column_values_to_be_in_set(
    "medical_expenses", value_set=[0],
    row_condition="injury_count == 0",
    condition_parser="pandas",
)

This translation work, from business rule to GE expectation syntax, is exactly the kind of thing a language model handles well. It has seen the GE API, it understands the English description, and it can usually suggest the right expectation type and parameters.
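A conditional expectation filters rows with row_condition first, then applies the check to whatever remains. Here's a minimal plain-Python sketch of that semantics. The sample rows and the helper name are invented for illustration, and none of this is GE code.

```python
# Hypothetical sample rows; column names follow the insurance example.
claims = [
    {"claim_amount": 1200.0, "injury_count": 0, "medical_expenses": 0.0},
    {"claim_amount": 560.0, "injury_count": 2, "medical_expenses": 310.0},
    {"claim_amount": 900.0, "injury_count": 0, "medical_expenses": 75.0},
]


def conditional_violations(rows, condition, check):
    """Return indexes of rows that match `condition` but fail `check`."""
    return [i for i, row in enumerate(rows) if condition(row) and not check(row)]


bad_rows = conditional_violations(
    claims,
    condition=lambda r: r["injury_count"] == 0,  # plays the role of row_condition
    check=lambda r: r["medical_expenses"] == 0,  # the expectation itself
)
print(bad_rows)
```

The condition only selects rows. If the violation itself is written into the condition, the check has nothing left to catch.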

Generating suite boilerplate from schema. Given a table schema or DataFrame column list with types, an LLM can generate a starter suite with existence, not-null, and type-appropriate range expectations much faster than manual writing. It's the same thing the GE Profiler does, but you can supplement it with domain knowledge in the prompt:

prompt = """
Generate a Great Expectations suite for a storm events table with these columns:
- event_id (integer, primary key)
- event_date (date, not null)
- state (string, US state abbreviation)
- event_type (string, one of: Tornado, Hail, Flash Flood, Thunderstorm Wind)
- magnitude (float, hail size in inches or wind speed, domain: 0-200)
- injuries (integer, non-negative)
- deaths (integer, non-negative)

Business rules:
- Events before 1950 are not valid
- Hail magnitude should be between 0.5 and 8.0 inches
- Tornado events should not have magnitude values above 5 (EF scale, not wind speed)
"""

# Model generates GE expectation calls with appropriate parameters
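If you already have the schema in structured form, the prompt can be assembled rather than hand-typed. This is a rough sketch, assuming a small Column record and a build_suite_prompt helper that I've made up for illustration. It only builds the string; sending it to a model is left out.

```python
from dataclasses import dataclass


@dataclass
class Column:
    name: str
    dtype: str
    note: str = ""


def build_suite_prompt(table: str, columns: list[Column], rules: list[str]) -> str:
    """Render a schema and a list of business rules into a suite-generation prompt."""
    lines = [f"Generate a Great Expectations suite for a {table} table with these columns:"]
    for col in columns:
        detail = f"{col.dtype}, {col.note}" if col.note else col.dtype
        lines.append(f"- {col.name} ({detail})")
    lines.append("")
    lines.append("Business rules:")
    lines.extend(f"- {rule}" for rule in rules)
    return "\n".join(lines)


prompt = build_suite_prompt(
    "storm events",
    [
        Column("event_id", "integer", "primary key"),
        Column("injuries", "integer", "non-negative"),
    ],
    ["Events before 1950 are not valid"],
)
print(prompt)
```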

What LLMs Get Wrong

Hallucinated expectation names. Language models will confidently generate calls to GE expectations that don't exist. I've seen models produce expect_column_values_to_be_non_negative (doesn't exist — use expect_column_values_to_be_between(min_value=0)), expect_column_to_have_no_duplicates (doesn't exist — use expect_column_values_to_be_unique), and various other plausible-sounding but incorrect expectation names. Always run generated suites against a sample dataset before trusting them.
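A cheap first filter is to parse the generated code and compare every expect_* call against a list of names you know exist. The sketch below uses Python's ast module. The allowlist is a hand-picked partial sample for illustration, and in practice you'd want to build it from the GE version you actually have installed.

```python
import ast

# Partial allowlist, for illustration only.
KNOWN_EXPECTATIONS = {
    "expect_column_to_exist",
    "expect_column_values_to_not_be_null",
    "expect_column_values_to_be_unique",
    "expect_column_values_to_be_between",
    "expect_column_values_to_be_in_set",
}


def unknown_expectations(source: str) -> list[str]:
    """Return expect_* method names called in `source` that are not in the allowlist."""
    unknown = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute):
            name = node.func.attr
            if name.startswith("expect_") and name not in KNOWN_EXPECTATIONS:
                unknown.add(name)
    return sorted(unknown)


generated = '''
validator.expect_column_values_to_be_non_negative("injuries")
validator.expect_column_to_have_no_duplicates("event_id")
validator.expect_column_values_to_be_unique("event_id")
'''
print(unknown_expectations(generated))
```

This catches invented names without touching any data, but it doesn't replace the sample run: a real expectation with the wrong arguments will sail through.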

Wrong parameter names. The GE API has specific parameter names (min_value, max_value, value_set, mostly) that models sometimes get wrong when generating code. Syntactically valid Python with wrong parameter names fails at runtime, not at generation time.
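The same static approach can flag suspicious keyword names before anything runs. This is a sketch under a big assumption: the ALLOWED_KWARGS table is hand-written and partial, so check it against the docs for your GE version before relying on it.

```python
import ast
import difflib

# Hand-written, partial table of keyword names, for illustration only.
ALLOWED_KWARGS = {
    "expect_column_values_to_be_between": {
        "column", "min_value", "max_value", "strict_min", "strict_max", "mostly",
    },
    "expect_column_values_to_be_in_set": {"column", "value_set", "mostly"},
}


def bad_kwargs(source: str) -> list[tuple[str, str, list[str]]]:
    """Return (expectation, bad keyword, suggestions) for each unrecognized keyword."""
    problems = []
    for node in ast.walk(ast.parse(source)):
        if not (isinstance(node, ast.Call) and isinstance(node.func, ast.Attribute)):
            continue
        allowed = ALLOWED_KWARGS.get(node.func.attr)
        if allowed is None:
            continue  # not in the table, so there is nothing to compare against
        for kw in node.keywords:
            if kw.arg is not None and kw.arg not in allowed:
                hints = difflib.get_close_matches(kw.arg, sorted(allowed), n=1)
                problems.append((node.func.attr, kw.arg, hints))
    return problems


generated = 'validator.expect_column_values_to_be_in_set("state", values=["TX", "OK"])'
for expectation, keyword, hints in bad_kwargs(generated):
    print(f"{expectation}: unexpected keyword {keyword!r}, did you mean {hints}?")
```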

Missing the business rule nuance. A model can translate "claim amount must be positive" correctly. It can't tell you that your specific client's claim processing system caps at $10M for standard policies but $50M for commercial lines, or that zero-amount claims are valid for denied claims but not for approved ones. Domain knowledge still lives with domain experts, not in the model.
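Once a domain expert supplies that nuance, encoding it is mechanical. Here's a sketch that turns per-segment caps into keyword arguments for conditional range expectations. The policy_type column and the segment labels are hypothetical, and the caps are the ones from the example in the paragraph above.

```python
# Hypothetical column name and segment labels; caps supplied by a domain expert.
CLAIM_CAPS = {
    "standard": 10_000_000,
    "commercial": 50_000_000,
}


def claim_cap_expectations(caps: dict[str, int]) -> list[dict]:
    """Build keyword arguments for one conditional range expectation per segment."""
    return [
        {
            "column": "claim_amount",
            "min_value": 0,  # zero is allowed here; denied vs. approved needs its own rule
            "max_value": cap,
            "row_condition": f'policy_type == "{segment}"',
            "condition_parser": "pandas",
        }
        for segment, cap in caps.items()
    ]


for kwargs in claim_cap_expectations(CLAIM_CAPS):
    print(kwargs)
```

Each dict is meant to be passed to expect_column_values_to_be_between as keyword arguments. The point is that the numbers and the segments come from a person, not from the model.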

The Right Integration Pattern

The pattern I've found most useful: use an LLM to generate a first draft of the suite from schema + business rule description, then run the draft against a sample dataset to find which expectations error (hallucinated names, wrong parameters) and which fail (ranges too tight, missing values in value_set). The correction loop is fast — much faster than writing the suite from scratch.
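The triage step in that loop can be sketched in a few lines. I'm assuming each draft expectation is wrapped in a zero-argument callable that returns a dict with a "success" key; GE's result object would need a thin wrapper to fit that shape. The draft checks below are stand-ins so the sketch runs without GE installed.

```python
from typing import Callable


def triage(checks: dict[str, Callable[[], dict]]) -> dict[str, list[str]]:
    """Sort checks into errored, failed, and passed buckets."""
    buckets: dict[str, list[str]] = {"errored": [], "failed": [], "passed": []}
    for name, run in checks.items():
        try:
            result = run()
        except Exception as exc:  # hallucinated name, wrong parameter, and so on
            buckets["errored"].append(f"{name}: {type(exc).__name__}")
            continue
        buckets["passed" if result.get("success") else "failed"].append(name)
    return buckets


# Stand-in for a call to an expectation that does not exist.
def _missing_expectation() -> dict:
    raise AttributeError("no such expectation")


draft = {
    "injuries_non_negative": _missing_expectation,
    "magnitude_range": lambda: {"success": False},
    "event_id_unique": lambda: {"success": True},
}
print(triage(draft))
```

Errored checks go back to the model (or get fixed by hand); failed checks are the ones worth a conversation with whoever owns the data.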

The LLM is doing what it's good at: translating from natural language to structured code. The profiler is doing what it's good at: deriving statistical constraints from the actual data. You're doing what only you can do: applying domain knowledge to verify both are correct.
