The last few years of home lab work have reinforced a pattern I already believed in: anything that controls system behavior belongs in source control. Kubernetes manifests, Terraform configs, Ansible playbooks, SSDT database projects with tSQLt tests. The question isn't whether to version-control it — it's how.
Great Expectations expectation suites are the data quality equivalent of these configuration-as-code artifacts. They're JSON files. They're human-readable. They're diffable. They belong in the same Git repository as the pipeline code they protect.
What a Suite File Looks Like
When you save an expectation suite, GE writes a JSON file that's more readable than you might expect:
{
  "expectation_suite_name": "storm_events_raw",
  "expectations": [
    {
      "expectation_type": "expect_column_to_exist",
      "kwargs": {"column": "EVENT_ID"},
      "meta": {}
    },
    {
      "expectation_type": "expect_column_values_to_not_be_null",
      "kwargs": {"column": "EVENT_ID", "mostly": 1.0},
      "meta": {"notes": "Primary key, must never be null"}
    },
    {
      "expectation_type": "expect_column_values_to_be_between",
      "kwargs": {
        "column": "MAGNITUDE",
        "min_value": 0,
        "max_value": 12.0,
        "mostly": 0.99
      },
      "meta": {}
    },
    {
      "expectation_type": "expect_table_row_count_to_be_between",
      "kwargs": {"min_value": 1000, "max_value": 500000},
      "meta": {"notes": "Annual storm event files range from 50k to 200k rows"}
    }
  ],
  "meta": {
    "great_expectations_version": "0.11.9",
    "created_by": "shannon@toyboxcreations.net"
  }
}
A pull request that changes max_value from 12.0 to 15.0 on the magnitude expectation is a one-line diff that triggers a code review conversation about whether the data contract changed intentionally.
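For reviewers who want the same change summarized outside of Git, here is a minimal sketch that compares two versions of a suite and reports which keyword arguments moved. It uses only the standard library and the suite layout shown above. The helper names `expectation_key` and `diff_suites` are my own, not part of Great Expectations, and the sketch assumes at most one expectation of a given type per column.

```python
import json


def expectation_key(exp):
    """Identify an expectation by its type and target column (None for table-level)."""
    return (exp["expectation_type"], exp["kwargs"].get("column"))


def diff_suites(old_suite, new_suite):
    """Return readable lines describing added, removed, and changed expectations."""
    old = {expectation_key(e): e["kwargs"] for e in old_suite["expectations"]}
    new = {expectation_key(e): e["kwargs"] for e in new_suite["expectations"]}
    changes = []
    # key=str keeps sorting safe when the column is None (table-level expectations).
    for key in sorted(old.keys() - new.keys(), key=str):
        changes.append(f"removed {key[0]} on {key[1]}")
    for key in sorted(new.keys() - old.keys(), key=str):
        changes.append(f"added {key[0]} on {key[1]}")
    for key in sorted(old.keys() & new.keys(), key=str):
        for name in sorted(old[key].keys() | new[key].keys()):
            before, after = old[key].get(name), new[key].get(name)
            if before != after:
                changes.append(f"{key[0]} on {key[1]}: {name} {before} -> {after}")
    return changes


old_suite = {
    "expectation_suite_name": "storm_events_raw",
    "expectations": [
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "MAGNITUDE", "min_value": 0,
                       "max_value": 12.0, "mostly": 0.99},
            "meta": {},
        }
    ],
}
# Deep copy via a JSON round trip, then apply the change from the pull request.
new_suite = json.loads(json.dumps(old_suite))
new_suite["expectations"][0]["kwargs"]["max_value"] = 15.0

for line in diff_suites(old_suite, new_suite):
    print(line)
# prints: expect_column_values_to_be_between on MAGNITUDE: max_value 12.0 -> 15.0
```

In a real repository the two dictionaries would come from `json.load` on the old and new versions of the suite file rather than from inline literals.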
The Repository Structure
The structure I've settled on keeps suites next to the pipeline code they protect:
storm_pipeline/
    great_expectations/
        expectations/
            storm_events_raw.json          # what NOAA delivers
            storm_events_processed.json    # what our transforms produce
            storm_features.json            # input to model training
        great_expectations.yml
    src/
        ingest.py
        transform.py
        features.py
    tests/
        test_transforms.py                 # unit tests for Python code
    Dockerfile
The expectation suites live in the same repo as the pipeline code. When you update a transform that changes the output schema, the suite update belongs in the same commit. Better still, update the suite first, so that a failing validation reminds you to update the transform.
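Because the suites sit in the repo, a cheap pre-merge check can confirm that each file is still well-formed before any data is validated. The sketch below is one way to do that with the standard library only. The function names are hypothetical, and the rule that `expectation_suite_name` should match the file name is an assumption based on the example suite above, not something this check gets from Great Expectations itself.

```python
import json
from pathlib import Path


def check_suite_text(filename, text):
    """Return a list of problems found in the contents of one suite file."""
    try:
        suite = json.loads(text)
    except json.JSONDecodeError as err:
        return [f"{filename}: invalid JSON ({err.msg} at line {err.lineno})"]
    if not isinstance(suite, dict):
        return [f"{filename}: top level is not a JSON object"]
    problems = []
    # Assumed convention: storm_events_raw.json holds the suite "storm_events_raw".
    name = suite.get("expectation_suite_name")
    if name != Path(filename).stem:
        problems.append(f"{filename}: suite name {name!r} does not match file name")
    if not suite.get("expectations"):
        problems.append(f"{filename}: no expectations defined")
    return problems


def check_directory(expectations_dir):
    """Run check_suite_text over every .json file directly inside a directory."""
    problems = []
    for path in sorted(Path(expectations_dir).glob("*.json")):
        problems.extend(check_suite_text(path.name, path.read_text(encoding="utf-8")))
    return problems
```

A CI step could call `check_directory("great_expectations/expectations")` and fail the build when the returned list is non-empty.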
Using the meta Field for Documentation
Each expectation in a suite has a meta dictionary for arbitrary annotations. Use it for the "why" behind non-obvious expectations:
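import great_expectations as ge

# Legacy Pandas dataset API (GE 0.11.x); the CSV path is a placeholder.
df = ge.read_csv("storm_events.csv")

df.expect_column_values_to_be_between(
    'MAGNITUDE',
    min_value=0,
    max_value=12.0,
    mostly=0.99,
    meta={
        # Adjacent string literals are concatenated into a single note.
        "notes": "Hail size in inches. Physical upper bound ~8. "
                 "0.99 mostly allows for 1% of NOAA transcription errors. "
                 "max_value=12 is deliberately generous; flag it if we see values above 8.",
        "last_reviewed": "2020-07-01",
        "reviewer": "shannon"
    }
)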
This is the equivalent of a meaningful comment in code — explaining the why of the constraint, not just restating what the JSON already says. It survives in source control alongside the expectation itself.
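Once the annotations are in the JSON, they can be audited like any other data. The following is a rough sketch, assuming the `notes` and `last_reviewed` keys used in the example above. The function name `review_gaps` and the one-year `max_age_days` threshold are my own illustrative choices.

```python
from datetime import date


def review_gaps(suite, today, max_age_days=365):
    """List expectations that have no notes or whose last review is too old."""
    gaps = []
    for exp in suite["expectations"]:
        meta = exp.get("meta") or {}
        # Table-level expectations have no "column" kwarg, so label them "table".
        label = f'{exp["expectation_type"]}({exp["kwargs"].get("column", "table")})'
        if not meta.get("notes"):
            gaps.append(f"{label}: no notes")
        reviewed = meta.get("last_reviewed")
        if reviewed and (today - date.fromisoformat(reviewed)).days > max_age_days:
            gaps.append(f"{label}: last reviewed {reviewed}")
    return gaps


example_suite = {
    "expectation_suite_name": "storm_events_raw",
    "expectations": [
        {
            "expectation_type": "expect_column_to_exist",
            "kwargs": {"column": "EVENT_ID"},
            "meta": {},
        },
        {
            "expectation_type": "expect_column_values_to_be_between",
            "kwargs": {"column": "MAGNITUDE", "min_value": 0, "max_value": 12.0},
            "meta": {"notes": "Hail size in inches.",
                     "last_reviewed": "2020-07-01",
                     "reviewer": "shannon"},
        },
    ],
}

for gap in review_gaps(example_suite, today=date(2021, 8, 1)):
    print(gap)
# prints:
# expect_column_to_exist(EVENT_ID): no notes
# expect_column_values_to_be_between(MAGNITUDE): last reviewed 2020-07-01
```

Whether a missing note should block a merge or only produce a warning is a team decision; the sketch just surfaces the gaps.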
Treating Suite Changes Like Schema Changes
A change to an expectation suite is a change to a data contract. It should go through the same review process as a schema change in SSDT or a stored procedure change in your SQL Server project. The commit message should explain what changed and why. The diff should be reviewable. The change should be traceable.
When something goes wrong with data quality, the Git history of your suite files tells you when the contract changed, who changed it, and why. That's the audit trail that makes data quality enforceable rather than aspirational.