Your First Expectation Suite: Organizing Data Quality Checks in Great Expectations

Running individual expectations against a DataFrame is useful, but it doesn't scale. When you have 20 expectations spread across a data pipeline script, they're going to get lost in the noise. What you need is a way to group expectations into a reusable, named, version-controllable artifact that travels with the pipeline.

Great Expectations calls this an expectation suite.

What an Expectation Suite Is

An expectation suite is a named, serializable collection of expectations. Instead of running assertions inline in your pipeline script, you define a suite once, save it as JSON, and run it against any compatible dataset on demand. The suite is separate from the data and separate from the pipeline — it's the data contract expressed as a file you can check into source control.

Building a Suite

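The workflow has three steps: load the data into a GE DataFrame, declare your expectations against it, then save the suite to disk. ge.read_csv returns a pandas DataFrame subclass with the expect_* methods attached. Each expect_* call checks the data immediately and also records the expectation on the DataFrame, which is what makes the final save possible.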

import great_expectations as ge

df = ge.read_csv('storm_events_sample.csv')

# Define expectations — these describe what "valid data" means for this dataset
df.expect_column_to_exist('event_id')
df.expect_column_to_exist('event_date')
df.expect_column_to_exist('state')
df.expect_column_to_exist('event_type')
df.expect_column_to_exist('magnitude')

df.expect_column_values_to_not_be_null('event_id')
df.expect_column_values_to_not_be_null('event_date')
df.expect_column_values_to_not_be_null('state')

df.expect_column_values_to_be_in_set('event_type',
    value_set=['Tornado', 'Hail', 'Flash Flood', 'Thunderstorm Wind', 'Winter Storm'])

df.expect_column_values_to_be_between('magnitude',
    min_value=0, max_value=12.0, mostly=0.99)  # 99% of rows must pass

df.expect_table_row_count_to_be_between(min_value=1000, max_value=500000)

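# Save the suite to a JSON file.
# Note: by default, expectations that failed against this sample are dropped
# from the saved suite; pass discard_failed_expectations=False to keep them.
df.save_expectation_suite('storm_events_suite.json')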

That JSON file is the artifact. It's human-readable, diffable in a pull request, and runnable against any future delivery of the same dataset.
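To make "human-readable" concrete, here is a minimal sketch that reads a suite file with the standard json module and lists what it contains. The key names (expectations, expectation_type, kwargs) mirror the ones used when inspecting validation results later in this post. The inline sample is a hand-written assumption, not the library's exact output, and the top-level layout can differ between Great Expectations versions. The describe_suite helper is a hypothetical name.

```python
import json
import os
import tempfile

# Hypothetical, trimmed-down suite contents; a real file saved by
# Great Expectations carries additional metadata.
sample_suite = {
    "expectations": [
        {"expectation_type": "expect_column_to_exist",
         "kwargs": {"column": "event_id"}},
        {"expectation_type": "expect_column_values_to_be_between",
         "kwargs": {"column": "magnitude", "min_value": 0,
                    "max_value": 12.0, "mostly": 0.99}},
    ]
}


def describe_suite(path):
    """Return one summary line per expectation in a suite JSON file."""
    with open(path) as f:
        suite = json.load(f)
    lines = []
    for exp in suite.get("expectations", []):
        kwargs = exp.get("kwargs", {})
        # Table-level expectations have no column, so fall back to a label.
        target = kwargs.get("column", "<table>")
        lines.append(f"{exp['expectation_type']} on {target}")
    return lines


# Write the sample to a temporary file so the sketch runs on its own.
with tempfile.TemporaryDirectory() as tmp:
    suite_path = os.path.join(tmp, "storm_events_suite.json")
    with open(suite_path, "w") as f:
        json.dump(sample_suite, f, indent=2)
    summary = describe_suite(suite_path)

for line in summary:
    print(line)
```

Because the file is plain JSON, the same few lines work in a code review script or a CI check without importing Great Expectations at all.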

The mostly Parameter

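Notice mostly=0.99 on the magnitude expectation. The mostly parameter sets the minimum fraction of values that must pass for the expectation to succeed, and it is one of the more practically useful features in Great Expectations. Real-world data is rarely perfectly clean. If you know that roughly 1% of your magnitude values arrive out of range or malformed from the source, you can encode that tolerance directly in the expectation rather than either ignoring the problem or failing the entire pipeline on a handful of edge cases. Missing values are a separate concern: mostly is evaluated against the non-null values, so null handling belongs in an explicit expect_column_values_to_not_be_null check.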

Use mostly intentionally. It's not a way to ignore problems — it's a way to document the known noise floor of your data and catch when it gets worse.
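The arithmetic behind mostly is simple enough to sketch in plain Python. This is an illustration of the idea, not the library's implementation: the function name passes_with_mostly is hypothetical, and the handling of missing values (skipped before the fraction is computed) reflects my understanding of how column-level expectations behave.

```python
import math


def passes_with_mostly(values, min_value, max_value, mostly=1.0):
    """Return (success, fraction_in_range), skipping missing values."""
    present = [
        v for v in values
        if v is not None and not (isinstance(v, float) and math.isnan(v))
    ]
    if not present:
        # Assumption: with nothing to evaluate, the check passes trivially.
        return True, None
    in_range = sum(1 for v in present if min_value <= v <= max_value)
    fraction = in_range / len(present)
    return fraction >= mostly, fraction


# 200 readings with one bad value: 99.5% in range, above the 99% floor.
readings = [1.5] * 199 + [40.0]
ok, frac = passes_with_mostly(readings, 0, 12.0, mostly=0.99)
print(ok, frac)

# Two more bad values push the pass rate below the floor.
worse = readings + [99.0, -3.0]
ok_worse, frac_worse = passes_with_mostly(worse, 0, 12.0, mostly=0.99)
print(ok_worse, frac_worse)
```

The second call is the case mostly exists to catch: the noise was tolerable yesterday and is not tolerable today.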

Running a Saved Suite Against New Data

import great_expectations as ge

# New data arrives — run the saved suite against it
new_df = ge.read_csv('storm_events_2017_q3.csv')
result = new_df.validate(expectation_suite='storm_events_suite.json')

if not result['success']:
    # Log the failure details, then halt ingestion
    for exp_result in result['results']:
        if not exp_result['success']:
            print(f"FAILED: {exp_result['expectation_config']['expectation_type']}")
            print(f"  Column: {exp_result['expectation_config']['kwargs'].get('column')}")
            print(f"  Unexpected count: {exp_result['result'].get('unexpected_count')}")
    # Exit with a non-zero status so downstream steps never see the bad delivery
    raise SystemExit('Validation failed for storm_events_2017_q3.csv')
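If several pipelines share this check, it may be worth wrapping it in a small gate function. The sketch below works on any mapping shaped like the validation result used above (success, results, expectation_config, result). The names DataContractError, collect_failures, and enforce are hypothetical, and fake_result is a hand-written stand-in rather than real library output.

```python
class DataContractError(Exception):
    """Raised when a delivery fails its expectation suite (hypothetical name)."""


def collect_failures(result):
    """Return (expectation_type, column, unexpected_count) per failed check."""
    failures = []
    for exp_result in result["results"]:
        if exp_result["success"]:
            continue
        config = exp_result["expectation_config"]
        failures.append((
            config["expectation_type"],
            config["kwargs"].get("column"),          # None for table-level checks
            exp_result["result"].get("unexpected_count"),
        ))
    return failures


def enforce(result):
    """Raise DataContractError if any expectation failed."""
    failures = collect_failures(result)
    if failures:
        raise DataContractError(f"{len(failures)} expectation(s) failed: {failures}")


# Hand-written stand-in for a validation result, for illustration only.
fake_result = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_to_exist",
                                "kwargs": {"column": "event_id"}},
         "result": {}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null",
                                "kwargs": {"column": "state"}},
         "result": {"unexpected_count": 17}},
    ],
}

failures = collect_failures(fake_result)

try:
    enforce(fake_result)
except DataContractError as err:
    print(f"Halting ingestion: {err}")
```

Raising an exception instead of printing lets the caller decide what "halt" means: fail the job, quarantine the file, or page someone.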

Suite as Source Control Artifact

The JSON suite file belongs in source control alongside the pipeline code that consumes the data. When the data contract changes — a new column arrives, a value set expands, a range shifts — the suite update is a pull request with a reviewable diff, not a tribal knowledge change that lives only in someone's mental model of the data.
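A text diff of the JSON is usually enough for review, but a structural comparison can make the contract change explicit. Here is a minimal sketch, assuming each suite is a dict with an expectations list whose entries carry expectation_type and kwargs. The helper names are hypothetical, and the two suites are hand-written examples of a value set expanding.

```python
import json


def expectation_keys(suite):
    """Map each expectation to a stable string key for set comparison."""
    return {
        json.dumps([e["expectation_type"], e.get("kwargs", {})], sort_keys=True)
        for e in suite.get("expectations", [])
    }


def diff_suites(old, new):
    """Return the expectations added to and removed from a suite."""
    old_keys, new_keys = expectation_keys(old), expectation_keys(new)
    return {
        "added": sorted(new_keys - old_keys),
        "removed": sorted(old_keys - new_keys),
    }


old_suite = {"expectations": [
    {"expectation_type": "expect_column_to_exist",
     "kwargs": {"column": "event_id"}},
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "event_type",
                "value_set": ["Tornado", "Hail"]}},
]}

# Same contract, except the allowed event types now include a new value.
new_suite = {"expectations": [
    {"expectation_type": "expect_column_to_exist",
     "kwargs": {"column": "event_id"}},
    {"expectation_type": "expect_column_values_to_be_in_set",
     "kwargs": {"column": "event_type",
                "value_set": ["Tornado", "Hail", "Wildfire"]}},
]}

changes = diff_suites(old_suite, new_suite)
for kind, items in changes.items():
    for item in items:
        print(kind, item)
```

A changed expectation shows up as one removal plus one addition, which is a reasonable thing to paste into a pull request description.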

This is the same discipline as SSDT database projects with tSQLt test files: the tests live next to the code, they're versioned, they're reviewable, they tell you when behavior changes. Great Expectations applies that discipline to the data layer.
