Great Expectations in a Data Science Workflow: Catching Bad Input Before Your Model Sees It

Machine learning models have a talent for finding patterns in garbage data. Give a model training data with a systematic error and it will dutifully learn to predict based on that error. The model validates fine. It performs great in testing. It ships to production. Then the underlying data error gets fixed, and suddenly the model's predictions are wrong in a way nobody can explain.

This is the "garbage in, garbage out" problem at its worst: not the case where your model gets junk and produces obvious nonsense, but the subtle case where the junk is consistent enough to teach the model the wrong thing.

Great Expectations is a good defense, as long as you apply it early: not after training, not at inference, but at data ingestion and feature engineering, before the model ever sees the data.

The Three Points to Validate in an ML Pipeline

1. Raw input validation. Does the incoming data match the schema and value ranges you expect? This catches upstream changes before they corrupt your feature engineering.

2. Feature validation. After feature engineering, do your computed features have the statistical properties you'd expect? Are there feature values that shouldn't be mathematically possible? Are distributions dramatically different from your training baseline?

3. Training data consistency. Is the dataset you're about to train on representative of what the model will see in production? This is trickier — but row count, class balance, and key feature distributions are all checkable with expectations.
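The first and second checks can be sketched without any library at all, which helps show what the expectations are actually asserting. The snippet below is a minimal illustration in plain Python, not Great Expectations code: the function names, the example schema, and the 3-standard-deviation drift threshold are all assumptions chosen for the example.

```python
from statistics import mean, stdev

# Hypothetical schema: column name -> expected Python type.
# Integer-valued inputs would be flagged here; widen the types if needed.
EXPECTED_SCHEMA = {"duration_minutes": float, "latitude": float, "longitude": float}


def schema_problems(rows, schema=EXPECTED_SCHEMA):
    """Return a list of human-readable schema violations (empty if none)."""
    problems = []
    for i, row in enumerate(rows):
        for column, expected_type in schema.items():
            if column not in row:
                problems.append(f"row {i}: missing column {column!r}")
            elif not isinstance(row[column], expected_type):
                problems.append(
                    f"row {i}: {column!r} is {type(row[column]).__name__}, "
                    f"expected {expected_type.__name__}"
                )
    return problems


def mean_drift(baseline, current):
    """Shift of the current mean from the baseline mean, in baseline std devs."""
    spread = stdev(baseline)
    if spread == 0:
        raise ValueError("baseline has no variance; drift is undefined")
    return abs(mean(current) - mean(baseline)) / spread


def has_drifted(baseline, current, threshold=3.0):
    """Flag a feature whose mean moved more than `threshold` baseline std devs."""
    return mean_drift(baseline, current) > threshold
```

A mean-shift test like this is deliberately crude. It will miss changes in spread or shape, so treat it as a first tripwire rather than a full distribution comparison.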

A Real Example: Hail Prediction Features

For a hail prediction model I'm building on NOAA storm event data, the feature engineering step computes things like storm cell duration, geographic density of events, and radar-derived hail size estimates. Each of these has domain constraints:

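# Uses the legacy Pandas dataset API (ge.from_pandas), which is not
# available in the 1.x releases of Great Expectations
import great_expectations as ge

def validate_hail_features(features_df):
    df = ge.from_pandas(features_df)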

    # Storm duration must be non-negative (can't have a negative-duration storm)
    df.expect_column_values_to_be_between('duration_minutes',
        min_value=0, max_value=1440)  # max: 24 hours

    # Hail size in inches — physical constraint
    df.expect_column_values_to_be_between('max_hail_size_inches',
        min_value=0.25, max_value=8.0, mostly=0.99)

    # Geographic coordinates must be valid
    df.expect_column_values_to_be_between('latitude',
        min_value=24.0, max_value=50.0)   # continental US
    df.expect_column_values_to_be_between('longitude',
        min_value=-125.0, max_value=-66.0)

    # Target variable must be binary
    df.expect_column_values_to_be_in_set('is_significant_hail',
        value_set=[0, 1, True, False])

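    # Class balance check: if more than 95% of labels are one class, the model
    # will learn to ignore the minority class. For a 0/1 label the column mean
    # is the positive-class rate. (The proportion of unique values is only
    # distinct values / rows, which says nothing about balance.)
    df.expect_column_mean_to_be_between(
        'is_significant_hail', min_value=0.05, max_value=0.95)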

    result = df.validate()
    return result

That class balance check is the one that's saved me the most trouble. I once built a 6-month storm event dataset that turned out to have only 2% significant hail events — well below what I expected from historical rates. Turned out a date filter bug was excluding most of the hail season. The model would have trained on a badly skewed dataset and I wouldn't have noticed until the predictions were obviously wrong in production.
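To make that failure mode concrete, here is a small plain-Python sketch of how a date filter bug shifts the positive-class rate. The event records, the months involved, and the 5% floor are invented for illustration; they are not NOAA figures.

```python
def positive_rate(labels):
    """Share of positive labels; for 0/1 labels this equals the column mean."""
    labels = list(labels)
    if not labels:
        raise ValueError("no labels to check")
    return sum(1 for label in labels if label) / len(labels)


def class_balance_ok(labels, min_rate=0.05, max_rate=0.95):
    """True when neither class makes up more than 95% of the labels."""
    return min_rate <= positive_rate(labels) <= max_rate


# Toy events as (month, is_significant_hail): 10 events per month, with
# 3 significant hail events in each of April, May, and June. Invented data.
events = [
    (month, 1 if month in (4, 5, 6) and i < 3 else 0)
    for month in range(1, 13)
    for i in range(10)
]

all_labels = [label for _, label in events]

# A buggy date filter that silently drops May and June.
buggy_labels = [label for month, label in events if month not in (5, 6)]

print(positive_rate(all_labels), class_balance_ok(all_labels))
print(positive_rate(buggy_labels), class_balance_ok(buggy_labels))
```

In this toy setup the full dataset has a 7.5% positive rate and passes, while the filtered one drops to 3% and fails the check.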

Gating the Training Run

import logging

logger = logging.getLogger(__name__)

# load_raw_storm_data, validate_noaa_input, engineer_features, and
# train_model are project helpers assumed to be defined elsewhere
def run_training_pipeline(raw_data_path, model_output_path):
    raw_df = load_raw_storm_data(raw_data_path)

    # Gate 1: raw data quality
    raw_result = validate_noaa_input(raw_df)
    if not raw_result['success']:
        # Log the full result so the failing expectations are on record
        logger.error("Raw data validation failed: %s", raw_result)
        raise RuntimeError("Raw data failed quality checks; aborting training")

    features_df = engineer_features(raw_df)

    # Gate 2: feature quality
    feature_result = validate_hail_features(features_df)
    if not feature_result['success']:
        logger.error("Feature validation failed: %s", feature_result)
        raise RuntimeError("Feature engineering produced invalid output; check transforms")

    # Only train if data passes both gates
    train_model(features_df, model_output_path)

Two explicit gates, both logged, both blocking. If either fails, you get a clear error and the full validation result in the log, rather than a model that silently learned the wrong thing.
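One way to put the validation details directly into the error message is to summarize the failed expectations before raising. The sketch below assumes the validation result has been converted to a plain dict with the `success`, `results`, `expectation_config`, and `result` keys that appear in Great Expectations' serialized results. The helper names `summarize_failures` and `require_success` are hypothetical, and the exact result layout varies between releases, so check it against your installed version.

```python
def summarize_failures(result):
    """Return one line per failed expectation in a validation result dict."""
    lines = []
    for item in result.get("results", []):
        if item.get("success"):
            continue
        config = item.get("expectation_config", {})
        column = config.get("kwargs", {}).get("column", "<table>")
        unexpected = item.get("result", {}).get("unexpected_count")
        line = f"{config.get('expectation_type', 'unknown')} on {column}"
        if unexpected is not None:
            line += f" ({unexpected} unexpected values)"
        lines.append(line)
    return lines


def require_success(result, stage):
    """Raise with the failed expectations listed if validation did not pass."""
    if not result.get("success"):
        details = "; ".join(summarize_failures(result)) or "no details available"
        raise RuntimeError(f"{stage} failed validation: {details}")
```

With a helper like this, each gate collapses to a single call such as `require_success(result_dict, "raw data")`, and the exception names the expectations that failed.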

The investment is low: writing the expectation suite takes about an hour. The payoff is avoiding the situation where you deploy a model trained on corrupted data and then spend three days figuring out why production predictions don't match your test metrics.
