Great Expectations and pandas: Validating DataFrames Before They Hit the Database

My data science project this year has me working with NOAA weather data — storm event records going back decades, delivered as CSVs, with all the consistency problems that implies. Field names that changed between years. Magnitude values stored as strings in some vintages and floats in others. Missing state codes. Duplicate event IDs from some sources.

I've been using Great Expectations to gate every ingestion. Here's what that actually looks like in practice.

The Problem With Trusting Upstream Data

Upstream data sources lie. Not intentionally — the NOAA data is government-published and reasonably authoritative — but the format drifts, the schema evolves, and nobody sends a changelog. If you write a pipeline that assumes the schema you saw in 2015 still holds in 2018, you're going to get burned eventually.

The standard response is defensive coding: add type coercions everywhere, wrap column accesses in try/except, write pandas .fillna() calls for every column just in case. This works, but it makes the data problems invisible. Bad data gets silently coerced into something plausible rather than being caught and flagged.

Great Expectations makes the problems visible by moving the assertions to an explicit, runnable layer before your transformation code touches the data.

A Real Ingestion Gate

import great_expectations as ge
import pandas as pd
import sys

def validate_noaa_storm_file(filepath):
    df = ge.read_csv(filepath, low_memory=False)

    # Schema expectations
    for col in ['EVENT_ID', 'STATE', 'YEAR', 'MONTH_NAME', 'EVENT_TYPE',
                'BEGIN_DATE_TIME', 'END_DATE_TIME', 'INJURIES_DIRECT',
                'DEATHS_DIRECT', 'DAMAGE_PROPERTY']:
        df.expect_column_to_exist(col)

    # Critical fields must not be null
    df.expect_column_values_to_not_be_null('EVENT_ID')
    df.expect_column_values_to_not_be_null('STATE')
    df.expect_column_values_to_not_be_null('EVENT_TYPE')

    # EVENT_ID must be unique within a file
    df.expect_column_values_to_be_unique('EVENT_ID')

    # Year must be plausible
    df.expect_column_values_to_be_between('YEAR', min_value=1950, max_value=2030)

    # Injury/death counts must be non-negative integers
    df.expect_column_values_to_be_between('INJURIES_DIRECT', min_value=0)
    df.expect_column_values_to_be_between('DEATHS_DIRECT', min_value=0)

    # Event types must be from the known NOAA set (partial list)
    df.expect_column_values_to_be_in_set('EVENT_TYPE', value_set=[
        'Tornado', 'Hail', 'Flash Flood', 'Thunderstorm Wind',
        'Winter Storm', 'Blizzard', 'Ice Storm', 'Lightning',
        'High Wind', 'Drought', 'Wildfire', 'Excessive Heat'
    ], mostly=0.95)  # allow for new event types NOAA may add

    result = df.validate()
    return result

if __name__ == '__main__':
    filepath = sys.argv[1]
    result = validate_noaa_storm_file(filepath)

    print(f"File: {filepath}")
    print(f"Expectations evaluated: {result['statistics']['evaluated_expectations']}")
    print(f"Passed: {result['statistics']['successful_expectations']}")
    print(f"Failed: {result['statistics']['unsuccessful_expectations']}")

    if not result['success']:
        print("
Failed expectations:")
        for r in result['results']:
            if not r['success']:
                exp_type = r['expectation_config']['expectation_type']
                col = r['expectation_config']['kwargs'].get('column', 'table-level')
                print(f"  [{col}] {exp_type}")
        sys.exit(1)

What This Catches in Practice

Running this against the NOAA storm event files from different years, I've caught:

  • A 2007 vintage file that had EVENT_TYPE using different capitalization ("TORNADO" vs "Tornado")
  • Several files where DAMAGE_PROPERTY was missing entirely
  • Duplicate EVENT_ID values in a handful of files (source error in NOAA's processing)
  • Years where INJURIES_DIRECT contained "K" suffix values ("2K" for 2000) instead of integers

None of these would have raised an error in my transformation code without explicit assertions. They would have silently produced wrong output — wrong join keys, wrong aggregations, wrong model features. The validation gate turns invisible problems into explicit failures with enough detail to diagnose the source.

The Workflow

The pattern I've settled on: validate the raw file → fix or reject → transform → validate the output. The raw validation suite documents what NOAA is supposed to deliver. The output validation suite documents what my transformation is supposed to produce. Both suites live in source control alongside the pipeline code. When the schema changes, updating the suite is a commit with a reviewable diff.

Same discipline as running tSQLt tests before merging a stored procedure change — just applied to the data layer instead of the SQL layer. As always, I'm here to help.

Read more