My data science project this year has me working with NOAA weather data — storm event records going back decades, delivered as CSVs, with all the consistency problems that implies. Field names that changed between years. Magnitude values stored as strings in some vintages and floats in others. Missing state codes. Duplicate event IDs from some sources.
I've been using Great Expectations to gate every ingestion. Here's what that actually looks like in practice.
The Problem With Trusting Upstream Data
Upstream data sources lie. Not intentionally — the NOAA data is government-published and reasonably authoritative — but the format drifts, the schema evolves, and nobody sends a changelog. If you write a pipeline that assumes the schema you saw in 2015 still holds in 2018, you're going to get burned eventually.
The standard response is defensive coding: add type coercions everywhere, wrap column accesses in try/except, write pandas .fillna() calls for every column just in case. This works, but it makes the data problems invisible. Bad data gets silently coerced into something plausible rather than being caught and flagged.
Great Expectations makes the problems visible by moving the assertions to an explicit, runnable layer before your transformation code touches the data.
A Real Ingestion Gate
import great_expectations as ge
import pandas as pd
import sys
def validate_noaa_storm_file(filepath):
df = ge.read_csv(filepath, low_memory=False)
# Schema expectations
for col in ['EVENT_ID', 'STATE', 'YEAR', 'MONTH_NAME', 'EVENT_TYPE',
'BEGIN_DATE_TIME', 'END_DATE_TIME', 'INJURIES_DIRECT',
'DEATHS_DIRECT', 'DAMAGE_PROPERTY']:
df.expect_column_to_exist(col)
# Critical fields must not be null
df.expect_column_values_to_not_be_null('EVENT_ID')
df.expect_column_values_to_not_be_null('STATE')
df.expect_column_values_to_not_be_null('EVENT_TYPE')
# EVENT_ID must be unique within a file
df.expect_column_values_to_be_unique('EVENT_ID')
# Year must be plausible
df.expect_column_values_to_be_between('YEAR', min_value=1950, max_value=2030)
# Injury/death counts must be non-negative integers
df.expect_column_values_to_be_between('INJURIES_DIRECT', min_value=0)
df.expect_column_values_to_be_between('DEATHS_DIRECT', min_value=0)
# Event types must be from the known NOAA set (partial list)
df.expect_column_values_to_be_in_set('EVENT_TYPE', value_set=[
'Tornado', 'Hail', 'Flash Flood', 'Thunderstorm Wind',
'Winter Storm', 'Blizzard', 'Ice Storm', 'Lightning',
'High Wind', 'Drought', 'Wildfire', 'Excessive Heat'
], mostly=0.95) # allow for new event types NOAA may add
result = df.validate()
return result
if __name__ == '__main__':
filepath = sys.argv[1]
result = validate_noaa_storm_file(filepath)
print(f"File: {filepath}")
print(f"Expectations evaluated: {result['statistics']['evaluated_expectations']}")
print(f"Passed: {result['statistics']['successful_expectations']}")
print(f"Failed: {result['statistics']['unsuccessful_expectations']}")
if not result['success']:
print("
Failed expectations:")
for r in result['results']:
if not r['success']:
exp_type = r['expectation_config']['expectation_type']
col = r['expectation_config']['kwargs'].get('column', 'table-level')
print(f" [{col}] {exp_type}")
sys.exit(1)
What This Catches in Practice
Running this against the NOAA storm event files from different years, I've caught:
- A 2007 vintage file that had
EVENT_TYPEusing different capitalization ("TORNADO" vs "Tornado") - Several files where
DAMAGE_PROPERTYwas missing entirely - Duplicate
EVENT_IDvalues in a handful of files (source error in NOAA's processing) - Years where
INJURIES_DIRECTcontained "K" suffix values ("2K" for 2000) instead of integers
None of these would have raised an error in my transformation code without explicit assertions. They would have silently produced wrong output — wrong join keys, wrong aggregations, wrong model features. The validation gate turns invisible problems into explicit failures with enough detail to diagnose the source.
The Workflow
The pattern I've settled on: validate the raw file → fix or reject → transform → validate the output. The raw validation suite documents what NOAA is supposed to deliver. The output validation suite documents what my transformation is supposed to produce. Both suites live in source control alongside the pipeline code. When the schema changes, updating the suite is a commit with a reviewable diff.
Same discipline as running tSQLt tests before merging a stored procedure change — just applied to the data layer instead of the SQL layer. As always, I'm here to help.