Great Expectations: Data Testing for the Python Era

In 2012 I started writing about tSQLt — a framework that brought unit testing discipline to SQL Server stored procedures. The argument was simple: data logic deserves the same testing rigor as application logic. If you'd write a test for a C# method, you should write one for a stored procedure that transforms the same data.

That argument has since migrated to Python, and a new framework just showed up that makes the same case for the pandas era. It's called Great Expectations, and it does for DataFrame-based data pipelines what tSQLt did for SQL Server: gives you a way to assert that your data behaves the way you expect it to, in code, automatically.

What "Expectations" Means in This Context

An expectation in Great Expectations is a verifiable assertion about your data. Not a comment in a README. Not a mental note. An assertion that runs against actual data and returns a pass/fail result with detail about what failed.

The framework ships with a library of built-in expectations covering the most common data quality checks:

  • expect_column_to_exist — the column is present (catches schema drift)
  • expect_column_values_to_not_be_null — nulls aren't allowed
  • expect_column_values_to_be_between — numeric values fall within a range
  • expect_column_values_to_be_in_set — categorical values are from a known list
  • expect_column_values_to_match_regex — string values conform to a pattern
  • expect_table_row_count_to_be_between — the dataset has a plausible number of rows

Each one wraps a check you'd otherwise write as a one-off script and then forget to run.

The Basic API

Great Expectations wraps a pandas DataFrame with assertion methods. You get a PandasDataset that behaves like a DataFrame but with the expectation library attached:

import great_expectations as ge
import pandas as pd

# Load your data into a GE-aware DataFrame
raw = pd.read_csv('storm_events.csv')
df = ge.from_pandas(raw)

# Assert the columns you depend on actually exist and have valid values
df.expect_column_to_exist('event_id')
df.expect_column_values_to_not_be_null('event_id')
df.expect_column_values_to_not_be_null('event_date')
df.expect_column_values_to_be_between('magnitude', min_value=0, max_value=12.0)
df.expect_column_values_to_be_in_set('event_type',
    value_set=['Tornado', 'Hail', 'Flash Flood', 'Thunderstorm Wind'])

# Get the full validation result
result = df.validate()
print(result['success'])        # True or False
print(result['statistics'])     # counts of passed/failed expectations

The validation result tells you which expectations passed, which failed, and for the failures: how many rows violated the expectation and what the violating values looked like. That's the output you need to actually fix the problem — not just "something is wrong."

Why This Matters More Than It Looks

The patterns I've seen in every data engineering shop I've worked with: upstream sends a CSV. Someone imports it. Three weeks later someone notices the numbers are wrong because a column that was supposed to be an integer started arriving as a string and got silently coerced to NaN. Nobody knows when it started. The audit trail is a shrug.

Great Expectations is the fix. You write the expectations once, you run them at ingestion time, and bad data fails loudly instead of propagating silently. The failure message tells you which column violated which expectation and how many rows were affected. You know immediately — not in three weeks.

The connection to tSQLt is direct: same principle, different runtime. If you've been running tSQLt tests on your SQL Server logic since 2012, Great Expectations is the Python-era version of that habit. The framework is new; the discipline isn't.

I'll be digging into this across the next several posts — expectation suites, validation results, and how to integrate checks into a real pipeline. If you're working with pandas-based data pipelines and not testing your assumptions about the incoming data, start here. As always, I'm here to help.

Read more