The built-in expectation library covers the common cases well: nulls, ranges, value sets, row counts, regex patterns. But real data has domain-specific rules that a general-purpose library can't anticipate. A storm event can't end before it starts. A geocoded latitude and longitude must fall within a valid US state. A FIPS county code must exist in a reference table.
Great Expectations lets you write custom expectations for these domain rules, and they integrate cleanly with the rest of the framework — they serialize into your expectation suite, they produce structured results, and they run alongside the built-ins.
The Custom Expectation Interface
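A custom expectation is a Python method you add to a subclass of ge.dataset.PandasDataset. The framework provides decorators that handle the plumbing (recording the expectation in your suite and standardizing the result format), so you just write the logic. The base expectation decorator used here does not implement the mostly parameter; that comes with the column-map decorators such as MetaPandasDataset.column_map_expectation.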
import re

import great_expectations as ge
from great_expectations.dataset import PandasDataset


class StormDataset(PandasDataset):

    # The decorator takes the names of the method's own arguments so they
    # are recorded with the expectation in the suite.
    @PandasDataset.expectation(["column_end", "column_start"])
    def expect_event_end_not_before_start(self, column_end, column_start):
        """Storm end time must be >= storm start time."""
        violations = self[self[column_end] < self[column_start]]
        total = len(self)
        return {
            "success": len(violations) == 0,
            "result": {
                "unexpected_count": len(violations),
                # Guard against an empty dataset
                "unexpected_percent": len(violations) / total * 100 if total else 0.0,
                "unexpected_values": violations[[column_start, column_end]].head(5).to_dict('records')
            }
        }

    @PandasDataset.expectation(["column"])
    def expect_column_values_to_be_valid_fips(self, column):
        """Values must be formatted as 5-digit US FIPS county codes."""
        # Format check only: exactly five digits. Nulls are skipped.
        fips_pattern = re.compile(r'^\d{5}$')
        valid = self[column].dropna().apply(lambda v: bool(fips_pattern.match(str(v))))
        invalid_count = int((~valid).sum())
        total = len(self)
        return {
            "success": invalid_count == 0,
            "result": {
                "unexpected_count": invalid_count,
                "unexpected_percent": invalid_count / total * 100 if total else 0.0
            }
        }
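The FIPS expectation above is a format check. The stricter rule from the introduction, that a county code must exist in a reference table, needs a lookup as well. Here is a minimal plain-Python sketch of that logic, separate from the framework so it is easy to reason about. The names check_fips and KNOWN_FIPS are hypothetical, and the three codes are just a small sample standing in for a real reference table:

```python
import re

# Hypothetical reference data: in practice this would be loaded from an
# authoritative county list rather than hard-coded.
KNOWN_FIPS = {"01001", "06037", "48201"}

FIPS_FORMAT = re.compile(r"\d{5}")


def check_fips(values, reference=KNOWN_FIPS):
    """Split non-null values into badly formatted and unknown codes."""
    bad_format, unknown = [], []
    for value in values:
        if value is None:
            continue  # nulls are left to a separate not-null expectation
        code = str(value)
        if not FIPS_FORMAT.fullmatch(code):
            bad_format.append(code)
        elif code not in reference:
            unknown.append(code)
    return {
        "success": not bad_format and not unknown,
        "result": {
            "unexpected_count": len(bad_format) + len(unknown),
            "bad_format": bad_format,
            "not_in_reference": unknown,
        },
    }


report = check_fips(["01001", "6037", None, "99999"])
```

Keeping the two failure modes separate is a design choice: a code with a dropped leading zero and a well-formed code that matches no county usually point to different upstream problems.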
Using Custom Expectations in a Suite
import pandas as pd
raw = pd.read_csv('storm_events_2018.csv', parse_dates=['BEGIN_DATE_TIME', 'END_DATE_TIME'])
df = StormDataset(raw)
# Built-in and custom expectations side by side
df.expect_column_values_to_not_be_null('BEGIN_DATE_TIME')
df.expect_column_values_to_not_be_null('END_DATE_TIME')
df.expect_event_end_not_before_start('END_DATE_TIME', 'BEGIN_DATE_TIME')
df.expect_column_values_to_be_valid_fips('CZ_FIPS')
result = df.validate()
print(result['success'])
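Printing the flag is fine for exploration, but in a pipeline you probably want a failed validation to stop the run. A minimal sketch of such a gate follows. It assumes the dictionary-style result used above, with a top-level success flag and a results list whose entries name their expectation under expectation_config; the exact layout depends on your Great Expectations version, and require_valid and ValidationFailed are hypothetical names:

```python
class ValidationFailed(Exception):
    """Raised when a validation result reports at least one failure."""


def require_valid(result):
    """Return quietly on success; raise ValidationFailed otherwise."""
    if result["success"]:
        return
    # Assumed layout: each entry in "results" has its own "success" flag and
    # an "expectation_config" naming the expectation that produced it.
    failed = [
        entry["expectation_config"]["expectation_type"]
        for entry in result["results"]
        if not entry["success"]
    ]
    raise ValidationFailed("failed expectations: " + ", ".join(failed))


# Hand-written stand-in for df.validate() output, for illustration only.
sample_result = {
    "success": False,
    "results": [
        {"success": True,
         "expectation_config": {"expectation_type": "expect_column_values_to_not_be_null"}},
        {"success": False,
         "expectation_config": {"expectation_type": "expect_event_end_not_before_start"}},
    ],
}

try:
    require_valid(sample_result)
    message = None
except ValidationFailed as exc:
    message = str(exc)
```

In a real pipeline, require_valid(result) would sit between validation and the first transformation step, so nothing downstream runs on data that broke a rule.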
Domain Logic Belongs in Expectations, Not Transformation Code
The pattern I've found most useful: any time I write a defensive check in transformation code — if start > end: raise ValueError — I ask whether that check belongs in a custom expectation instead. The answer is usually yes, if:
- The rule applies to the source data, not to logic I'm computing
- A violation should halt ingestion, not be silently handled
- I want the rule to be explicitly documented and visible in the suite
Inline defensive checks are invisible. Custom expectations are documented, tested, and produce structured output when they fail. The first time a custom expectation fires and you can see exactly which rows violated the business rule — and how many — you'll stop putting domain logic in transformation code.
As with any framework extension: keep custom expectations focused on a single rule, name them clearly (expect_event_end_not_before_start is self-documenting), and test them like any other code. If your custom expectation has a bug, it'll pass when it should fail, which is worse than having no expectation at all.
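One way to make that testing cheap is to factor the rule into a plain function and exercise it with rows you know are good and rows you know are bad. The sketch below assumes that approach; end_before_start and the test names are hypothetical, and the rows are made-up fixtures rather than real storm data:

```python
from datetime import datetime


def end_before_start(rows, start_key, end_key):
    """Return the rows whose end timestamp precedes their start timestamp."""
    return [row for row in rows if row[end_key] < row[start_key]]


GOOD = {"begin": datetime(2018, 5, 1, 12), "end": datetime(2018, 5, 1, 14)}
BAD = {"begin": datetime(2018, 5, 2, 9), "end": datetime(2018, 5, 2, 8)}
SAME = {"begin": datetime(2018, 5, 3, 7), "end": datetime(2018, 5, 3, 7)}


def test_passes_on_clean_rows():
    assert end_before_start([GOOD], "begin", "end") == []


def test_flags_reversed_rows():
    # The known-bad fixture is what catches an expectation that always passes.
    assert end_before_start([GOOD, BAD], "begin", "end") == [BAD]


def test_allows_equal_timestamps():
    # The rule is end >= start, so a zero-length event is valid.
    assert end_before_start([SAME], "begin", "end") == []
```

The known-bad fixture matters most: a test suite with only clean rows would stay green even if the comparison were inverted or never ran.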