Great Expectations SQLAlchemy Integration: Validating SQL Server Data with Python

The Great Expectations posts so far have focused on pandas DataFrames — CSV files, in-memory data, the data science use case. But most of the data I work with doesn't live in CSVs. It lives in SQL Server, or Azure SQL, or PostgreSQL. And I want to run the same expectation-based validation against it without pulling the entire table into memory first.

The SQLAlchemy integration in Great Expectations makes this possible. You connect to your database, define expectations against the table or query, and GE pushes the validation logic down to the database engine rather than pulling data into Python. For large tables, this is the only practical approach.

Connecting a SQL Server DataSource

Great Expectations uses SQLAlchemy under the hood for all SQL-based datasources, which means anything SQLAlchemy supports — SQL Server, PostgreSQL, MySQL, SQLite — works with the same interface. For SQL Server you need pyodbc and the appropriate ODBC driver:

pip install great_expectations sqlalchemy pyodbc

Setting up the datasource:

import great_expectations as ge
from great_expectations.datasource import SqlAlchemyDatasource

# Connection string for SQL Server
conn_string = (
    "mssql+pyodbc://username:password@server/database"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)

# Or using Windows authentication
conn_string = (
    "mssql+pyodbc://server/database"
    "?driver=ODBC+Driver+17+for+SQL+Server&trusted_connection=yes"
)

datasource = SqlAlchemyDatasource(
    name="sql_server_prod",
    credentials={"url": conn_string}
)

# Get a batch — a table or query to validate against
batch = datasource.get_batch({
    "table": "dbo.StormEvents",
    "schema": "dbo"
})

Running Expectations Against a SQL Table

The expectation API is the same as with pandas — GE translates the assertions into SQL queries rather than pandas operations:

batch.expect_column_to_exist('EventID')
batch.expect_column_values_to_not_be_null('EventID')
batch.expect_column_values_to_not_be_null('EventDate')
batch.expect_column_values_to_be_between('Magnitude',
    min_value=0, max_value=12.0, mostly=0.99)
batch.expect_column_values_to_be_in_set('EventType',
    value_set=['Tornado', 'Hail', 'Flash Flood', 'Thunderstorm Wind'])
batch.expect_table_row_count_to_be_between(
    min_value=10000, max_value=5000000)

result = batch.validate()
print(f"Passed: {result['success']}")

For expect_column_values_to_not_be_null on a SQL table, GE generates something like SELECT COUNT(*) FROM dbo.StormEvents WHERE EventID IS NULL. For value ranges it generates a CASE WHEN count query. The logic runs in SQL Server, not in Python memory — critical when you're validating a 50-million-row table.

Validating a Query Result, Not a Full Table

Often you want to validate data as it comes out of a specific query — after a JOIN, with a filter applied, or from a view. You can pass a query instead of a table name:

batch = datasource.get_batch({
    "query": """
        SELECT s.EventID, s.EventDate, s.Magnitude, s.EventType,
               c.CountyName, c.StateName
        FROM dbo.StormEvents s
        JOIN dbo.Counties c ON s.CountyFIPS = c.FIPS
        WHERE s.EventDate >= '2019-01-01'
    """
})

This is particularly useful for validating the output of a stored procedure or a view before downstream consumers hit it — the same "validate at pipeline boundaries" pattern, applied to SQL instead of files.

Bridging the SQL Server and Python Worlds

For most of the data engineering work I do, data lives in SQL Server and gets processed in Python. Being able to run the same expectation framework against both — using pandas for CSV-based ingestion and SQLAlchemy for database-resident data — means one testing discipline, one tool, one suite format. The JSON expectation suite doesn't care whether the data came from a CSV or a SQL query.

If you've been running tSQLt tests on your SQL Server logic and wanting the same thing for the Python layer of your pipeline, this is the connection. The frameworks are different, but the principle — assert your data behaves as specified, in code, automatically — is exactly the same. As always, I'm here to help.

Read more