I've been containerizing everything in the home lab this year — SQL Server, services, pipelines. The natural next step was getting Great Expectations into a container as well, so validation checks run with the same environment everywhere: my local machine, a CI build agent, a Kubernetes job. No "works on my laptop" problems with Python version mismatches or missing ODBC drivers.
It's straightforward after the first time through. Here's the setup.
The Dockerfile
FROM python:3.8-slim
WORKDIR /app
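# System dependencies for building pyodbc (for SQL Server datasources).
# pyodbc is a C++ extension, so g++ is needed when pip builds it from source.
# Note: the connection string used later names "ODBC Driver 17 for SQL Server";
# that driver (Microsoft's msodbcsql17 package) is not installed by this line
# and has to be added from Microsoft's package repository.
RUN apt-get update && apt-get install -y unixodbc-dev gcc g++ && rm -rf /var/lib/apt/lists/*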
# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
# Copy the GE project and validation scripts
COPY great_expectations/ ./great_expectations/
COPY scripts/ ./scripts/
CMD ["python", "scripts/validate.py"]
The requirements.txt:
great_expectations==0.11.9
pandas==1.0.3
sqlalchemy==1.3.16
pyodbc==4.0.30
Running Validation as a Container
For file-based validation, mount the data directory into the container:
docker run \
  -v /data/pipeline_output:/app/data:ro \
  -v /logs/ge:/app/great_expectations/uncommitted:rw \
  my-ge-validator \
  python scripts/validate.py data/storm_events_processed.csv storm_clean_suite
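The command above hands two positional arguments to scripts/validate.py: the data file and the suite name. The script itself isn't shown in this post, so the following is only a sketch of how it might read them with the standard library's argparse. The function name `parse_args` and the argument names `data_path` and `suite_name` are assumptions made for illustration.

```python
import argparse
from pathlib import Path


def parse_args(argv=None):
    """Parse the two positional arguments used in the docker run example.

    Hypothetical helper: the names here are illustrative, not taken
    from the real validate.py.
    """
    parser = argparse.ArgumentParser(
        description="Validate a data file against an expectation suite."
    )
    # Relative paths resolve against WORKDIR (/app) inside the container,
    # so data/storm_events_processed.csv lands in the mounted volume.
    parser.add_argument("data_path", type=Path, help="CSV file to validate")
    parser.add_argument("suite_name", help="name of the expectation suite")
    # argv=None makes argparse fall back to sys.argv[1:].
    return parser.parse_args(argv)
```

Calling `parse_args()` with no arguments at the top of the script picks up whatever followed `python scripts/validate.py` on the command line. If either argument is missing, argparse prints a usage message and exits with a non-zero status.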
For SQL Server datasources, pass credentials as environment variables rather than baking them into the image:
docker run \
  -e SQL_SERVER_HOST=192.168.1.100 \
  -e SQL_SERVER_DB=StormData \
  -e SQL_SERVER_USER=svc_validation \
  -e SQL_SERVER_PASS=secret \
  my-ge-validator \
  python scripts/validate_sql.py dbo.StormEvents storm_sql_suite
And in your validation script, read from environment:
import os
import great_expectations as ge
from great_expectations.datasource import SqlAlchemyDatasource
conn = (
    f"mssql+pyodbc://{os.environ['SQL_SERVER_USER']}:{os.environ['SQL_SERVER_PASS']}"
    f"@{os.environ['SQL_SERVER_HOST']}/{os.environ['SQL_SERVER_DB']}"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
datasource = SqlAlchemyDatasource(name="sql_server", credentials={"url": conn})
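Two things can go wrong with a connection string assembled from environment variables: a variable may be missing (which surfaces as a bare KeyError), and a password containing characters such as `@` or `/` can break the URL. The sketch below is one way to guard against both, using only the standard library. The helper name `build_mssql_url` and the `REQUIRED_VARS` tuple are assumptions of mine, not part of Great Expectations or SQLAlchemy.

```python
import os
from urllib.parse import quote_plus

# The four variables passed with `docker run -e ...` above.
REQUIRED_VARS = (
    "SQL_SERVER_HOST",
    "SQL_SERVER_DB",
    "SQL_SERVER_USER",
    "SQL_SERVER_PASS",
)


def build_mssql_url(env=None):
    """Build the SQLAlchemy URL, failing fast if a variable is missing.

    Hypothetical helper for illustration; `env` defaults to os.environ
    and can be any mapping, which makes the function easy to test.
    """
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    # URL-encode the credentials so characters like '@' or '/' in the
    # password cannot be mistaken for URL delimiters.
    user = quote_plus(env["SQL_SERVER_USER"])
    password = quote_plus(env["SQL_SERVER_PASS"])
    return (
        f"mssql+pyodbc://{user}:{password}"
        f"@{env['SQL_SERVER_HOST']}/{env['SQL_SERVER_DB']}"
        "?driver=ODBC+Driver+17+for+SQL+Server"
    )
```

The returned string can be passed as the `url` credential in the same way as `conn` above. A missing variable then produces one error message naming every variable that needs to be set.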
Running as a Kubernetes Job
For pipeline validation, a Kubernetes Job is a natural fit: it runs to completion, exits cleanly, and the exit code tells you pass or fail. (For recurring runs on a schedule, the same pod template can go inside a CronJob.) Here's the one-off Job:
apiVersion: batch/v1
kind: Job
metadata:
  name: ge-validate-storm-events
spec:
  template:
    spec:
      containers:
        - name: validator
          image: my-ge-validator:latest
          args: ["python", "scripts/validate.py", "data/output.csv", "storm_clean_suite"]
          env:
            - name: SQL_SERVER_PASS
              valueFrom:
                secretKeyRef:
                  name: sql-credentials
                  key: password
          volumeMounts:
            - name: pipeline-output
              mountPath: /app/data
      volumes:
        - name: pipeline-output
          persistentVolumeClaim:
            claimName: pipeline-output-pvc
      restartPolicy: Never
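The Job only reports pass or fail correctly if the validation script exits non-zero when checks fail. A minimal sketch of that last step follows, assuming the validation result has been converted to a plain dict with a top-level `"success"` key. Great Expectations results carry an overall success flag, but how you get it into a dict depends on the version, so treat the input shape as an assumption. The names `exit_code_for` and `finish` are hypothetical.

```python
import sys


def exit_code_for(result):
    """Map a validation result dict to a process exit code.

    Anything other than an explicit True counts as failure, so a
    missing or malformed "success" key never reports a false pass.
    """
    return 0 if result.get("success") is True else 1


def finish(result):
    """Log the outcome to stderr and exit with the matching code."""
    code = exit_code_for(result)
    message = "validation passed" if code == 0 else "validation FAILED"
    print(message, file=sys.stderr)
    sys.exit(code)
```

With `restartPolicy: Never`, a non-zero exit marks the pod as failed, and the Job controller retries with a fresh pod up to `backoffLimit` times (6 by default). Since rerunning against the same data will fail again, setting `backoffLimit: 0` in the Job spec is worth considering.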
Why Containerized Validation Matters
The GE project directory (great_expectations/ with its config, expectation suites, and datasource definitions) travels inside the container. Same suites, same config, same Python version, every run. When validation passes in CI, you know the same image will apply exactly the same checks in production. When it fails, you're not debugging an environment difference; you're debugging a data problem, which is the actual problem you need to solve.