Great Expectations and Docker: Portable Data Quality Checks in Containers

I've been containerizing everything in the home lab this year — SQL Server, services, pipelines. The natural next step was getting Great Expectations into a container as well, so validation checks run with the same environment everywhere: my local machine, a CI build agent, a Kubernetes job. No "works on my laptop" problems with Python version mismatches or missing ODBC drivers.

It's straightforward after the first time through. Here's the setup.

The Dockerfile

FROM python:3.8-slim

WORKDIR /app

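# System dependencies for building pyodbc (for SQL Server datasources).
# Note: the "ODBC Driver 17 for SQL Server" named in the connection string
# is Microsoft's msodbcsql17 package, which this step does not install.
RUN apt-get update && apt-get install -y \
    unixodbc-dev \
    gcc \
    && rm -rf /var/lib/apt/lists/*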

# Python dependencies
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

# Copy the GE project and validation scripts
COPY great_expectations/ ./great_expectations/
COPY scripts/ ./scripts/

CMD ["python", "scripts/validate.py"]

The requirements.txt:

great_expectations==0.11.9
pandas==1.0.3
sqlalchemy==1.3.16
pyodbc==4.0.30

Running Validation as a Container

For file-based validation, mount the data directory into the container:

docker run \
    -v /data/pipeline_output:/app/data:ro \
    -v /logs/ge:/app/great_expectations/uncommitted:rw \
    my-ge-validator \
    python scripts/validate.py data/storm_events_processed.csv storm_clean_suite
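The command above passes two positional arguments to scripts/validate.py, which isn't shown in this post. As a rough sketch of how such a script might read them and fail early when the volume isn't mounted: the names `parse_args`, `require_data_file`, and `main` are hypothetical, and the actual Great Expectations call is deliberately left as a comment.

```python
import argparse
import os
import sys


def parse_args(argv):
    """Parse the two positional arguments used in the docker run command."""
    parser = argparse.ArgumentParser(description="Validate a file against a suite.")
    parser.add_argument("data_path", help="path inside the container, e.g. data/file.csv")
    parser.add_argument("suite_name", help="expectation suite to validate against")
    return parser.parse_args(argv)


def require_data_file(path):
    """Return the absolute path, or raise if the file is not visible in the container."""
    if not os.path.isfile(path):
        raise FileNotFoundError(
            f"{path} not found; check that the host directory is mounted with -v"
        )
    return os.path.abspath(path)


def main(argv):
    """Return an exit code rather than exiting, so the function stays testable."""
    args = parse_args(argv)
    try:
        data_file = require_data_file(args.data_path)
    except FileNotFoundError as exc:
        print(exc, file=sys.stderr)
        return 2
    # The Great Expectations call for data_file and args.suite_name would go here.
    print(f"validating {data_file} against {args.suite_name}")
    return 0
```

The script's entry point would then be `sys.exit(main(sys.argv[1:]))`. Checking for the file first means a forgotten or mistyped -v mount produces a one-line message instead of a stack trace from deep inside the validation code.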

For SQL Server datasources, pass credentials as environment variables rather than baking them into the image:

docker run \
    -e SQL_SERVER_HOST=192.168.1.100 \
    -e SQL_SERVER_DB=StormData \
    -e SQL_SERVER_USER=svc_validation \
    -e SQL_SERVER_PASS=secret \
    my-ge-validator \
    python scripts/validate_sql.py dbo.StormEvents storm_sql_suite

And in your validation script, read from environment:

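import os
from urllib.parse import quote_plus

import great_expectations as ge
from great_expectations.datasource import SqlAlchemyDatasource

# URL-encode the credentials so characters like "@" or "/" in a password
# don't break the connection URL.
user = quote_plus(os.environ["SQL_SERVER_USER"])
password = quote_plus(os.environ["SQL_SERVER_PASS"])

conn = (
    f"mssql+pyodbc://{user}:{password}"
    f"@{os.environ['SQL_SERVER_HOST']}/{os.environ['SQL_SERVER_DB']}"
    "?driver=ODBC+Driver+17+for+SQL+Server"
)
datasource = SqlAlchemyDatasource(name="sql_server", credentials={"url": conn})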
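Reading `os.environ[...]` directly raises a bare KeyError for the first missing variable only. A small sketch of an up-front check that reports every missing variable at once might look like this; `REQUIRED_VARS`, `missing_vars`, and `check_environment` are hypothetical names, and the variable list is simply the four names used in the docker run command above.

```python
import os

# The four variables passed with -e in the docker run command.
REQUIRED_VARS = (
    "SQL_SERVER_HOST",
    "SQL_SERVER_DB",
    "SQL_SERVER_USER",
    "SQL_SERVER_PASS",
)


def missing_vars(env=os.environ):
    """Return the names of required variables that are unset or empty."""
    return [name for name in REQUIRED_VARS if not env.get(name)]


def check_environment(env=os.environ):
    """Raise one error naming every missing variable (names only, never values)."""
    missing = missing_vars(env)
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
```

Calling `check_environment()` at the top of the script makes a misconfigured container fail immediately with a readable message. The error lists names only, so the password never ends up in the logs.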

Running as a Kubernetes Job

For pipeline validation, a Kubernetes Job is a natural fit: it runs to completion, and the container's exit code tells you pass or fail. A Job runs once when created, so to run it on a schedule, wrap the same pod template in a CronJob:

apiVersion: batch/v1
kind: Job
metadata:
  name: ge-validate-storm-events
spec:
  template:
    spec:
      containers:
      - name: validator
        image: my-ge-validator:latest
        args: ["python", "scripts/validate.py", "data/output.csv", "storm_clean_suite"]
        env:
        - name: SQL_SERVER_PASS
          valueFrom:
            secretKeyRef:
              name: sql-credentials
              key: password
        volumeMounts:
        - name: pipeline-output
          mountPath: /app/data
      volumes:
      - name: pipeline-output
        persistentVolumeClaim:
          claimName: pipeline-output-pvc
      restartPolicy: Never
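The Job only learns pass or fail from the container's exit code, so the validation script has to set one explicitly. Here is a minimal sketch of that mapping, assuming the validation step can be wrapped in a callable that returns a dict-like result with a "success" key. The `run` function and the three-way exit code convention are illustrative choices, not something Great Expectations or Kubernetes prescribes.

```python
import sys


def run(validate):
    """Call validate() and translate its outcome into an exit code.

    0: the data passed; 1: the data failed validation; 2: the check itself crashed.
    """
    try:
        result = validate()
    except Exception as exc:
        print(f"validation could not run: {exc}", file=sys.stderr)
        return 2
    # Anything other than an explicit True counts as a failure.
    return 0 if result.get("success") is True else 1
```

In the script this would end with something like `sys.exit(run(my_validation))`, where `my_validation` is whatever function performs the check. Kubernetes treats any nonzero code as a failed pod, and keeping "bad data" (1) separate from "broken check" (2) makes the pod status easier to interpret.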

Why Containerized Validation Matters

The GE project directory (great_expectations/, with its config, expectation suites, and datasource definitions) travels inside the container. Same suites, same config, same Python version, every run. When validation passes in CI, you know the same image will apply the same checks the same way in production. When it fails, you're not debugging an environment difference; you're debugging a data problem, which is the actual problem you need to solve. As always, I'm here to help.
