Containerizing Data Jobs: What Kubernetes Changes for Pipeline Deployment
Kubernetes 1.8 shipped last month. The container orchestration war is effectively over — Kubernetes won. And while the immediate data engineering story isn't "run all your Spark jobs on Kubernetes starting today," the container model it's standardizing around changes the deployment story for data jobs in ways worth understanding now.
The core problem containers solve for data engineering is the one that's followed us from the on-prem Hadoop days through EMR and into Databricks: environment drift. The job works on the machine you developed it on. It fails in production because of a library version mismatch, a Python version difference, a native dependency that's installed on dev but not prod. Docker's promise — package the code and its complete runtime environment together — is the right answer to this problem, and Kubernetes is becoming the right place to run those packages at scale.
What Containerizing a Spark Job Actually Means
A containerized Spark job is a Docker image that contains: the Spark runtime (or just the driver, depending on your deployment model), your Python or Scala application code, all Python library dependencies (pinned versions), and any native libraries your code requires. The image is built once in CI, pushed to a registry, and deployed by pulling that exact image in every environment. Dev, staging, and production run the same image.
FROM python:3.6-slim
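# Spark 2.2 requires Java 8, so install it explicitly instead of relying on default-jdk.
# Assumes the base image's Debian release still packages openjdk-8. curl is needed for the Spark download below.
# The mkdir works around slim images stripping the man directories that the JDK package expects.
RUN mkdir -p /usr/share/man/man1 && \
    apt-get update && \
    apt-get install -y --no-install-recommends openjdk-8-jdk-headless curl && \
    rm -rf /var/lib/apt/lists/*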
ENV SPARK_VERSION=2.2.0
ENV HADOOP_VERSION=2.7
# Install Spark
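# -f makes curl fail on HTTP errors so a bad download breaks the build instead of producing a corrupt archive
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
    | tar -xz -C /opt/ && \
    mv /opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION} /opt/spark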
ENV SPARK_HOME=/opt/spark
ENV PATH=$SPARK_HOME/bin:$PATH
# Install Python dependencies with pinned versions
COPY requirements.txt /app/requirements.txt
RUN pip install -r /app/requirements.txt
# Copy application code
COPY src/ /app/src/
WORKDIR /app
ENTRYPOINT ["spark-submit", "--master", "local[*]", "src/session_pipeline.py"]
Build this image in CI, run the test suite inside it (the test environment is now identical to the runtime environment), push to ECR or Docker Hub, and deploy by running that exact image.
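The build, test, push sequence can be sketched as a small Python helper that a CI script might call. This is a minimal sketch, not a definitive pipeline: the registry host, the image name, and the use of the git commit SHA as the tag are illustrative choices, and it assumes pytest is listed in requirements.txt and a tests/ directory is copied into the image, neither of which the Dockerfile in this post does by itself.

```python
import subprocess
from typing import List


def ci_steps(registry: str, name: str, git_sha: str) -> List[List[str]]:
    """Return the docker commands for one CI run, in order."""
    # Tag with the commit SHA so every environment pulls the same immutable image.
    image = f"{registry}/{name}:{git_sha[:12]}"
    return [
        ["docker", "build", "-t", image, "."],
        # Override the spark-submit entrypoint to run the tests inside the image.
        # Assumption: pytest is installed and tests/ exists under /app in the image.
        ["docker", "run", "--rm", "--entrypoint", "python", image, "-m", "pytest", "tests/"],
        ["docker", "push", image],
    ]


def run_ci(registry: str, name: str, git_sha: str) -> None:
    """Run the steps in order, stopping at the first failure."""
    for cmd in ci_steps(registry, name, git_sha):
        # check=True raises on a non-zero exit, so a failed test run never pushes.
        subprocess.run(cmd, check=True)
```

Because the push is the last step and every step is checked, an image only reaches the registry after its own test suite has passed inside it.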
Spark on Kubernetes: Where Things Stand
Native Spark on Kubernetes support is in progress upstream and experimental as of late 2017. The SPARK-18278 work is merged to master but hasn't shipped in a stable release yet — Spark 2.3 (early 2018) is expected to include it. Running Spark executors as Kubernetes pods, with the driver coordinating via the Kubernetes API server, is the end state, but it's not production-ready today.
What is production-ready today: running Spark in cluster mode on a fixed cluster (EMR or Databricks), with the Spark job packaged as a Docker container that's invoked by Airflow via a BashOperator or a custom Kubernetes operator. The containerization gives you environment reproducibility; Kubernetes gives you the scheduling and resource isolation layer. The Spark cluster underneath can be your existing infrastructure.
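For the BashOperator route, the task boils down to a `docker run` command line. Here is a hedged sketch of a helper that builds that command. The image tag, the environment variable names, and the `--date` flag are assumptions borrowed from the other examples in this post, and it presumes the Airflow worker can reach a Docker daemon.

```python
import shlex
from typing import Dict


def docker_run_command(image: str, run_date: str, env: Dict[str, str]) -> str:
    """Build the shell command a BashOperator would execute."""
    parts = ["docker", "run", "--rm"]
    # Sort the variables so the generated command is stable between runs.
    for key, value in sorted(env.items()):
        parts += ["-e", f"{key}={value}"]
    # Arguments after the image name are appended to the image's ENTRYPOINT.
    parts += [image, "--date", run_date]
    # Quote each part so values with spaces or shell metacharacters stay intact.
    return " ".join(shlex.quote(p) for p in parts)
```

The returned string can be passed as `bash_command` to a BashOperator. Passing `"{{ ds }}"` as `run_date` also works, since Airflow renders the template before the shell sees the command.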
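# Airflow DAG invoking a containerized Spark job on Kubernetes
from datetime import datetime
from airflow import DAG
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
# Minimal DAG so the snippet is self-contained; dag_id, start_date, and schedule are placeholders
dag = DAG(
    dag_id="session_pipeline",
    start_date=datetime(2017, 10, 1),
    schedule_interval="@daily",
)
run_session_pipeline = KubernetesPodOperator(
    task_id="run_session_pipeline",
    name="session-pipeline",
    namespace="data-jobs",
    # Jinja in image and env_vars is rendered only if your Airflow version lists them in the operator's template_fields
    image="your-registry.com/session-pipeline:{{ var.value.pipeline_version }}",
    # Appended to the image's ENTRYPOINT, so the job receives --date <execution date>
    arguments=["--date", "{{ ds }}"],
    env_vars={
        "AWS_DEFAULT_REGION": "us-east-1",
        "S3_BUCKET": "{{ var.value.data_lake_bucket }}",
    },
    resources={
        "request_memory": "4Gi",
        "request_cpu": "2",
        "limit_memory": "8Gi",
    },
    dag=dag,
)
The Databricks Model vs. the Container Model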
If you're running Databricks, the containerization argument is partially addressed by the Databricks Runtime — the managed runtime is Databricks' way of providing environment consistency. Where containerization adds value even on Databricks is for jobs that don't fit cleanly into the Databricks Notebook or Jobs model: custom Python services, data validation jobs, or lightweight processing that doesn't warrant a full Spark cluster. Running these as Kubernetes pods alongside your Databricks Jobs gives you a consistent deployment model across both.
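As an illustration of the kind of lightweight job that fits a plain pod better than a Spark cluster, here is a minimal sketch of a validation entrypoint. The record fields (`session_id`, `duration_s`), the rules, and the stubbed-out loader are all hypothetical. The point is only that the job takes its date as an argument and reports failure through its exit code, which is what a scheduler inspects to decide whether the pod succeeded.

```python
import argparse
import sys
from typing import Dict, Iterable, List


def validate_sessions(rows: Iterable[Dict]) -> List[str]:
    """Return human-readable problems; an empty list means the batch is valid."""
    problems = []
    for i, row in enumerate(rows):
        if not row.get("session_id"):
            problems.append(f"row {i}: missing session_id")
        duration = row.get("duration_s")
        if not isinstance(duration, (int, float)) or duration < 0:
            problems.append(f"row {i}: invalid duration_s {duration!r}")
    return problems


def main(argv: List[str]) -> int:
    parser = argparse.ArgumentParser(description="Validate one day of session data")
    parser.add_argument("--date", required=True)
    args = parser.parse_args(argv)
    # Hypothetical stand-in: a real job would load the partition for args.date.
    rows = [{"session_id": "abc", "duration_s": 12.5}]
    problems = validate_sessions(rows)
    for problem in problems:
        print(f"{args.date}: {problem}", file=sys.stderr)
    # A non-zero return value, passed to sys.exit, marks the container as failed.
    return 1 if problems else 0
```

In the image, the entrypoint script would call `sys.exit(main(sys.argv[1:]))`, so the same `--date` argument convention works for both Spark and non-Spark jobs.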
Start With the Dockerfile
You don't need Kubernetes to get value from containerizing your Spark jobs. Start with a Dockerfile that captures your exact runtime environment. Run your CI tests inside it. If the job works in the container in CI, the same code and dependencies run in production, so any remaining difference comes from configuration, data, or cluster resources rather than from the environment. That's the foundational guarantee, and it's available today regardless of whether you're ready to run on Kubernetes.
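Capturing the exact environment only works if requirements.txt is actually pinned. One way to enforce that is a small check that CI runs before building the image. This is a minimal sketch, assuming a plain requirements file where every dependency should use `==`. It skips pip option lines such as `-r`, and it will flag URL-style requirements as unpinned, which may or may not be what you want.

```python
from typing import List


def unpinned_requirements(text: str) -> List[str]:
    """Return requirement lines that are not pinned with '=='."""
    unpinned = []
    for raw in text.splitlines():
        # Drop trailing comments and surrounding whitespace.
        line = raw.split("#", 1)[0].strip()
        if not line or line.startswith("-"):
            continue  # blank lines and pip options such as -r or --index-url
        if "==" not in line:
            unpinned.append(line)
    return unpinned
```

A CI step could read requirements.txt, call this function, and exit non-zero when the returned list is not empty, so an unpinned dependency never makes it into an image.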
If you're experimenting with Spark on Kubernetes in late 2017 and hitting the rough edges of the experimental support, I'd like to compare notes. As always, I'm here to help.