It Worked in Dev: Why Hadoop Pipeline Deploys Need a Repeatable Process

The dev cluster said the job worked. The UAT cluster said the job worked. Then we deployed to production and the job failed in under four minutes with a ClassNotFoundException that had never appeared in either environment.

What was different? On dev, the team had manually installed a newer version of a shared library directly on the DataNodes. On UAT, someone had done the same thing three months earlier for a different project. Production was on the original version from the cluster setup. The job had been depending on that library the whole time — nobody knew because the deployment process was "copy the JAR, run the job, hope."

CI/CD exists precisely to prevent this failure mode. Data teams are largely ignoring it.

The Manual Deploy Problem

The standard Hadoop pipeline deployment workflow in most shops right now looks roughly like this:

  1. Engineer finishes a change on their local machine or dev cluster.
  2. Engineer SCPs the JAR (or the Hive script, or the Pig file) to a shared location.
  3. Engineer updates the Oozie workflow XML — maybe through the Hue UI, maybe by editing the file directly on HDFS.
  4. Engineer triggers a test run and watches the JobTracker (or, on a YARN cluster, the ResourceManager UI).
  5. Engineer emails the team that the deployment is done.

Every step in that list is manual, undocumented, and environment-specific. Step 3 is the worst: editing workflow configuration directly in HDFS means there's no history, no review, and no way to diff what changed between last Tuesday's version and today's.
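Without version control, the only diff left is between what sits on HDFS right now and whatever copy someone still has locally. A minimal sketch of that check, assuming the hadoop client is on the PATH; the function name workflow_drift is hypothetical, and the example paths follow the layout used elsewhere in this post:

```shell
# Hypothetical helper: compare the workflow.xml deployed on HDFS with a local copy.
# Prints a unified diff (local copy -> deployed copy).
# Exit status: 0 = identical, 1 = the two differ.
workflow_drift() {
    hdfs_file=$1    # e.g. /user/oozie/workflows/daily_agg_prod/workflow.xml
    local_file=$2   # e.g. workflows/daily_agg/workflow.xml
    # `hadoop fs -cat` streams the HDFS file to stdout; diff reads it as "-".
    hadoop fs -cat "$hdfs_file" | diff -u "$local_file" -
}
```

This tells you that the deployed file differs, but not who changed it, when, or why. Those are the answers only a commit history can give.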

What CI/CD Looks Like for Hadoop Pipelines

Jenkins has been a mature, production-grade CI system for years. If your application engineering team is already using it for their deployments, you can use the same instance for data pipelines, provided a build agent has the Hadoop and Oozie command-line clients installed and can reach the cluster. Here's the shape of a minimal pipeline:

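// Jenkinsfile (Groovy pipeline DSL)
node {
    stage('Checkout') {
        // Build from the repository, never from someone's working copy.
        git url: 'git@your-server:data-team/pipeline-jobs.git'
    }

    stage('Build') {
        // Package once; every later stage deploys this same JAR.
        sh 'mvn clean package -DskipTests'
    }

    stage('Test') {
        try {
            // A failing test fails the build, so nothing below runs.
            sh 'mvn test'
        } finally {
            // Publish test results even when the tests fail.
            junit testResults: 'target/surefire-reports/*.xml', allowEmptyResults: true
        }
    }

    stage('Deploy to Dev') {
        sh '''
            hadoop fs -put -f target/pipeline-jobs-1.0.jar /apps/pipeline/dev/
            hadoop fs -put -f workflows/daily_agg/workflow.xml /user/oozie/workflows/daily_agg_dev/
        '''
        // Submit the workflow on dev as a smoke test.
        sh 'oozie job -oozie http://oozie-server:11000/oozie -config job-dev.properties -run'
    }

    stage('Deploy to Prod') {
        // Manual approval gate. Waiting here inside node holds a build executor.
        input 'Deploy to production?'
        // Same JAR from the same workspace; this stage only updates files on HDFS.
        sh '''
            hadoop fs -put -f target/pipeline-jobs-1.0.jar /apps/pipeline/prod/
            hadoop fs -put -f workflows/daily_agg/workflow.xml /user/oozie/workflows/daily_agg_prod/
        '''
    }
}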

That's the skeleton. The critical parts: build runs on a clean checkout, tests must pass before anything gets deployed, and the artifact deployed to production is the same artifact that passed tests — not a locally-rebuilt version.
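The two deploy stages differ only in the environment name, so the shell they run could live in one function that both stages call. A sketch under assumptions: deploy_env is a hypothetical name, and the HDFS layout is the one from the Jenkinsfile above.

```shell
# Hypothetical helper: push one tested JAR and its workflow definition
# to the given environment, using the HDFS layout from the Jenkinsfile.
deploy_env() {
    env_name=$1   # dev or prod
    jar=$2        # path to the JAR that passed tests
    workflow=$3   # path to workflow.xml in the checkout
    case "$env_name" in
        dev|prod) ;;
        *) echo "unknown environment: $env_name" >&2; return 1 ;;
    esac
    # -f overwrites the previous version already on HDFS.
    hadoop fs -put -f "$jar" "/apps/pipeline/$env_name/" &&
    hadoop fs -put -f "$workflow" "/user/oozie/workflows/daily_agg_$env_name/"
}
```

Refusing unknown environment names is deliberate: a typo should stop the deploy, not create a new directory tree on HDFS.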

Why the Same Artifact Matters

The ClassNotFoundException I described at the top came from a rebuild. The engineer built on their local machine (which had the newer library) and pushed the JAR. Because dev and UAT had the same newer library installed by hand, the mismatch wasn't visible until production. If the pipeline had built once in CI, on an agent with only the declared dependencies, and promoted that artifact through environments, the mismatch would have surfaced on the first CI build, not in production at 2 a.m.

Build once, promote everywhere. This is not a new idea. It's just one that most data teams haven't applied yet.
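One way to make "same artifact" something you check, not something you assume, is to record a checksum when the JAR is built and verify it before each promotion. A minimal sketch, assuming GNU coreutils sha256sum is available on the build agent; the function names and the .sha256 sidecar file are illustrative, not part of any tool mentioned here:

```shell
# Hypothetical helper: record the SHA-256 of a freshly built artifact next to it.
record_checksum() {
    artifact=$1
    # Work from the artifact's directory so the sidecar stores a bare filename.
    ( cd "$(dirname "$artifact")" &&
      sha256sum "$(basename "$artifact")" > "$(basename "$artifact").sha256" )
}

# Hypothetical helper: verify an artifact against its sidecar file.
# A non-zero exit status means the file is not the one that was recorded.
verify_checksum() {
    artifact=$1
    ( cd "$(dirname "$artifact")" &&
      sha256sum -c "$(basename "$artifact").sha256" >/dev/null 2>&1 )
}
```

Calling record_checksum right after the build and verify_checksum right before each deploy turns a quietly rebuilt JAR into a failed stage.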

Oozie Configuration Belongs in Git

The other half of the deployment problem is the workflow XML. An Oozie workflow is code — it describes job dependencies, retry logic, error paths, and coordinator schedules. Editing it directly in HDFS (through Hue or otherwise) is the same problem as editing a Hive script on a network drive: no history, no review, no rollback.

Your Oozie workflow files should live in the same Git repository as your JAR source. CI deploys them to HDFS alongside the artifact. If a workflow change causes a failure, git revert and redeploy — same as any other code change.
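Spelled out, "git revert and redeploy" is two commands. A sketch, assuming the workflow file is tracked in the repository and the hadoop client is on the PATH; rollback_workflow and its arguments are hypothetical:

```shell
# Hypothetical helper: revert a bad workflow commit, then push the restored
# file back to HDFS.
rollback_workflow() {
    bad_commit=$1   # commit that introduced the broken workflow change
    workflow=$2     # e.g. workflows/daily_agg/workflow.xml
    hdfs_dir=$3     # e.g. /user/oozie/workflows/daily_agg_prod/
    # --no-edit keeps the auto-generated revert commit message.
    git revert --no-edit "$bad_commit" &&
    hadoop fs -put -f "$workflow" "$hdfs_dir"
}
```

In practice you would probably push the revert and let the Jenkins pipeline do the put, so the rollback goes through the same gate as every other change. The function only shows that nothing more exotic is involved.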

The Objection

I know what you're thinking: "Our pipeline deployments are infrequent. This is overkill." The infrequency is actually the argument for CI/CD, not against it. When you deploy rarely, the manual process atrophies. People forget steps. Environments drift. The next deployment is more likely to fail, not less. An automated pipeline doesn't forget steps and doesn't drift.

If you're running this on Oozie and have a Jenkins setup worth sharing, or you've found a cleaner way to manage Hadoop artifact promotion, reach out. As always, I'm here to help.
