The Monolithic Pig Script Is a Trap: Separation of Concerns in Big Data Pipelines

Shannon Lowder

15 Jul 2013 — 3 min read

There's a Pig script at one client that is 847 lines long. It ingests raw clickstream data from HDFS, applies session reconstruction logic, calculates funnel conversion rates, joins against a user dimension, filters out internal traffic, handles three different upstream data formats based on a timestamp condition, and writes the final output to a reporting table — all in one file. One engineer wrote it. She left eight months ago.

Nobody on the current team fully understands it. They're afraid to touch it. When it breaks — and it has broken twice this year — they fix the immediate symptom and back away slowly.

That's a monolith. And it's the dominant architecture pattern in data engineering right now.

Why Monolithic Pipelines Happen

It's not laziness. The monolith emerges from a reasonable short-term choice: the fastest way to get data from A to B is to write one script that does everything. Pig and Hive make it easy to chain operations inline. Oozie lets you schedule the whole thing as one action. It works — until it needs to change.

Software engineering ran into the same problem with application code decades ago. The answer was the same principle data teams need to apply now: separation of concerns. One unit does one thing, with clear inputs and outputs. Multiple units compose into a pipeline. Each unit is independently testable, replaceable, and understandable.

What Decomposed Looks Like

Take that 847-line script and break it down by responsibility:

ingest_normalize.pig — reads raw clickstream, handles the three format variants, normalizes to a common schema, writes to a staging table
session_reconstruct.pig — reads from staging, applies session window logic, writes session records
funnel_calculate.hql — joins session records against the user dim, calculates conversion rates, writes to reporting layer
filter_internal.hql — post-processing step that removes rows flagged as internal traffic from the reporting output

In Oozie, this becomes a workflow with four actions, each with its own error path and retry config:

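<workflow-app name="clickstream-pipeline" xmlns="uri:oozie:workflow:0.4">
  <!-- Shared cluster settings, so each action does not repeat them.
       jobTracker and nameNode are assumed to be set in job.properties. -->
  <global>
    <job-tracker>${jobTracker}</job-tracker>
    <name-node>${nameNode}</name-node>
  </global>

  <start to="ingest-normalize"/>

  <!-- Stage 1: normalize the three raw formats into the staging schema.
       Retry values here are illustrative; retry-interval is in minutes. -->
  <action name="ingest-normalize" retry-max="2" retry-interval="5">
    <pig>
      <script>scripts/ingest_normalize.pig</script>
      <param>INPUT=${inputPath}</param>
      <param>OUTPUT=${stagingPath}</param>
    </pig>
    <ok to="session-reconstruct"/>
    <error to="notify-failure"/>
  </action>

  <!-- Stage 2: rebuild sessions from the staging data. -->
  <action name="session-reconstruct" retry-max="2" retry-interval="5">
    <pig>
      <script>scripts/session_reconstruct.pig</script>
      <param>INPUT=${stagingPath}</param>
      <param>OUTPUT=${sessionPath}</param>
    </pig>
    <ok to="funnel-calculate"/>
    <error to="notify-failure"/>
  </action>

  <!-- funnel-calculate and filter-internal actions follow the same pattern;
       the last action transitions to "end" on success -->

  <kill name="notify-failure">
    <message>Pipeline failed at ${wf:lastErrorNode()}: ${wf:errorMessage(wf:lastErrorNode())}</message>
  </kill>
  <end name="end"/>
</workflow-app>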

Now when session-reconstruct breaks, you know exactly which stage failed and why. You can rerun just that stage against corrected input. You can replace the session logic entirely without touching the normalization code.
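To make the stage-boundary idea concrete outside of Oozie, here is a minimal Python sketch of the same shape: a list of named stages, each with one input and one output, and a runner that can start at any stage. Everything in it is hypothetical (the `run_pipeline` helper, the toy stage bodies, the dict standing in for HDFS paths between stages). It illustrates the structure only, not how Oozie itself handles reruns.

```python
def run_pipeline(stages, store, start_at=None):
    """Run stages in order; if start_at is given, skip the earlier stages.

    stages: list of (name, input_key, output_key, fn) tuples
    store:  dict standing in for the HDFS paths between stages (assumption)
    """
    names = [name for name, _, _, _ in stages]
    first = names.index(start_at) if start_at else 0
    for name, src, dst, fn in stages[first:]:
        try:
            store[dst] = fn(store[src])
        except Exception as exc:
            # Report the failing stage by name, like the kill node's message.
            raise RuntimeError(f"stage {name!r} failed: {exc}") from exc
    return store


def ingest_normalize(raw):
    # Toy stand-in: trim whitespace and drop blank lines.
    return [line.strip() for line in raw if line.strip()]


def session_reconstruct(rows):
    # Toy stand-in: collect the URLs seen for each user from "user,url" rows.
    sessions = {}
    for row in rows:
        user, url = row.split(",")
        sessions.setdefault(user, []).append(url)
    return sessions


STAGES = [
    ("ingest-normalize", "raw", "staging", ingest_normalize),
    ("session-reconstruct", "staging", "sessions", session_reconstruct),
]

store = {"raw": [" u1,/home ", "", "u1,/cart", "u2,/home"]}
run_pipeline(STAGES, store)

# Rerun only the second stage against the staging data already written.
run_pipeline(STAGES, store, start_at="session-reconstruct")
```

Because each stage reads from and writes to a named location, swapping out `session_reconstruct` or rerunning it alone never touches the normalization step. That is the property the 847-line script can't give you.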

The Test Benefit

The modular version also makes the testing picture from last month's post actually tractable. You can't meaningfully unit test an 847-line script. You can test a 60-line normalization script: give it the three known input formats, assert it produces the expected normalized output for each. That's three tests you could write in an afternoon.
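Here is roughly what those three tests look like, sketched in Python for brevity. The real step would be the Pig script, exercised through something like PigUnit. The three input layouts, the field names, and the common schema below are all invented for illustration, since this post doesn't describe the actual variants; the point is one test per known format, each asserting the same normalized output.

```python
import json
from datetime import datetime, timezone

# Hypothetical common schema: (user_id, url, epoch_seconds).
# The three input layouts are assumptions made up for this sketch.


def normalize(line):
    """Map one raw clickstream line to (user_id, url, epoch_seconds)."""
    line = line.strip()
    if line.startswith("{"):
        # Variant C (assumed): JSON object with a millisecond timestamp.
        rec = json.loads(line)
        return (rec["user"], rec["url"], rec["ts_ms"] // 1000)
    if "\t" in line:
        # Variant A (assumed): tab-separated, epoch seconds in the last field.
        user_id, url, ts = line.split("\t")
        return (user_id, url, int(ts))
    # Variant B (assumed): comma-separated, ISO-8601 UTC timestamp first.
    ts, user_id, url = line.split(",")
    parsed = datetime.strptime(ts, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)
    return (user_id, url, int(parsed.timestamp()))


# The same event expressed in all three layouts should normalize identically.
EXPECTED = ("u42", "/checkout", 1373846400)


def test_tab_variant():
    assert normalize("u42\t/checkout\t1373846400") == EXPECTED


def test_csv_variant():
    assert normalize("2013-07-15T00:00:00Z,u42,/checkout") == EXPECTED


def test_json_variant():
    line = '{"user": "u42", "url": "/checkout", "ts_ms": 1373846400000}'
    assert normalize(line) == EXPECTED
```

Each test pins down one format variant, so when a fourth upstream format shows up you add one branch and one test instead of rereading the whole pipeline.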

The Common Objection

The pushback I hear: "Breaking it into stages means more HDFS reads and writes between stages. That's slower." Sometimes true. The tradeoff is maintainability, debuggability, and fault isolation — and for most batch workloads running on overnight schedules, the I/O overhead is noise. If you're running a latency-sensitive near-real-time pipeline, you have a different problem and a different conversation. For the 90% of data warehouse pipelines running on daily batch schedules, the monolith is costing you more in engineering time than the stage boundaries cost in I/O.

If you've untangled a monolithic Pig or Hive pipeline and have war stories from the decomposition, I'd genuinely like to hear how it went. As always, I'm here to help.
