Start With What You Want It To Do
Every data project I have ever been on started with someone telling me what they needed the system to do. Not how — what. "We need to load customer orders from the source system into the warehouse by 6 a.m. every day." "We need to flag records that failed validation so the ops team can review them." "We need to be able to reprocess any day's data if the source sends a correction."
That is the outcome. The system that delivers it does not exist yet. Your job is to decompose it into pieces small enough to build.
This is the reverse of the last two posts. In those, we built up from small reliable units. Now we are going to start from the top and work down. It is the same instinct behind breaking an epic into user stories and user stories into tasks, applied directly to a data pipeline project.
Start With the Outcome Statement
Write it down in plain language. One sentence. No implementation detail.
Load validated customer orders from SourceSystem into Warehouse.Orders daily by 6 a.m., with the ability to reprocess any prior date on demand.
That sentence is your epic. Everything else is decomposition.
Break It Into Stories
Stories answer: what does the system need to be able to do? Not how it does it — what it does.
- Extract raw orders from SourceSystem for a given date
- Validate that extracted orders meet quality rules
- Transform validated orders into the Warehouse schema
- Load transformed orders into Warehouse.Orders
- Run the full pipeline for a scheduled date
- Reprocess any prior date without duplicating records
- Surface failures with enough context to diagnose them
Seven stories. Each one is a capability the finished system must have. Notice that none of them say "stored procedure" or "SSIS package" or "scheduled job." The implementation is not the story.
Break Each Story Into Tasks
Tasks answer: what do you need to build to deliver that story? Now the implementation enters.
Take "Extract raw orders from SourceSystem for a given date":
- Create Staging.RawOrders table with the source schema plus an ExtractedAt audit column
- Write dbo.ExtractRawOrders @BatchDate DATE: inserts from SourceSystem, filters deleted records
- Test: run for a known date, verify row count matches source
Three tasks. The first two are buildable in an afternoon. The third tells you when you are done.
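Here is a minimal sketch of those three tasks. It uses Python's built-in sqlite3 with an in-memory database as a stand-in so it runs anywhere; the real version would be T-SQL against Staging.RawOrders. The SourceOrders table, its columns, the IsDeleted flag, and the BatchDate column on staging are assumptions made for illustration, not part of any real source schema.

```python
import sqlite3

def create_tables(conn):
    # Hypothetical stand-in for SourceSystem, with a soft-delete flag.
    conn.execute("""CREATE TABLE SourceOrders (
        OrderId INTEGER, OrderDate TEXT, Amount REAL, IsDeleted INTEGER)""")
    # Task 1: staging table = source columns plus an ExtractedAt audit column.
    # BatchDate is an assumed extra column recording which run loaded the row.
    conn.execute("""CREATE TABLE StagingRawOrders (
        OrderId INTEGER, OrderDate TEXT, Amount REAL,
        BatchDate TEXT, ExtractedAt TEXT DEFAULT CURRENT_TIMESTAMP)""")

def extract_raw_orders(conn, batch_date):
    # Task 2: copy one date's orders from the source, skipping deleted records.
    with conn:  # commits on success, rolls back on error
        conn.execute("""
            INSERT INTO StagingRawOrders (OrderId, OrderDate, Amount, BatchDate)
            SELECT OrderId, OrderDate, Amount, ?
            FROM SourceOrders
            WHERE OrderDate = ? AND IsDeleted = 0""", (batch_date, batch_date))

def row_counts(conn, batch_date):
    # Task 3: compare source and staging counts for a known date.
    source = conn.execute(
        "SELECT COUNT(*) FROM SourceOrders WHERE OrderDate = ? AND IsDeleted = 0",
        (batch_date,)).fetchone()[0]
    staged = conn.execute(
        "SELECT COUNT(*) FROM StagingRawOrders WHERE BatchDate = ?",
        (batch_date,)).fetchone()[0]
    return source, staged

conn = sqlite3.connect(":memory:")
create_tables(conn)
conn.executemany("INSERT INTO SourceOrders VALUES (?, ?, ?, ?)", [
    (1, "2024-01-15", 10.0, 0),
    (2, "2024-01-15", 25.5, 0),
    (3, "2024-01-15", 7.0, 1),   # deleted at source, should be filtered out
    (4, "2024-01-16", 99.0, 0),  # different date, should be ignored
])
extract_raw_orders(conn, "2024-01-15")
source_count, staged_count = row_counts(conn, "2024-01-15")
print(source_count, staged_count)
```

The third function is the "done" check from the task list: if the two counts differ for a date you know, the extract is not finished.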
Do this for every story. You now have a complete task list for the pipeline, derived directly from the outcome statement — not from someone's intuition about what the system should look like.
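If it helps to see the backlog as data, here is one way to hold it. This is a sketch, not a tool recommendation: the BACKLOG structure and the two helper functions are hypothetical names of mine. Only the tasks this post actually spells out are filled in, so the stories that still need decomposing show up as empty lists.

```python
# The epic: the outcome statement, copied from above.
EPIC = ("Load validated customer orders from SourceSystem into Warehouse.Orders "
        "daily by 6 a.m., with the ability to reprocess any prior date on demand.")

# Each story maps to the tasks that deliver it.
BACKLOG = {
    "Extract raw orders from SourceSystem for a given date": [
        "Create Staging.RawOrders table with the source schema plus an ExtractedAt audit column",
        "Write dbo.ExtractRawOrders @BatchDate DATE",
        "Test: run for a known date, verify row count matches source",
    ],
    "Validate that extracted orders meet quality rules": [],
    "Transform validated orders into the Warehouse schema": [],
    "Load transformed orders into Warehouse.Orders": [],
    "Run the full pipeline for a scheduled date": [],
    "Reprocess any prior date without duplicating records": [
        "Add a DELETE FROM Staging.RawOrders WHERE BatchDate = @BatchDate "
        "at the top of ExtractRawOrders",
    ],
    "Surface failures with enough context to diagnose them": [],
}

def task_list(backlog):
    # Flatten to (story, task) pairs: the complete task list for the pipeline.
    return [(story, task) for story, tasks in backlog.items() for task in tasks]

def undecomposed(backlog):
    # Stories that have no tasks yet, i.e. work you have not sized.
    return [story for story, tasks in backlog.items() if not tasks]

print(EPIC)
for story, task in task_list(BACKLOG):
    print(f"{story}\n    {task}")
print("Still to decompose:", len(undecomposed(BACKLOG)))
```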
The Shape You End Up With
When you map those tasks back to code, you will notice something: you have naturally arrived at the layered structure from the previous two posts. Extract, validate, transform, load — each is a unit with one job. The pipeline coordinator — the story "Run the full pipeline for a scheduled date" — is the layer that calls them in sequence.
This is not a coincidence. The decomposition from the top and the composition from the bottom converge on the same shape, because both are following the same principle: one thing, done well, with a clear boundary.
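To make the shape concrete, here is a rough sketch of four units and a coordinator. Everything in it is a stand-in: the in-memory SOURCE and WAREHOUSE, the column names, and the "amount must be positive" rule are assumptions chosen to keep the example short, not the real quality rules or schema.

```python
from dataclasses import dataclass

@dataclass
class PipelineResult:
    batch_date: str
    loaded: int
    rejected: int

# Hypothetical in-memory stand-ins for SourceSystem and Warehouse.Orders.
SOURCE = {"2024-01-15": [{"order_id": 1, "amount": "10.50"},
                         {"order_id": 2, "amount": "-3.00"}]}
WAREHOUSE = []

def extract(batch_date):
    # One job: fetch the raw rows for a date.
    return list(SOURCE.get(batch_date, []))

def validate(rows):
    # One job: split rows into valid and rejected. The rule is illustrative.
    valid = [r for r in rows if float(r["amount"]) > 0]
    rejected = [r for r in rows if float(r["amount"]) <= 0]
    return valid, rejected

def transform(rows, batch_date):
    # One job: reshape raw rows into the (assumed) warehouse schema.
    return [{"OrderId": r["order_id"], "Amount": float(r["amount"]),
             "BatchDate": batch_date} for r in rows]

def load(rows):
    # One job: write rows to the warehouse and report how many.
    WAREHOUSE.extend(rows)
    return len(rows)

def run_pipeline(batch_date):
    # The coordinator holds no business logic. It only sequences the units.
    raw = extract(batch_date)
    valid, rejected = validate(raw)
    shaped = transform(valid, batch_date)
    loaded = load(shaped)
    return PipelineResult(batch_date, loaded, len(rejected))

result = run_pipeline("2024-01-15")
print(result)
```

Each function maps to one story, and run_pipeline is the "Run the full pipeline for a scheduled date" story. Swapping any unit for a stored procedure call would not change the coordinator.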
Reprocessing Is a Story, Not an Afterthought
Notice that "Reprocess any prior date without duplicating records" is a first-class story, not a feature someone asks for six months after launch. Naming it early forces the design question: how does ExtractRawOrders behave when rows for that date already exist in staging?
Answer it now, in a task:
- Add a DELETE FROM Staging.RawOrders WHERE BatchDate = @BatchDate at the top of ExtractRawOrders so reruns are idempotent
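One task. It makes the extract safe to rerun and does not change any other unit. The load into Warehouse.Orders needs an equivalent task before the story holds end to end.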
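The delete-then-insert pattern is easy to try out. This sketch again uses sqlite3 in memory as a stand-in for SQL Server, and it assumes staging has a BatchDate column to delete on; the table layouts and sample rows are made up. The delete and the insert share one transaction, so a failed rerun should not leave the date empty.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical source and staging tables, trimmed to the columns the example needs.
conn.execute("CREATE TABLE SourceOrders (OrderId INTEGER, OrderDate TEXT, Amount REAL)")
conn.execute("CREATE TABLE StagingRawOrders (OrderId INTEGER, Amount REAL, BatchDate TEXT)")
conn.executemany("INSERT INTO SourceOrders VALUES (?, ?, ?)",
                 [(1, "2024-01-15", 10.0), (2, "2024-01-15", 25.5)])
conn.commit()

def extract_raw_orders(conn, batch_date):
    with conn:  # one transaction: commit both statements or roll both back
        # Clear any rows from a previous run of this date first.
        conn.execute("DELETE FROM StagingRawOrders WHERE BatchDate = ?", (batch_date,))
        conn.execute("""
            INSERT INTO StagingRawOrders (OrderId, Amount, BatchDate)
            SELECT OrderId, Amount, OrderDate
            FROM SourceOrders
            WHERE OrderDate = ?""", (batch_date,))

def staged_count(conn, batch_date):
    return conn.execute("SELECT COUNT(*) FROM StagingRawOrders WHERE BatchDate = ?",
                        (batch_date,)).fetchone()[0]

extract_raw_orders(conn, "2024-01-15")
first_run = staged_count(conn, "2024-01-15")

# The source sends a correction, so we reprocess the same date.
conn.execute("UPDATE SourceOrders SET Amount = 30.0 WHERE OrderId = 2")
conn.commit()
extract_raw_orders(conn, "2024-01-15")
second_run = staged_count(conn, "2024-01-15")

print(first_run, second_run)
```

The count is the same after the rerun, and staging holds the corrected amount rather than both versions.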
The Gotcha: Stories That Are Already Tasks
The decomposition breaks down when stories are written at the wrong altitude. "Write a stored procedure to extract orders" is a task masquerading as a story. It describes implementation, not capability.
If you find yourself writing stories that already specify a table name, a procedure name, or a technology, you have skipped a level. Back up. What does the system need to be able to do? Write that first. The procedure is a task that delivers the story.
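A crude way to catch this in review is a keyword check over the story list. The sketch below is only that: the term list comes from the examples in this post, the function name is hypothetical, and a substring match will miss plenty (it cannot tell a table name from a business term). Treat a hit as a prompt to reread the story, not a verdict.

```python
# Implementation terms taken from the examples in this post; extend for your own stack.
IMPLEMENTATION_TERMS = ("stored procedure", "ssis package", "scheduled job")

def wrong_altitude(story):
    """Return the implementation terms a story mentions (empty list = looks fine)."""
    text = story.lower()
    return [term for term in IMPLEMENTATION_TERMS if term in text]

stories = [
    "Extract raw orders from SourceSystem for a given date",
    "Run the full pipeline for a scheduled date",
    "Write a stored procedure to extract orders",
]

for story in stories:
    hits = wrong_altitude(story)
    label = "task?" if hits else "ok"
    print(f"{label:6}{story} {hits}")
```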
Keeping the altitude right is what lets you swap implementation without rewriting requirements — and in data engineering, you will swap implementation more than you expect.
The Series in One Paragraph
Build small, reliable units. Coordinate them with layers. Start from the outcome you need and decompose down to the tasks that deliver it. The top-down decomposition and the bottom-up composition arrive at the same architecture, because they are both expressions of the same idea: one thing, done well, with a clear boundary. That is the whole game.
If you have been using this kind of decomposition on your data projects and found a way to make it stick with your team, I would love to hear how. As always, I am here to help.