Cron Is Not an Orchestrator
The data pipeline starts at 2 a.m. It's a cron job. It reads from a raw events table, transforms it, and writes to a reporting table. The reporting table is what the business sees at 8 a.m.
The upstream job that populates the raw events table runs at 1 a.m. Usually. Last Tuesday it finished at 2:47 a.m. because the upstream vendor delivered files late. Your 2 a.m. job ran on time, found partial data, and produced a report. Nobody knew until the business noticed the numbers were 30% low — and by then, two more jobs had run downstream of the report, compounding the problem.
Cron doesn't know about data. It knows about time. That distinction is the source of almost every silent failure in data pipelines that run on scheduled jobs.
What Cron Actually Does
Cron fires a command at a specified time. That's it. It doesn't know whether the upstream data is ready. It doesn't know whether the previous run succeeded. It doesn't retry intelligently. It doesn't skip if the dependency isn't satisfied. It fires the command at the time you specified and moves on, whatever exit code the command returns.
For simple, independent jobs, that's fine. For pipelines with dependencies — where job B reads the output of job A — it's a maintenance problem disguised as a scheduling solution.
What Dependency-Aware Scheduling Looks Like
Luigi, open-sourced by Spotify in 2012 and now in active use at a growing number of shops, takes a different approach. You define tasks as Python classes with two key methods: requires(), which declares upstream dependencies, and output(), which declares what the task produces. The scheduler builds a dependency graph and only runs a task when all its dependencies have produced their outputs.

    import luigi


    class RawEventsReady(luigi.ExternalTask):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget(f"/data/raw/events/{self.date}/SUCCESS")


    class SessionAggregation(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return RawEventsReady(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"/data/sessions/{self.date}/part-00000")

        def run(self):
            # only runs when RawEventsReady.output() exists
            with self.output().open('w') as out:
                # do the aggregation work
                pass


    class DailyReport(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return SessionAggregation(date=self.date)

        def output(self):
            return luigi.LocalTarget(f"/data/reports/{self.date}/report.csv")

        def run(self):
            # only runs when SessionAggregation.output() exists
            pass

Run luigi --module pipeline DailyReport --date 2015-04-01 and Luigi walks the dependency graph. If RawEventsReady hasn't produced its SUCCESS file, nothing downstream runs. No partial report. No 30%-low numbers.
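To make the "walks the dependency graph" step concrete, here is a toy model of the check in plain Python. It is not Luigi's implementation, and ToyTask and build are hypothetical names invented for this sketch. It only shows the rule described above: a task runs when everything upstream has produced its output, and an external task that hasn't delivered blocks the whole chain.

```python
import os
import tempfile


class ToyTask:
    """Hypothetical stand-in for a task: an output path plus upstream tasks."""

    def __init__(self, name, output_path, requires=(), external=False):
        self.name = name
        self.output_path = output_path
        self.requires = list(requires)
        self.external = external  # produced by someone else; we cannot run it

    def complete(self):
        # "Done" means the declared output exists, nothing more.
        return os.path.exists(self.output_path)

    def run(self):
        os.makedirs(os.path.dirname(self.output_path), exist_ok=True)
        with open(self.output_path, "w") as out:
            out.write(self.name + "\n")


def build(task, ran):
    """Run `task` only if every upstream task is complete. Returns True on success."""
    if task.complete():
        return True
    if task.external:
        return False  # the dependency is missing and we have no way to make it
    for upstream in task.requires:
        if not build(upstream, ran):
            return False
    task.run()
    ran.append(task.name)
    return True


root = tempfile.mkdtemp()
raw = ToyTask("RawEventsReady", os.path.join(root, "raw", "SUCCESS"), external=True)
sessions = ToyTask("SessionAggregation", os.path.join(root, "sessions", "part-00000"), [raw])
report = ToyTask("DailyReport", os.path.join(root, "reports", "report.csv"), [sessions])

ran_before = []
ok_before = build(report, ran_before)  # marker missing: nothing runs

os.makedirs(os.path.dirname(raw.output_path))
open(raw.output_path, "w").close()     # the upstream job delivers

ran_after = []
ok_after = build(report, ran_after)    # now the chain runs in dependency order
```

The first call is the late-vendor Tuesday: the marker isn't there, so the report is not produced at all, which is a loud failure instead of a quiet 30%-low one.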
The SUCCESS File Pattern
Notice the SUCCESS file in RawEventsReady.output(). This is a common pattern: the upstream job writes a sentinel file only when it completes successfully. Luigi checks for that file before running dependents. You can apply the same pattern to HDFS paths — luigi.contrib.hdfs.HdfsTarget — or S3 prefixes. The key is that "data is ready" has an explicit, checkable artifact rather than a time estimate.
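On the producing side, the pattern might look like the sketch below. It uses only the standard library and a local directory. The function names publish_partition and is_ready, and the part-00000 file name, are assumptions for illustration. The point is the ordering: data first, marker last.

```python
import os
import tempfile

MARKER = "SUCCESS"  # same marker name RawEventsReady.output() points at


def publish_partition(partition_dir, rows):
    """Write the data first, then the marker as the very last step."""
    os.makedirs(partition_dir, exist_ok=True)
    with open(os.path.join(partition_dir, "part-00000"), "w") as out:
        for row in rows:
            out.write(row + "\n")
    # If anything above raised, we never get here and no marker appears.
    open(os.path.join(partition_dir, MARKER), "w").close()


def is_ready(partition_dir):
    """Readers ask about the marker, not about the clock or the data file."""
    return os.path.exists(os.path.join(partition_dir, MARKER))


partition = os.path.join(tempfile.mkdtemp(), "2015-04-01")
ready_before = is_ready(partition)
publish_partition(partition, ["event-a", "event-b"])
ready_after = is_ready(partition)
```

A half-written data file with no marker next to it reads as "not ready", which is exactly the state the 2 a.m. job could not detect.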
Idempotency: The Other Thing Cron Gets Wrong
When a cron job fails, the typical response is to re-run it manually. If the job isn't idempotent — if running it twice produces double-counted results, or leaves the output in an inconsistent state — manual reruns create new problems while fixing the original one.
Luigi's output-based model pushes you toward idempotent tasks by design: if the output already exists, the task is considered done and won't re-run. You can delete the output and re-run cleanly. That's a much safer recovery pattern than "SSH in, truncate the table, re-run the script, hope."
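The same recovery pattern can be sketched without Luigi. This is a minimal illustration, not how Luigi implements it. The run_once helper and the .tmp suffix are made up for the example, and the all-or-nothing rename assumes the temp file and the final path sit on the same filesystem.

```python
import os
import tempfile


def run_once(output_path, compute):
    """Skip if the output exists; otherwise write it via a temp file and rename."""
    if os.path.exists(output_path):
        return False  # already done, so a rerun is a no-op
    os.makedirs(os.path.dirname(output_path), exist_ok=True)
    tmp_path = output_path + ".tmp"
    with open(tmp_path, "w") as out:
        out.write(compute())
    # The rename is the publish step: a crash before it leaves no output file.
    os.replace(tmp_path, output_path)
    return True


out_path = os.path.join(tempfile.mkdtemp(), "reports", "report.csv")
first = run_once(out_path, lambda: "sessions,42\n")   # does the work
second = run_once(out_path, lambda: "sessions,42\n")  # skipped, output exists
os.remove(out_path)                                   # the "delete and re-run" recovery
third = run_once(out_path, lambda: "sessions,42\n")   # does the work again, cleanly
```

Running it twice never double-counts, and recovering from a bad run is a delete followed by the same command.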
When Cron Is Still Fine
If your job has no upstream data dependencies, runs independently, and produces output that doesn't feed anything else — cron works. A simple daily backup, a log rotation script, a report that reads from a stable source. The problem is specifically pipelines with data dependencies, where the time a job runs and the time its data is ready are not reliably the same thing.
If your team is on cron and the downstream data dependency problem sounds familiar, Luigi is worth a weekend. It's not a complete orchestration solution, and Airflow (which was open-sourced just last month) is a bigger conversation, but it solves the dependency problem directly and runs anywhere Python runs. As always, I'm here to help.