Cron Is Not an Orchestrator

Shannon Lowder

15 Apr 2015 — 3 min read

The data pipeline starts at 2 a.m. It's a cron job. It reads from a raw events table, transforms it, and writes to a reporting table. The reporting table is what the business sees at 8 a.m.

The upstream job that populates the raw events table runs at 1 a.m. Usually. Last Tuesday it finished at 2:47 a.m. because the upstream vendor delivered files late. Your 2 a.m. job ran on time, found partial data, and produced a report. Nobody knew until the business noticed the numbers were 30% low — and by then, two more jobs had run downstream of the report, compounding the problem.

Cron doesn't know about data. It knows about time. That distinction is the source of almost every silent failure in data pipelines that run on scheduled jobs.

What Cron Actually Does

Cron fires a command at a specified time. That's it. It doesn't know whether the upstream data is ready. It doesn't know whether the previous run succeeded. It doesn't retry intelligently. It doesn't skip if the dependency isn't satisfied. It fires the command at the time you specified, does nothing with the exit code the command returns, and moves on.

For simple, independent jobs, that's fine. For pipelines with dependencies — where job B reads the output of job A — it's a maintenance problem disguised as a scheduling solution.
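To make that maintenance cost concrete, here is a minimal sketch of the guard a cron-launched script ends up hand-rolling once job B depends on job A. The function names, the SUCCESS marker name, and the directory layout are illustrative assumptions, not part of cron or any library.

```python
import sys
import tempfile
from pathlib import Path


def upstream_ready(marker: Path) -> bool:
    """Return True only if the upstream job left its completion marker."""
    return marker.is_file()


def main(raw_dir: Path) -> int:
    # Hypothetical layout: the upstream job writes raw_dir/SUCCESS when done.
    if not upstream_ready(raw_dir / "SUCCESS"):
        print(f"upstream not ready: {raw_dir}", file=sys.stderr)
        return 1  # non-zero exit; cron itself will not act on it
    # ... transform raw events and write the reporting table here ...
    return 0


if __name__ == "__main__":
    # Demo against a throwaway directory instead of real pipeline paths.
    with tempfile.TemporaryDirectory() as tmp:
        raw = Path(tmp)
        print(main(raw))            # marker missing
        (raw / "SUCCESS").touch()
        print(main(raw))            # marker present
```

Every dependent script needs its own copy of a check like this, plus something to re-run it later. That bookkeeping is what a dependency-aware scheduler takes over.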

What Dependency-Aware Scheduling Looks Like

Luigi, open-sourced by Spotify in 2012 and now in active use at a growing number of shops, takes a different approach. You define tasks as Python classes with two key methods: requires(), which declares upstream dependencies, and output(), which declares what the task produces. The scheduler builds a dependency graph and only runs a task when all its dependencies have produced their outputs.

import luigi

class RawEventsReady(luigi.ExternalTask):
    date = luigi.DateParameter()

    def output(self):
        return luigi.LocalTarget(f"/data/raw/events/{self.date}/SUCCESS")

class SessionAggregation(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return RawEventsReady(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"/data/sessions/{self.date}/part-00000")

    def run(self):
        # only runs when RawEventsReady.output() exists
        with self.output().open('w') as out:
            # do the aggregation work
            pass

class DailyReport(luigi.Task):
    date = luigi.DateParameter()

    def requires(self):
        return SessionAggregation(date=self.date)

    def output(self):
        return luigi.LocalTarget(f"/data/reports/{self.date}/report.csv")

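    def run(self):
        # only runs when SessionAggregation.output() exists
        with self.output().open('w') as out:
            # build the report from self.input(); the task only counts as
            # done once this output file exists
            pass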

Run luigi --module pipeline DailyReport --date 2015-04-01 and Luigi walks the dependency graph (add --local-scheduler if you aren't running the central luigid scheduler). If RawEventsReady hasn't produced its SUCCESS file, nothing downstream runs. No partial report. No 30%-low numbers.
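The scheduling rule itself can be sketched without Luigi at all. This is a toy, single-process resolver, not Luigi's implementation: the Task class, the build function, and the file names are hypothetical. It only mirrors the idea that a task runs when its own output is missing and every requirement's output exists.

```python
import tempfile
from pathlib import Path
from typing import Callable, List, Optional


class Task:
    """Toy task: an output path, upstream tasks, and an optional action."""

    def __init__(self, output: Path, requires: Optional[List["Task"]] = None,
                 action: Optional[Callable[[Path], None]] = None):
        self.output = output
        self.requires = requires or []
        self.action = action  # None marks an external task we cannot run

    def complete(self) -> bool:
        return self.output.exists()


def build(task: Task) -> bool:
    """Depth-first walk; returns True if `task` is complete afterwards."""
    if task.complete():
        return True    # output exists: nothing to do
    if task.action is None:
        return False   # external dependency not delivered yet
    if not all([build(dep) for dep in task.requires]):
        return False   # blocked: do not run on partial inputs
    task.action(task.output)
    return task.complete()


def write_marker(path: Path) -> None:
    path.write_text("done\n")


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        raw = Task(root / "SUCCESS")  # external: no action
        sessions = Task(root / "sessions", [raw], write_marker)
        report = Task(root / "report.csv", [sessions], write_marker)

        print(build(report))   # upstream marker missing
        raw.output.touch()     # the vendor files finally land
        print(build(report))   # whole chain can run now
```

The first call leaves both downstream outputs unwritten, which is exactly the behavior the 2 a.m. cron job lacked.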

The SUCCESS File Pattern

Notice the SUCCESS file in RawEventsReady.output(). This is a common pattern: the upstream job writes a sentinel file only when it completes successfully. Luigi checks for that file before running dependents. You can apply the same pattern to HDFS paths — luigi.contrib.hdfs.HdfsTarget — or S3 prefixes. The key is that "data is ready" has an explicit, checkable artifact rather than a time estimate.
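On the producing side, the pattern only works if the marker is the last thing written. A minimal sketch, assuming a local filesystem; the function name publish_partition and the file layout are hypothetical.

```python
import os
from pathlib import Path
from typing import Iterable


def publish_partition(out_dir: Path, rows: Iterable[str]) -> Path:
    """Write the data file first, then the SUCCESS marker as the final step."""
    out_dir.mkdir(parents=True, exist_ok=True)
    data_file = out_dir / "part-00000"
    tmp_file = out_dir / "part-00000.tmp"

    with tmp_file.open("w") as fh:
        for row in rows:
            fh.write(row + "\n")
        fh.flush()
        os.fsync(fh.fileno())        # push the bytes to disk before renaming
    os.replace(tmp_file, data_file)  # atomic rename on the same filesystem

    marker = out_dir / "SUCCESS"
    marker.touch()                   # readers treat this as "data is ready"
    return marker
```

If the job dies partway through the write, no SUCCESS file appears and downstream tasks stay blocked instead of reading a partial file.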

Idempotency: The Other Thing Cron Gets Wrong

When a cron job fails, the typical response is to re-run it manually. If the job isn't idempotent — if running it twice produces double-counted results, or leaves the output in an inconsistent state — manual reruns create new problems while fixing the original one.

Luigi's output-based model pushes you toward idempotent tasks by design: if the output already exists, the task is considered done and won't re-run. You can delete the output and re-run cleanly. That's a much safer recovery pattern than "SSH in, truncate the table, re-run the script, hope."
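The same recovery pattern can be sketched in plain Python: skip when the output exists, overwrite rather than append, and publish with an atomic rename so a crash never leaves a half-written file that looks complete. The name run_once and its arguments are hypothetical, chosen for illustration; this is not Luigi code.

```python
import os
from pathlib import Path
from typing import Callable, List


def run_once(output: Path, compute: Callable[[], List[str]]) -> bool:
    """Run `compute` only if `output` is missing; return True if work was done."""
    if output.exists():
        return False                    # already done: re-running is a no-op
    output.parent.mkdir(parents=True, exist_ok=True)
    tmp_path = output.parent / (output.name + ".tmp")
    with tmp_path.open("w") as fh:      # "w" truncates leftovers from a crash
        for line in compute():
            fh.write(line + "\n")
    os.replace(tmp_path, output)        # output appears only when complete
    return True
```

Calling run_once twice does the work once. To force a clean rerun, delete the output file and call it again; nothing is double-counted because the file is rebuilt from scratch.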

When Cron Is Still Fine

If your job has no upstream data dependencies, runs independently, and produces output that doesn't feed anything else — cron works. A simple daily backup, a log rotation script, a report that reads from a stable source. The problem is specifically pipelines with data dependencies, where the time a job runs and the time its data is ready are not reliably the same thing.

If your team is on cron and the downstream data dependency problem sounds familiar, Luigi is worth a weekend. It's not a complete orchestration solution, and Airflow (which was open-sourced just last month) is a bigger conversation, but it solves the dependency problem directly and runs anywhere Python runs. As always, I'm here to help.
