Airflow Just Shipped: Why a DAG Is the Right Mental Model for Data Pipelines
Airflow was open-sourced by Airbnb in June 2015. I've been running it internally for a few weeks now and I want to explain the core idea, because it's easy to look at the setup overhead and dismiss it as "Luigi with a web UI." It's not. The underlying mental model is different, and that difference is what makes it worth the investment.
The mental model is the DAG — the directed acyclic graph. Your entire pipeline is a graph of tasks with dependency edges between them. The scheduler reads the graph, determines which tasks are ready to run (all their upstream dependencies are satisfied), and executes them. This is not a new idea in computer science. It is a new idea in data pipeline tooling for most teams.
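To make the "ready to run" rule concrete, here is a minimal sketch in plain Python. It is not Airflow's scheduler code; the dict layout and the ready_tasks helper are assumptions made for illustration, and the task names are borrowed from the example DAG later in this post.

```python
# Illustrative only: a plain-dict model of a DAG, not Airflow's internals.
def ready_tasks(upstream, done):
    """Return unfinished tasks whose upstream tasks have all finished."""
    return sorted(
        task for task, deps in upstream.items()
        if task not in done and all(dep in done for dep in deps)
    )


# Each task maps to the list of tasks it depends on.
upstream = {
    "check_upstream_data": [],
    "run_session_aggregation": ["check_upstream_data"],
    "write_daily_report": ["run_session_aggregation"],
}

print(ready_tasks(upstream, done=set()))
# ['check_upstream_data']
print(ready_tasks(upstream, done={"check_upstream_data"}))
# ['run_session_aggregation']
```

A real scheduler also has to track task state, retries, and schedules, but the readiness check itself is this simple.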
What Makes Airflow Different from Luigi
Luigi, covered in a post a few months back, also uses a dependency graph. The key differences with Airflow:
- DAGs are time-aware. Airflow has a concept of execution_date, the logical date a DAG run represents. This means backfilling is built in: you can tell Airflow to run the DAG for every day from January to June and it will do it systematically, maintaining the dependency graph across all those historical runs.
- DAGs are code. An Airflow DAG is a Python file. The scheduler reads Python files from a dags/ directory. The DAG definition is code you can version control, review, and test, not XML, not YAML, not a UI-configured workflow.
- The scheduler is persistent. Airflow runs a continuous scheduler process that watches your DAG definitions and triggers tasks according to their schedule. You don't manually fire runs; the scheduler handles it.
A Minimal Airflow DAG
    from datetime import datetime

    from airflow import DAG
    from airflow.operators.bash_operator import BashOperator
    from airflow.operators.python_operator import PythonOperator


    def generate_report(ds, **kwargs):
        # Placeholder body; the real report logic is not shown in this post.
        # With provide_context=True, Airflow passes the run's context as
        # keyword arguments, including ds (the execution date, YYYY-MM-DD).
        print('Generating report for {}'.format(ds))


    default_args = {
        'owner': 'data-team',
        'start_date': datetime(2015, 9, 1),
        'retries': 1,
    }

    dag = DAG(
        'daily_session_pipeline',
        default_args=default_args,
        schedule_interval='0 2 * * *',  # 2am daily
    )

    # Fail fast if the upstream partition for this execution date is missing.
    check_upstream = BashOperator(
        task_id='check_upstream_data',
        bash_command='hdfs dfs -test -e /data/raw/events/{{ ds }}/SUCCESS',
        dag=dag,
    )

    run_aggregation = BashOperator(
        task_id='run_session_aggregation',
        bash_command='spark-submit --master yarn jobs/session_agg.py {{ ds }}',
        dag=dag,
    )

    # Read the date from the task context rather than templating op_kwargs,
    # which PythonOperator does not render as a template in this version.
    write_report = PythonOperator(
        task_id='write_daily_report',
        python_callable=generate_report,
        provide_context=True,
        dag=dag,
    )

    check_upstream >> run_aggregation >> write_report

The {{ ds }} placeholder is Airflow's Jinja templating: it resolves to the execution date string (YYYY-MM-DD) for each run. Every task in a DAG run gets the same execution date, which means your tasks can be written to process a specific partition date rather than relying on "today minus one" hacks.
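For intuition about what the substitution does, here is a simplified stand-in written with the standard library. Airflow itself uses Jinja, which does far more than this; the render helper below is hypothetical and only handles bare {{ name }} placeholders.

```python
import re


def render(template, context):
    """Replace each {{ name }} placeholder with the value of context[name]."""
    def substitute(match):
        return str(context[match.group(1)])
    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


command = "spark-submit --master yarn jobs/session_agg.py {{ ds }}"
print(render(command, {"ds": "2015-09-14"}))
# spark-submit --master yarn jobs/session_agg.py 2015-09-14
```

The point is that the command string is written once, and the date is filled in per run.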
The Execution Date Is the Key Idea
This is the thing that took me a moment to fully internalize. Airflow's execution date is the logical date of the data being processed, not the wall-clock time the task runs. A DAG scheduled for 2 a.m. on September 15th has an execution date of September 14th — because it's processing the previous day's data.
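The arithmetic is small enough to sketch. This assumes a daily schedule and the behavior described above, where a run covers the interval that has just ended; execution_date_for is a hypothetical helper, not an Airflow function.

```python
from datetime import datetime, timedelta


def execution_date_for(trigger_time, interval=timedelta(days=1)):
    """Logical date of a run firing at trigger_time.

    Assumes the run processes the schedule interval that just ended.
    """
    return trigger_time - interval


fired_at = datetime(2015, 9, 15, 2, 0)  # wall-clock time the run starts
logical = execution_date_for(fired_at)
print(logical.strftime("%Y-%m-%d"))
# 2015-09-14
```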
This means that when you backfill — when you tell Airflow to process the last six months because you added a new transformation — each historical run gets its correct execution date, and your tasks process the right partition. No manual date parameter passing. No "change the hardcoded date in the script" deployment step.
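Conceptually, a backfill is just one run per logical date in the range. The sketch below shows only that enumeration, with a hypothetical backfill_dates helper; Airflow's own backfill also resolves task dependencies within each run.

```python
from datetime import date, timedelta


def backfill_dates(start, end):
    """Yield one logical date per day from start to end, inclusive."""
    current = start
    while current <= end:
        yield current
        current += timedelta(days=1)


for day in backfill_dates(date(2015, 9, 1), date(2015, 9, 3)):
    ds = day.isoformat()  # same YYYY-MM-DD form as {{ ds }}
    print("would process /data/raw/events/{}/".format(ds))
```

Because every task takes its date from the run rather than from the clock, the same task code serves both the nightly run and the six-month backfill.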
What's Not Ready Yet
Honest assessment: Airflow in mid-2015 is early. The scheduler has known reliability issues under heavy load. The web UI works but is rough. The documentation is thin. The community is small but growing fast — the GitHub activity is encouraging.
I'm running it on internal pipelines and I'm watching it closely. The design is right. The execution will improve. If you're evaluating orchestration options for a new pipeline project, Airflow is worth putting on your shortlist alongside Luigi, even at this stage of maturity.
If you're already running Airflow and have hit the scheduler reliability issues, I'd like to compare notes. As always, I'm here to help.