Guide
Apache Airflow explained
Your analytics team runs forty-seven shell scripts on cron. Script B fails silently
because script A did not finish. Nobody knows which version ran last Tuesday. A new
engineer spends a week reverse-engineering dependencies from a shared
/opt/etl folder. Apache Airflow replaces that chaos with
explicit, version-controlled directed acyclic graphs (DAGs): each workflow
is code, tasks declare upstream/downstream relationships, the scheduler fires runs on
a timetable, and the web UI shows green, red, and retry history for every execution.
Airflow is not an ETL engine itself — it orchestrates Python/SQL/bash jobs,
sensors, and external triggers across your data stack. This guide covers the Airflow
architecture (scheduler, metadata DB, executor, workers), DAG anatomy and task
dependencies, operators and sensors, scheduling intervals and backfill, executors from
Local to Celery to Kubernetes, XComs and idempotent design, a Harbor Supply nightly
analytics worked example, an orchestrator decision table, common pitfalls, and a
practitioner checklist alongside
ETL fundamentals
and
Kafka event streaming.
What Airflow orchestrates (and what it does not)
Airflow sits in the workflow orchestration layer. It decides when tasks run, in what order, and what to do on failure — but the heavy lifting (Spark jobs, dbt models, COPY into Snowflake, training a model) happens inside task code you write or delegate to operators. Think of Airflow as the conductor, not every instrument.
That separation matters when choosing tools. You do not replace Kafka with Airflow: Kafka moves real-time events; Airflow schedules batch windows that consume those events after they land in object storage or a warehouse. You do not replace CI/CD either — deploy pipelines build artifacts; Airflow DAGs run recurring business logic on a schedule. The sweet spot is ETL/ELT, report generation, ML feature rebuilds, data-quality checks, and any multi-step job that outgrew cron plus Slack alerts.
Core components
- Scheduler — parses DAG files, creates
DagRuninstances for each logical date, enqueues ready tasks. - Metadata database — PostgreSQL or MySQL storing DAG definitions, run state, task instances, connections, and variables (the source of truth for the UI).
- Executor — decides where task processes run (same machine, Celery workers, Kubernetes pods).
- Web server — Flask UI for graph view, logs, manual triggers, and backfill.
- Workers — (with Celery/Kubernetes executors) processes that pick up queued task instances and execute operator code.
DAGs, tasks, and dependencies
A DAG is a collection of tasks with dependency edges and no cycles. Modern Airflow (2.x) defines DAGs in Python:
from airflow.decorators import dag, task
from datetime import datetime
@dag(
schedule="@daily",
start_date=datetime(2026, 1, 1),
catchup=False,
tags=["analytics"],
)
def harbor_nightly_orders():
@task
def extract():
return "s3://harbor/raw/orders/{{ ds }}/"
@task
def transform(path: str):
...
@task
def load(path: str):
...
raw = extract()
staged = transform(raw)
load(staged)
harbor_nightly_orders()
The @dag decorator sets schedule, start date, and default args. Task
decorators (or classic PythonOperator) become nodes; calling one task's
return value inside another wires the edge. The scheduler only runs a downstream task
when all upstream tasks for that logical_date have succeeded (or been
skipped per trigger rules).
Logical dates and data intervals
Airflow's execution_date (now often called logical date or
data_interval_start) is the start of the period the run represents,
not when the scheduler fired. A DAG scheduled @daily with logical date
2026-06-07 processes data for June 7th, typically starting shortly after midnight on
June 8th. Template variables like {{ ds }} (date string) and
{{ data_interval_start }} let tasks parameterize SQL and paths without
hard-coding wall-clock time — critical for idempotent reruns.
Trigger rules
Default all_success waits for every upstream task. Use
none_failed_min_one_success for optional branches (e.g. "run report if
either full load or incremental succeeded"). Misconfigured trigger rules are a top
cause of tasks stuck in upstream_failed or surprise skips.
Operators, sensors, and deferrable tasks
An operator is a template for a unit of work. Built-ins cover common integrations:
PythonOperator/@task— arbitrary Python callables.BashOperator— shell commands (use sparingly; prefer Python for testability).SQLExecuteQueryOperator— run SQL against Postgres, Snowflake, BigQuery, etc.S3FileSensor,ExternalTaskSensor— wait for files or another DAG's task.- Provider packages — hundreds of hooks for AWS, GCP, dbt, Snowflake, Kubernetes, and more.
Sensors poll until a condition is true (file lands, partition appears, upstream DAG completes). Classic sensors hold a worker slot while polling — expensive at scale. Deferrable operators (Airflow 2.2+) release the worker between polls and resume via a triggerer process, making "wait for S3 prefix" practical without tying up hundreds of workers.
TaskGroups collapse related tasks in the UI graph without changing
dependencies. Dynamic task mapping (.expand()) generates
parallel task instances from a list at runtime — e.g. one transform per country code
discovered during extract.
Scheduling, catchup, and backfill
Schedules can be cron expressions (0 6 * * *), presets (@daily,
@hourly), or custom Timetable classes. Set
catchup=False on new DAGs so Airflow does not enqueue hundreds of
historical runs the moment you deploy — a classic Day-1 incident.
Backfill replays a date range deliberately: useful after fixing a bug in
transform logic or onboarding a DAG with historical data. The CLI
airflow dags backfill or UI "Trigger DAG w/ config" creates runs per
logical date. Backfill must be idempotent — writing to partitioned
tables with INSERT OVERWRITE or merge keys, not blind appends that
duplicate rows.
SLAs and timeouts (execution_timeout,
dagrun_timeout) alert when runs exceed expectations. Pair with
retries and retry_delay for transient failures (API 503,
warehouse resume). Do not retry non-idempotent side effects without deduplication keys.
Executors: where tasks actually run
| Executor | How it works | Best for |
|---|---|---|
| Sequential | One task at a time on the scheduler machine | Local dev only |
| Local | Parallel subprocesses on one host | Small teams, single VM, < ~20 concurrent tasks |
| Celery | Queue tasks to a pool of worker VMs via Redis/RabbitMQ | Medium scale, fixed worker fleet, mature ops patterns |
| Kubernetes | Each task instance becomes a pod | Elastic scale, heavy/isolated jobs, existing K8s platform |
Production almost always separates the scheduler, webserver, metadata DB, and workers.
Managed offerings (AWS MWAA, Google Cloud Composer, Astronomer) hide executor plumbing.
On Kubernetes, use resource requests/limits per task via executor_config
so a single Spark submit does not OOM the node — see our
Kubernetes fundamentals guide
for pod scheduling basics.
XComs, connections, and secrets
XCom (cross-communication) passes small metadata between tasks — a file path, row count, partition list. Task decorators return values that downstream tasks receive as arguments. XComs live in the metadata DB; do not push multi-gigabyte DataFrames — write to S3/GCS and pass the URI. For large artifacts, object storage is the handoff layer; XCom is the sticky note on top.
Connections and Variables store credentials and config outside DAG code. Use Airflow's secrets backend integration (AWS Secrets Manager, Vault, GCP Secret Manager) rather than plaintext in the UI. DAG files themselves belong in git; rotate connection passwords without redeploying DAG logic.
Worked example: Harbor Supply nightly analytics
Harbor Supply runs a daily orders pipeline feeding finance and merchandising dashboards.
DAG harbor_nightly_orders schedule: 0 5 * * * UTC (after US
West Coast midnight close). Task graph:
- sensor_wait_landing — deferrable S3 sensor on
s3://harbor-raw/orders/dt={{ ds }}/_SUCCESSwith 4-hour timeout. - extract_validate — Python task counts files, checks schema hash, fails fast if column drift detected.
- transform_orders — runs dbt models
stg_ordersandfct_ordersfor logical date{{ ds }}(idempotent merge). - data_quality — SQL checks: null rate < 0.1%, revenue within 3σ of trailing 30-day average.
- load_dashboard — refreshes Looker PDT or exports Parquet to the BI bucket.
- notify_slack — posts summary XCom (rows processed, revenue delta);
trigger rule
all_doneso failures still alert.
Executor: Celery with 8 workers on r6i.xlarge instances; transform task
requests 4 vCPU / 16 GiB. Upstream dependency: real-time checkout events land via
Kafka Connect into the raw bucket hourly; the sensor gates on daily compaction completing.
On failure, retries=2 with 10-minute exponential backoff; SLA email if not green by 08:00 UTC.
Airflow vs Prefect vs Dagster vs cron decision table
| Need | Best fit | Why |
|---|---|---|
| Mature UI, huge provider ecosystem, enterprise adoption | Apache Airflow | Largest community, MWAA/Composer managed options, extensive operators |
| Python-native dynamic flows, minimal YAML | Prefect | Modern API, cloud-first observability, easier local dev for some teams |
| Data assets, lineage, and testing as first-class | Dagster | Software-defined assets, strong for analytics engineering culture |
| Single script, no dependencies, runs in 30 seconds | cron + systemd timer | Zero orchestration overhead until you need retries, UI, or DAG graphs |
| Real-time stream processing | Kafka + Flink / Kafka Streams | Sub-minute latency; Airflow is batch-oriented (minutes to hours) |
| Ad-hoc notebook exploration | Jupyter / Hex | Not production scheduling — promote stable logic into DAG tasks |
Common pitfalls
- Top-level DAG code side effects — DB calls or heavy imports at parse time slow the scheduler and can enqueue accidental runs; keep parse-time work minimal.
catchup=Trueon deploy — floods workers with months of backfill; always setcatchup=Falseunless you explicitly want history.- Non-idempotent loads — reruns duplicate rows; use partitions,
MERGE, or delete-then-insert per{{ ds }}. - Sensors without reschedule/defer — one worker per sensor per minute;
use deferrable sensors or
mode="reschedule". - Giant monolithic DAGs — one failure blocks unrelated reports; split by
domain with
ExternalTaskSensorfor true cross-DAG deps. - Storing secrets in DAG repos — connections belong in secrets backends,
not
Variabledefaults committed to git. - Ignoring pool and priority — heavy Spark tasks starve lightweight SQL
unless you assign pools (
spark_pool,default_pool). - Mutable
start_date— changing start_date creates new scheduling behavior; treat it as immutable after production launch.
Practitioner checklist
- Define DAGs as Python in git; review changes like application code.
- Set
catchup=Falsefor new DAGs; backfill deliberately with date bounds. - Parameterize tasks with
{{ ds }}/ data interval templates, notdatetime.now(). - Make every load idempotent for its logical date partition.
- Use deferrable sensors for external file/table readiness.
- Configure retries only for transient, idempotent failures.
- Assign pools and
execution_timeoutper task weight class. - Store credentials in a secrets backend; never in DAG source.
- Monitor scheduler heartbeat, task duration p95, and SLA misses — not only task success rate.
- Document owner, on-call rotation, and downstream consumers in DAG
doc_md.
Key takeaways
- Airflow orchestrates when and in what order batch workflows run — it is not a stream processor or warehouse.
- DAGs encode dependencies explicitly; the scheduler materializes one run per logical date with full audit history.
- Logical dates represent data periods, not wall-clock launch time —
design tasks around
{{ ds }}for safe reruns. - Executors scale from local dev to Celery workers to per-task Kubernetes pods; pick based on isolation and elasticity needs.
- Pair Airflow with idempotent ELT, deferrable sensors, and secrets backends — and reach for simpler cron until dependency graphs genuinely hurt.
Related reading
- ETL and ELT data pipelines explained — extract, transform, load patterns Airflow schedules
- Apache Kafka explained — real-time ingestion that feeds batch landing zones
- CI/CD pipelines explained — deploy DAG code separately from runtime orchestration
- Kubernetes fundamentals explained — pod-per-task executor patterns and resource limits