Guide

Apache Airflow explained

Your analytics team runs forty-seven shell scripts on cron. Script B fails silently because script A did not finish. Nobody knows which version ran last Tuesday. A new engineer spends a week reverse-engineering dependencies from a shared /opt/etl folder. Apache Airflow replaces that chaos with explicit, version-controlled directed acyclic graphs (DAGs): each workflow is code, tasks declare upstream/downstream relationships, the scheduler fires runs on a timetable, and the web UI shows success (green), failure (red), and retry history for every execution. Airflow is not an ETL engine itself; it orchestrates Python, SQL, and bash jobs, sensors, and external triggers across your data stack.

This guide covers the Airflow architecture (scheduler, metadata DB, executor, workers), DAG anatomy and task dependencies, operators and sensors, scheduling intervals and backfill, executors from Local to Celery to Kubernetes, XComs and idempotent design, a Harbor Supply nightly analytics worked example, an orchestrator decision table, common pitfalls, and a practitioner checklist. It sits alongside our guides to ETL fundamentals and Kafka event streaming.

What Airflow orchestrates (and what it does not)

Airflow sits in the workflow orchestration layer. It decides when tasks run, in what order, and what to do on failure — but the heavy lifting (Spark jobs, dbt models, COPY into Snowflake, training a model) happens inside task code you write or delegate to operators. Think of Airflow as the conductor, not every instrument.

That separation matters when choosing tools. You do not replace Kafka with Airflow: Kafka moves real-time events; Airflow schedules batch windows that consume those events after they land in object storage or a warehouse. You do not replace CI/CD either — deploy pipelines build artifacts; Airflow DAGs run recurring business logic on a schedule. The sweet spot is ETL/ELT, report generation, ML feature rebuilds, data-quality checks, and any multi-step job that outgrew cron plus Slack alerts.

Core components

Scheduler — parses DAG files, creates DagRun instances for each logical date, enqueues ready tasks.
Metadata database — PostgreSQL or MySQL storing DAG definitions, run state, task instances, connections, and variables (the source of truth for the UI).
Executor — decides where task processes run (same machine, Celery workers, Kubernetes pods).
Web server — Flask-based UI (Airflow 2.x) for graph view, logs, manual triggers, and clearing tasks for rerun.
Workers — (with Celery/Kubernetes executors) processes that pick up queued task instances and execute operator code.

DAGs, tasks, and dependencies

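A DAG is a collection of tasks with dependency edges and no cycles. DAGs have always been Python files; Airflow 2.x adds the TaskFlow API, which defines them with decorators (the schedule argument used here requires 2.4 or later):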

from airflow.decorators import dag, task
from datetime import datetime

@dag(
    schedule="@daily",
    start_date=datetime(2026, 1, 1),
    catchup=False,
    tags=["analytics"],
)
def harbor_nightly_orders():
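    @task
    def extract(ds=None):
        # Airflow injects the logical date string (YYYY-MM-DD) as `ds` at run time.
        # Jinja such as {{ ds }} is not rendered inside a @task return value.
        return f"s3://harbor/raw/orders/{ds}/"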

    @task
    def transform(path: str):
        ...

    @task
    def load(path: str):
        ...

    raw = extract()
    staged = transform(raw)
    load(staged)

harbor_nightly_orders()

The @dag decorator sets the schedule, start date, catchup behavior, and tags. Each @task function (or classic PythonOperator) becomes a node; passing one task's return value as an argument to another wires the edge. The scheduler only runs a downstream task when all upstream tasks for that logical date have succeeded (or been skipped, per trigger rules).
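
The dependency rule above can be sketched without Airflow at all. The following is a toy model, not the scheduler's implementation: it uses the standard library's graphlib to order the three tasks from the example and a small hypothetical helper, ready_tasks, to show which tasks become runnable as upstream tasks succeed.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each key lists the upstream tasks it waits on.
upstream = {
    "extract": set(),
    "transform": {"extract"},
    "load": {"transform"},
}

def run_order(graph):
    """Return task names in an order that respects every upstream edge."""
    return list(TopologicalSorter(graph).static_order())

def ready_tasks(graph, succeeded):
    """Tasks not yet run whose upstream tasks have all succeeded."""
    return sorted(
        name for name, deps in graph.items()
        if name not in succeeded and deps <= succeeded
    )

order = run_order(upstream)
first_wave = ready_tasks(upstream, succeeded=set())
second_wave = ready_tasks(upstream, succeeded={"extract"})
```

Only extract is runnable at first; transform becomes runnable once extract has succeeded, which mirrors the default all_success behavior.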

Logical dates and data intervals

Airflow's execution_date (renamed logical_date in Airflow 2.2, and equal to data_interval_start for scheduled runs) is the start of the period the run represents, not when the scheduler fired. A DAG scheduled @daily with logical date 2026-06-07 processes data for June 7th and typically starts shortly after midnight on June 8th. Template variables like {{ ds }} (date string) and {{ data_interval_start }} let tasks parameterize SQL and paths without hard-coding wall-clock time, which is critical for idempotent reruns.
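
To make the date arithmetic concrete, here is a minimal sketch in plain Python, assuming a daily schedule in UTC. The helper daily_interval is hypothetical and only mimics what Airflow computes; the path reuses the one from the DAG example.

```python
from datetime import datetime, timedelta, timezone

def daily_interval(logical_date):
    """Return (start, end) of the period a daily run covers."""
    start = logical_date.replace(hour=0, minute=0, second=0, microsecond=0)
    return start, start + timedelta(days=1)

logical_date = datetime(2026, 6, 7, tzinfo=timezone.utc)
start, end = daily_interval(logical_date)

# The string a {{ ds }} template would render to for this run.
ds = start.strftime("%Y-%m-%d")
partition_path = f"s3://harbor/raw/orders/{ds}/"

# The run cannot start before its interval has closed.
earliest_start = end
```

The run labeled 2026-06-07 reads the June 7th partition but cannot begin before June 8th at 00:00 UTC, because the interval it covers has to be complete first.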

Trigger rules

Default all_success waits for every upstream task. Use none_failed_min_one_success for optional branches (e.g. "run report if either full load or incremental succeeded"). Misconfigured trigger rules are a top cause of tasks stuck in upstream_failed or surprise skips.
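
The difference between these rules is easier to see as a truth table. The function below is a simplified model of three trigger rules, evaluated once every upstream task has finished; it is an illustration under that assumption, not Airflow's own code.

```python
def should_run(rule, upstream_states):
    """Toy evaluation of a trigger rule against finished upstream task states."""
    states = list(upstream_states)
    if rule == "all_success":
        # In Airflow a skipped upstream causes this task to be skipped too;
        # here that is modeled simply as "does not run".
        return all(s == "success" for s in states)
    if rule == "none_failed_min_one_success":
        no_failures = all(s in ("success", "skipped") for s in states)
        return no_failures and "success" in states
    if rule == "all_done":
        return True  # runs regardless of how upstream tasks finished
    raise ValueError(f"unsupported rule: {rule}")

# One branch ran, the other was skipped (e.g. incremental load chosen over full load).
branch = ["success", "skipped"]
report_default = should_run("all_success", branch)
report_optional = should_run("none_failed_min_one_success", branch)
```

With one branch skipped, the default rule would not run the report, while none_failed_min_one_success would.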

Operators, sensors, and deferrable tasks

An operator is a template for a unit of work. Built-ins cover common integrations:

PythonOperator / @task — arbitrary Python callables.
BashOperator — shell commands (use sparingly; prefer Python for testability).
SQLExecuteQueryOperator — run SQL against Postgres, Snowflake, BigQuery, etc.
S3KeySensor, ExternalTaskSensor — wait for an S3 object or another DAG's task.
Provider packages — hundreds of hooks for AWS, GCP, dbt, Snowflake, Kubernetes, and more.

Sensors poll until a condition is true (file lands, partition appears, upstream DAG completes). Classic sensors hold a worker slot while polling — expensive at scale. Deferrable operators (Airflow 2.2+) release the worker between polls and resume via a triggerer process, making "wait for S3 prefix" practical without tying up hundreds of workers.
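
A rough back-of-the-envelope model shows why this matters. The numbers are illustrative assumptions (a two-hour wait, a 60-second poke interval, one second per check), and the model treats time spent waiting in the triggerer or between rescheduled pokes as costing no worker slot.

```python
def slot_seconds(wait_seconds, poke_interval, poke_cost, hold_slot_while_waiting):
    """Worker-slot seconds one sensor consumes until its condition turns true."""
    if hold_slot_while_waiting:
        return wait_seconds                  # slot is occupied for the entire wait
    pokes = wait_seconds // poke_interval    # number of checks performed
    return pokes * poke_cost                 # slot is occupied only during each check

two_hours = 2 * 60 * 60
blocking = slot_seconds(two_hours, poke_interval=60, poke_cost=1, hold_slot_while_waiting=True)
releasing = slot_seconds(two_hours, poke_interval=60, poke_cost=1, hold_slot_while_waiting=False)
```

Under these assumptions a blocking sensor consumes 7200 slot-seconds against 120 for one that releases its slot, a 60-fold difference per sensor.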

TaskGroups collapse related tasks in the UI graph without changing dependencies. Dynamic task mapping (.expand()) generates parallel task instances from a list at runtime — e.g. one transform per country code discovered during extract.

Scheduling, catchup, and backfill

Schedules can be cron expressions (0 6 * * *), presets (@daily, @hourly), or custom Timetable classes. Set catchup=False on new DAGs so Airflow does not enqueue hundreds of historical runs the moment you deploy — a classic Day-1 incident.
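
The size of that Day-1 backlog is simple arithmetic. As a sketch, assuming a daily schedule and hypothetical dates (a start_date of 2026-01-01 and a deploy on 2026-06-08), the number of already-completed intervals is:

```python
from datetime import date

def missed_daily_runs(start_date, deploy_date):
    """Number of complete daily intervals between start_date and deploy_date."""
    return max((deploy_date - start_date).days, 0)

# Runs a daily DAG would owe on deploy if catchup were left enabled.
backlog = missed_daily_runs(date(2026, 1, 1), date(2026, 6, 8))
```

Here backlog is 158 runs, all eligible for scheduling at once if catchup is enabled.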

Backfill replays a date range deliberately: useful after fixing a bug in transform logic or onboarding a DAG with historical data. In Airflow 2.x, the CLI command airflow dags backfill with a start and end date creates one run per logical date in the range; the UI's "Trigger DAG w/ config" creates a single manual run instead. Backfilled tasks must be idempotent: write to partitioned tables with INSERT OVERWRITE or merge keys, not blind appends that duplicate rows.
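
One way to get that idempotency is delete-then-insert per logical date inside a single transaction. The sketch below uses an in-memory SQLite database as a stand-in for a warehouse; the orders table, its columns, and the load_partition helper are hypothetical.

```python
import sqlite3

def load_partition(conn, ds, rows):
    """Replace the partition for one logical date so reruns do not duplicate rows."""
    with conn:  # one transaction: the delete and the insert commit together
        conn.execute("DELETE FROM orders WHERE ds = ?", (ds,))
        conn.executemany(
            "INSERT INTO orders (ds, order_id, amount) VALUES (?, ?, ?)",
            [(ds, order_id, amount) for order_id, amount in rows],
        )

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (ds TEXT, order_id INTEGER, amount REAL)")

rows = [(1, 19.99), (2, 5.00)]
load_partition(conn, "2026-06-07", rows)
load_partition(conn, "2026-06-07", rows)   # simulated backfill rerun of the same date

row_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]
```

After two loads of the same date the table still holds two rows; a blind INSERT would have left four.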

SLAs alert when runs exceed expectations, while timeouts (execution_timeout per task, dagrun_timeout per DAG run) fail the task or run outright. Pair them with retries and retry_delay for transient failures (API 503, warehouse resume). Do not retry non-idempotent side effects without deduplication keys.
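
When retries use exponential backoff, the waits grow quickly, so it helps to see the schedule before choosing values. The function below is a simplified doubling model with a cap; Airflow's own backoff (enabled with retry_exponential_backoff and bounded by max_retry_delay) also adds jitter, so treat these figures as an approximation rather than the exact delays.

```python
from datetime import timedelta

def backoff_delays(retries, retry_delay, max_delay=None):
    """Delay before each retry, doubling from retry_delay and capped at max_delay."""
    delays = []
    for attempt in range(retries):
        delay = retry_delay * (2 ** attempt)
        if max_delay is not None:
            delay = min(delay, max_delay)
        delays.append(delay)
    return delays

delays = backoff_delays(
    retries=3,
    retry_delay=timedelta(minutes=10),
    max_delay=timedelta(minutes=30),
)
total_wait = sum(delays, timedelta())
```

Three retries starting at 10 minutes and capped at 30 add up to an hour of waiting, which has to fit inside any SLA or dagrun_timeout on the same DAG.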

Executors: where tasks actually run

Executor	How it works	Best for
Sequential	One task at a time on the scheduler machine	Local dev only
Local	Parallel subprocesses on one host	Small teams, single VM, < ~20 concurrent tasks
Celery	Queue tasks to a pool of worker VMs via Redis/RabbitMQ	Medium scale, fixed worker fleet, mature ops patterns
Kubernetes	Each task instance becomes a pod	Elastic scale, heavy/isolated jobs, existing K8s platform

Production almost always separates the scheduler, webserver, metadata DB, and workers. Managed offerings (AWS MWAA, Google Cloud Composer, Astronomer) hide executor plumbing. On Kubernetes, use resource requests/limits per task via executor_config so a single Spark submit does not OOM the node — see our Kubernetes fundamentals guide for pod scheduling basics.

XComs, connections, and secrets

XCom (cross-communication) passes small metadata between tasks — a file path, row count, partition list. Task decorators return values that downstream tasks receive as arguments. XComs live in the metadata DB; do not push multi-gigabyte DataFrames — write to S3/GCS and pass the URI. For large artifacts, object storage is the handoff layer; XCom is the sticky note on top.
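
The pass-a-reference pattern looks like this in miniature. The sketch uses a temporary local file in place of S3/GCS and plain functions in place of tasks; extract_to_file and transform are hypothetical names, and the JSON-encoded dictionary stands in for what would travel through XCom.

```python
import csv
import json
import os
import tempfile

def extract_to_file(rows, directory):
    """Write rows to a file and return a small, JSON-serializable reference to it."""
    path = os.path.join(directory, "orders.csv")
    with open(path, "w", newline="") as handle:
        csv.writer(handle).writerows(rows)
    return {"uri": path, "row_count": len(rows)}

def transform(ref):
    """Read the artifact by reference instead of receiving the data itself."""
    with open(ref["uri"], newline="") as handle:
        return sum(float(amount) for _order_id, amount in csv.reader(handle))

with tempfile.TemporaryDirectory() as workdir:
    rows = [(i, 2.5) for i in range(10_000)]
    ref = extract_to_file(rows, workdir)
    payload_bytes = len(json.dumps(ref))   # size of what the metadata DB would store
    total = transform(ref)
```

Ten thousand rows move through storage, while the handoff between the two steps stays a few dozen bytes.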

Connections and Variables store credentials and config outside DAG code. Use Airflow's secrets backend integration (AWS Secrets Manager, Vault, GCP Secret Manager) rather than plaintext in the UI. DAG files themselves belong in git; rotate connection passwords without redeploying DAG logic.
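
One lookup path Airflow supports is an environment variable named AIRFLOW_CONN_<CONN_ID> (connection id in upper case) holding a connection URI. The sketch below parses such a URI with the standard library to show what the pieces mean; connection_from_env is a hypothetical stand-in for Airflow's own lookup, and the connection id, host, and credentials are made up.

```python
import os
from urllib.parse import unquote, urlsplit

def connection_from_env(conn_id, environ=os.environ):
    """Parse AIRFLOW_CONN_<CONN_ID> into its parts (not Airflow's own parser)."""
    uri = environ[f"AIRFLOW_CONN_{conn_id.upper()}"]
    parts = urlsplit(uri)
    return {
        "conn_type": parts.scheme,
        "host": parts.hostname,
        "port": parts.port,
        "login": unquote(parts.username or ""),
        "password": unquote(parts.password or ""),  # special characters are percent-encoded
        "schema": parts.path.lstrip("/"),
    }

# Hypothetical value; in practice the deployment injects it, and it never appears in DAG code.
fake_env = {
    "AIRFLOW_CONN_HARBOR_WAREHOUSE": "postgres://etl_user:s3cr%40t@db.internal:5432/analytics"
}
conn = connection_from_env("harbor_warehouse", fake_env)
```

Because the DAG refers only to the connection id, the password can be rotated in the environment or secrets backend without touching the DAG file.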

Worked example: Harbor Supply nightly analytics

Harbor Supply runs a daily orders pipeline feeding finance and merchandising dashboards. The DAG harbor_nightly_orders runs on the schedule 0 5 * * * UTC (after the US East Coast midnight close; 05:00 UTC is 00:00 EST). Task graph:

sensor_wait_landing — deferrable S3 sensor on s3://harbor-raw/orders/dt={{ ds }}/_SUCCESS with 4-hour timeout.
extract_validate — Python task counts files, checks schema hash, fails fast if column drift detected.
transform_orders — runs dbt models stg_orders and fct_orders for logical date {{ ds }} (idempotent merge).
data_quality — SQL checks: null rate < 0.1%, revenue within 3σ of trailing 30-day average.
load_dashboard — refreshes Looker PDT or exports Parquet to the BI bucket.
notify_slack — posts summary XCom (rows processed, revenue delta); trigger rule all_done so failures still alert.

Executor: Celery with 8 workers on r6i.xlarge instances; transform task requests 4 vCPU / 16 GiB. Upstream dependency: real-time checkout events land via Kafka Connect into the raw bucket hourly; the sensor gates on daily compaction completing. On failure, retries=2 with 10-minute exponential backoff; SLA email if not green by 08:00 UTC.
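
The two checks in the data_quality task can be sketched in plain Python. In the pipeline itself they would run as SQL against the warehouse; the functions and sample numbers below are made up for illustration, with the thresholds taken from the task description (null rate below 0.1%, revenue within 3σ of the trailing 30-day average).

```python
from statistics import mean, stdev

def null_rate_ok(values, threshold=0.001):
    """True when the share of missing values is below the threshold (0.1%)."""
    nulls = sum(1 for v in values if v is None)
    return nulls / len(values) < threshold

def revenue_in_band(today, trailing, sigmas=3.0):
    """True when today's revenue is within `sigmas` standard deviations of the trailing mean."""
    center = mean(trailing)
    spread = stdev(trailing)
    return abs(today - center) <= sigmas * spread

# Made-up numbers for illustration only.
trailing_30 = [100_000 + 1_000 * (day % 5) for day in range(30)]
customer_ids = [1] * 1999 + [None]          # 1 null in 2000 rows = 0.05%

normal_day = revenue_in_band(103_500, trailing_30)
outlier_day = revenue_in_band(150_000, trailing_30)
nulls_ok = null_rate_ok(customer_ids)
```

A task wrapping these checks would raise an exception when either returns False, so that load_dashboard does not run on suspect data.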

Airflow vs Prefect vs Dagster vs cron decision table

Need	Best fit	Why
Mature UI, huge provider ecosystem, enterprise adoption	Apache Airflow	Largest community, MWAA/Composer managed options, extensive operators
Python-native dynamic flows, minimal YAML	Prefect	Modern API, cloud-first observability, easier local dev for some teams
Data assets, lineage, and testing as first-class	Dagster	Software-defined assets, strong for analytics engineering culture
Single script, no dependencies, runs in 30 seconds	cron + systemd timer	Zero orchestration overhead until you need retries, UI, or DAG graphs
Real-time stream processing	Kafka + Flink / Kafka Streams	Sub-minute latency; Airflow is batch-oriented (minutes to hours)
Ad-hoc notebook exploration	Jupyter / Hex	Not production scheduling — promote stable logic into DAG tasks

Common pitfalls

Top-level DAG code side effects — DB calls or heavy imports at parse time slow the scheduler and can enqueue accidental runs; keep parse-time work minimal.
catchup=True on deploy — floods workers with months of backfill; always set catchup=False unless you explicitly want history.
Non-idempotent loads — reruns duplicate rows; use partitions, MERGE, or delete-then-insert per {{ ds }}.
Sensors without reschedule/defer — a sensor in the default poke mode holds a worker slot for its entire wait; use deferrable sensors or mode="reschedule".
Giant monolithic DAGs — one failure blocks unrelated reports; split by domain with ExternalTaskSensor for true cross-DAG deps.
Storing secrets in DAG repos — connections belong in secrets backends, not Variable defaults committed to git.
Ignoring pool and priority — heavy Spark tasks starve lightweight SQL unless you assign pools (spark_pool, default_pool).
Mutable start_date — changing start_date creates new scheduling behavior; treat it as immutable after production launch.

Practitioner checklist

Define DAGs as Python in git; review changes like application code.
Set catchup=False for new DAGs; backfill deliberately with date bounds.
Parameterize tasks with {{ ds }} / data interval templates, not datetime.now().
Make every load idempotent for its logical date partition.
Use deferrable sensors for external file/table readiness.
Configure retries only for transient, idempotent failures.
Assign pools and execution_timeout per task weight class.
Store credentials in a secrets backend; never in DAG source.
Monitor scheduler heartbeat, task duration p95, and SLA misses — not only task success rate.
Document owner, on-call rotation, and downstream consumers in DAG doc_md.

Key takeaways

Airflow orchestrates when and in what order batch workflows run — it is not a stream processor or warehouse.
DAGs encode dependencies explicitly; the scheduler materializes one run per logical date with full audit history.
Logical dates represent data periods, not wall-clock launch time — design tasks around {{ ds }} for safe reruns.
Executors scale from local dev to Celery workers to per-task Kubernetes pods; pick based on isolation and elasticity needs.
Pair Airflow with idempotent ELT, deferrable sensors, and secrets backends — and reach for simpler cron until dependency graphs genuinely hurt.