Guide
ETL and ELT data pipelines explained
Your product database knows every order, click, and signup. Your analytics team needs dashboards, cohort reports, and ML features built from that history — without slowing down the live app. Data pipelines move information from operational systems into warehouses and lakes where analysts can query freely. The two dominant patterns are ETL (extract, transform, load) and ELT (extract, load, transform). This guide explains when each fits, how batch and streaming pipelines differ, orchestration with tools like Airflow and dbt, incremental and idempotent loads, schema evolution, data quality gates, and a checklist before you ship pipelines that silently corrupt downstream metrics.
What ETL and ELT actually do
Every pipeline has three conceptual stages, even if tooling blurs the boundaries:
- Extract — read data from a source: Postgres tables, SaaS APIs, log files, S3 objects, or a change data capture stream.
- Transform — clean, join, deduplicate, aggregate, and apply business rules (currency conversion, session stitching, PII masking).
- Load — write results into a destination: Snowflake, BigQuery, Redshift, DuckDB, or an object store.
In ETL, transformation happens before loading — often on a separate compute cluster (Spark, custom Python workers) that outputs warehouse-ready tables. In ELT, raw data lands in the warehouse first; SQL-based transforms (dbt models, stored procedures) run inside the warehouse's elastic compute. Cloud warehouses made ELT popular because their columnar engines and separation of storage and compute make heavy SQL transforms cheap and versionable.
Neither is universally better. ETL suits strict egress controls (mask PII before it leaves the source region), heavy Python/ML preprocessing, or legacy sources that cannot push raw dumps into a warehouse. ELT suits teams that want analysts to explore raw landing tables and iterate on transforms in SQL with git-backed dbt projects.
Batch vs streaming pipelines
Batch pipelines run on a schedule — hourly, nightly, or
weekly. They read a bounded slice of data (yesterday's partition, rows
where updated_at > last_watermark), transform it, and
load results. Batch is simpler to reason about, cheaper at small scale,
and sufficient when dashboards tolerate minutes or hours of lag.
Streaming pipelines process events continuously as they arrive — Kafka consumers, Flink jobs, or warehouse streaming inserts. Latency drops to seconds, which matters for fraud detection, operational alerts, and real-time personalization. Complexity rises: you need exactly-once or at-least-once semantics, windowed aggregations, and handling of late-arriving events.
Micro-batch as a middle ground
Spark Structured Streaming and similar frameworks process data in small triggered batches (every 30 seconds) while presenting a streaming API. For many teams, micro-batch delivers "fast enough" freshness without operating a full event-time watermark stack on day one.
Choosing batch or stream
Start batch unless a product requirement demands sub-minute freshness. Nightly ETL covering 95% of analytics needs leaves streaming for the narrow paths that justify the operational cost. Hybrid architectures are common: CDC streams feed a real-time layer while nightly batch reconciles and backfills gaps.
Pipeline architecture layers
Mature data platforms organize work into layers with clear contracts:
- Bronze (raw landing) — append-only copies of source data, minimal transformation, full history preserved for replay.
- Silver (cleaned) — deduplicated, typed, conformed schemas; surrogate keys; slowly changing dimension handling.
- Gold (curated) — business-level aggregates and fact tables ready for BI tools and ML feature stores.
This medallion pattern isolates breakage: a bad transform in gold can be
rebuilt from silver without re-extracting from production databases.
Document column lineage so analysts know whether revenue_usd
includes tax and refunds.
Sources and extraction patterns
Full-table snapshots are simplest but do not scale. Prefer
incremental extraction using a monotonic cursor
(updated_at, auto-increment ID, or binlog offset). API
sources need pagination discipline and respect for rate limits — reuse
patterns from
API pagination
and
exponential backoff
when polling SaaS endpoints.
For operational databases, log-based CDC captures every row change without expensive full scans. Pair CDC with periodic snapshot reconciliation to catch connector gaps.
Orchestration and tooling
A pipeline is more than one script — it is a DAG of dependent tasks with retries, alerting, and backfill support.
Apache Airflow
Airflow schedules DAGs of Python operators: extract task, then transform, then load, with sensors waiting on upstream partitions. Strengths include rich ecosystem, backfill by date range, and explicit dependency graphs. Weaknesses: operational overhead (scheduler, workers, metadata DB) and temptation to embed heavy transforms in Python instead of the warehouse.
dbt (data build tool)
dbt manages SQL transforms as versioned models with ref()
dependencies, tests, and documentation. It fits the ELT pattern: raw
tables land via Fivetran, Airbyte, or custom loaders; dbt builds silver
and gold. Built-in tests catch null primary keys and referential
violations before executives see broken charts.
Managed ingestion services
Fivetran, Airbyte, and cloud-native connectors handle schema drift, incremental sync, and API auth — trading control for speed. Evaluate total cost at your row volume; self-hosted Airbyte on Kubernetes shifts cost to engineering time.
Stream processors
Kafka + Flink, Spark Streaming, or managed Kinesis/GCP Dataflow handle high-throughput transforms with stateful windows. Integrate with event-driven architecture patterns when the product already publishes domain events to a bus.
Incremental loads and idempotency
Pipelines fail and restart. Without idempotent loads, a retry doubles revenue or duplicates users in your warehouse.
Merge (upsert) patterns
Load into a staging table, then MERGE INTO the target on
primary key — update changed rows, insert new ones, optionally delete
missing keys if the source is a full snapshot. Snowflake, BigQuery, and
Postgres support merge semantics; older systems use delete-and-insert
per partition.
Watermark tracking
Store the last successful cursor (max updated_at, binlog
position) in a metadata table. On restart, resume from the watermark, not
from scratch. Clock skew on source systems can skip rows if you use
> instead of >= — overlap windows by a
few minutes and deduplicate in transform.
Partitioning and late data
Partition fact tables by event_date. Reprocess the last
N days on each run to absorb late arrivals without full reloads.
Document the lateness SLA ("events may arrive up to 48 hours late") so
analysts do not misinterpret moving numbers.
Schema evolution and data quality
Source teams add columns, rename fields, and change types without asking analytics. Pipelines must degrade gracefully.
- Schema registry — version Avro/Protobuf/JSON schemas for streaming events; reject or quarantine incompatible payloads.
- Nullable new columns — additive changes are safe; renames should go through a dual-write period with both old and new fields populated.
- Breaking type changes — treat as a new column; never silently cast strings to integers.
Data quality checks
Run assertions before promoting data to gold:
- Row count within expected bounds vs yesterday (anomaly detection).
- Null rate on critical columns below threshold.
- Referential integrity between facts and dimensions.
- Sum of order amounts matches finance source within tolerance.
Fail the pipeline or route bad batches to a quarantine table — never silently pass corrupt data. Alert on-call when checks fail; downstream dashboards should show a staleness banner, not stale numbers presented as current.
Performance, cost, and safety
Warehouse compute is billed by the second. Full-table scans on billion-row facts burn budget fast.
- Clustering and partitioning — align physical layout with filter columns (
event_date,tenant_id). - Incremental models — dbt incremental materializations process only new rows.
- Off-peak scheduling — run heavy transforms when spot/preemptible pricing applies.
- Read replicas — extract from read replicas, never the primary, for large batch pulls.
Security and compliance
Mask or hash PII in bronze if regulations require it. Restrict warehouse roles so raw landing tables are not world-readable. Audit who can run ad-hoc SQL against production extracts. Separate dev/staging warehouses from production gold to prevent test queries from exhausting shared quotas.
Common failure modes
- Silent schema drift — new API field ignored for weeks; metric definitions drift from product reality.
- Double counting on retry — append-only loads without dedup keys inflate totals after Airflow retries.
- Timezone bugs — mixing UTC source timestamps with local-date partitions shifts daily aggregates.
- Full refresh timeouts — rebuilding a terabyte model nightly; switch to incremental or widen the schedule.
- Orphaned dependencies — gold table refreshes before silver finishes; use orchestrator sensors or dbt
ref()ordering. - Production extract load — heavy JDBC pull locks rows or spikes IOPS on the primary; use replicas or CDC.
Production checklist
- Define freshness SLAs per dataset (batch nightly vs stream 60s) and measure actual lag.
- Choose ETL vs ELT based on where transforms run and who owns SQL vs Python.
- Implement incremental extraction with watermarks and overlap windows — avoid full-table pulls at scale.
- Make every load idempotent via merge/upsert on natural or surrogate keys.
- Organize bronze / silver / gold layers with documented lineage and replay paths.
- Add data quality tests that block promotion on row-count anomalies and null spikes.
- Plan schema evolution — additive columns, dual-write renames, quarantine for breaking changes.
- Extract from replicas or CDC, alert on pipeline failure, and show staleness in BI when data is old.
Key takeaways
- ETL transforms before load; ELT loads raw first — cloud warehouses favor ELT with SQL/dbt, while ETL suits heavy preprocessing and strict egress.
- Batch is the default; stream when freshness demands it — hybrid CDC plus nightly reconciliation covers most real-world needs.
- Idempotent incremental loads prevent duplicate metrics — merge on keys, track watermarks, overlap windows for clock skew.
- Medallion layers enable safe replay — preserve raw bronze so gold can be rebuilt without re-hitting production.
- Quality gates beat silent corruption — fail pipelines and surface staleness rather than shipping wrong numbers to executives.
Related reading
- Change data capture (CDC) explained — log-based streaming from Postgres WAL and MySQL binlog
- Database replication explained — read replicas, lag, and safe extraction sources
- Event-driven architecture explained — domain events as pipeline inputs
- Transactional outbox pattern explained — reliable event publication from OLTP databases