Guide
dbt (Data Build Tool) explained
Your data warehouse has two hundred tables. Finance queries a spreadsheet export because
nobody trusts the orders_v3_final_FINAL view. Analysts copy-paste SQL into
Slack; each "fix" forks another undocumented branch. dbt (data build tool)
is the analytics-engineering layer that turns warehouse SQL into version-controlled,
tested, documented pipelines: each transformation is a model file in git, compiled
to dialect-specific SQL, executed in dependency order, and validated with assertions before
BI tools ever see the data. dbt does not move bytes from SaaS APIs — ingestion stays in
Fivetran, Airbyte, or custom loaders — and it is not a distributed compute engine like
Spark. It transforms data already inside Snowflake, BigQuery, Postgres, Databricks,
or Redshift using the warehouse's own query engine. This guide covers project anatomy,
staging-to-mart layering, materializations and incremental models, Jinja and macros, tests
and documentation lineage, orchestration with Airflow, a Harbor Supply revenue mart worked
example, a tooling decision table, common pitfalls, and a practitioner checklist alongside
our
ETL and ELT fundamentals,
Apache Airflow guide, and
data warehouse overview.
What dbt transforms (and what it does not)
Modern analytics stacks split into three layers. Ingestion (EL) copies raw tables from operational databases, SaaS tools, and event streams into a warehouse or lakehouse. Transformation (the T in ELT) cleans, joins, and aggregates those raw tables into business-ready datasets. Consumption (Looker, Tableau, Hex, reverse ETL) reads the curated layer. dbt owns the middle step exclusively — and does it by treating SQL as software.
That scope boundary prevents expensive confusion. You do not replace
Apache Spark with dbt when you
need petabyte-scale Python UDFs or complex streaming windows — Spark computes; dbt compiles
SELECT statements. You do not replace
Airflow with dbt either;
Airflow schedules dbt run as one task inside a broader DAG that might also
wait for landing files, trigger ML training, or notify Slack. The sweet spot is
analytics engineering: dimensional models, metric definitions, slowly
changing dimensions, and the SQL glue between raw landing zones and executive dashboards.
Core project objects
- Models —
.sqlfiles that become views or tables viadbt run; the heart of the project. - Sources — declared upstream tables (often raw ingestion schemas) with freshness checks and column documentation.
- Seeds — CSV files versioned in git for small reference data (country codes, product taxonomy).
- Snapshots — type-2 slowly changing dimension captures using timestamp or check strategies.
- Tests — schema tests (
unique,not_null,relationships) and custom SQL assertions. - Macros — reusable Jinja/SQL functions (date spines, pivot helpers, surrogate key generators).
- Exposures — document downstream dashboards and ML models that depend on specific models.
Staging, intermediate, and mart layers
Mature dbt projects adopt a medallion-style layering adapted for SQL
warehouses. Raw ingestion tables land in a raw or landing
schema — wide, vendor-shaped, occasionally duplicated. dbt rarely queries raw directly
in marts; it builds a ladder of models:
Staging (stg_)
One staging model per source table. Responsibilities: rename cryptic columns to snake_case,
cast types, parse JSON blobs, filter test rows, deduplicate on primary key. Staging should
be thin — no business joins yet. Example:
stg_shopify__orders reads {{ source('shopify', 'orders') }} and
outputs consistent column names for every downstream consumer.
Intermediate (int_)
Reusable logic blocks: order line items enriched with product attributes, sessionized clickstreams, unioned multi-region replicas. Intermediate models prevent copy-pasting the same 40-line join into five marts. They are not always exposed to BI — think private library functions implemented as SQL views.
Marts (fct_, dim_)
Business-facing facts and dimensions aligned to grain. fct_orders might grain
one row per order with revenue, discount, and fulfillment metrics.
dim_customers holds slowly changing attributes. Marts follow
SQL star-schema or
wide-table conventions your organization chooses — dbt enforces naming and tests, not a
specific modeling religion.
Folder structure mirrors layers: models/staging/,
models/intermediate/, models/marts/. Schema names in the warehouse
often map via dbt_project.yml config — e.g. staging models land in
analytics_staging, marts in analytics — keeping permissions clean
for BI read roles.
Materializations: view, table, incremental, ephemeral
Each model declares how dbt persists SQL results via the materialized config:
- view (default) — lightweight; recomputed on every query through the view. Good for staging and thin transforms on small/medium data.
- table — full rebuild each
dbt run; drops and recreates or uses CREATE TABLE AS SELECT. Simple but expensive on billion-row facts. - incremental — appends or merges only new/changed rows since last run; essential for large fact tables with append-only or upsert patterns.
- ephemeral — inlined as CTEs into downstream models; never stored; reduces clutter when intermediate results are single-use.
Incremental strategies
Incremental models need an explicit filter on "what changed." Common patterns:
{{
config(
materialized='incremental',
unique_key='order_id',
incremental_strategy='merge'
)
}}
select *
from {{ ref('int_orders_enriched') }}
{% if is_incremental() %}
where updated_at > (select max(updated_at) from {{ this }})
{% endif %}
Snowflake and BigQuery support merge on a unique_key; Postgres
often uses delete+insert or append with deduped staging. Choose
strategy based on whether late-arriving updates must overwrite prior rows. Document the
watermark column in model YAML so the next engineer knows why a full-refresh is required
after schema migrations.
Run dbt run --full-refresh deliberately when logic changes invalidate
historical rows — not on every deploy. CI environments typically build incremental models
as tables from scratch on sample data to keep pipeline minutes low.
Jinja, ref(), and macros
dbt models are SQL files with Jinja templating. The two most important functions:
{{ ref('model_name') }}— references another dbt model; dbt builds the dependency graph and runs upstream models first. Never hard-codeanalytics.stg_orderswhenrefgives you environment-aware schema resolution.{{ source('source_name', 'table_name') }}— references declared raw sources; enables freshness tests and lineage from ingestion boundary.
Macros encapsulate repeated logic. A generate_surrogate_key
macro might hash multiple nullable columns into a stable ID. A cents_to_dollars
macro standardizes currency conversion. Macros live in macros/ and are
unit-testable via dbt run-operation in advanced setups.
Variables (var('start_date')) parameterize runs — useful for
backfills orchestrated by Airflow passing --vars on the CLI. Packages
(dbt_utils, dbt_expectations) import community macros via
packages.yml, similar to npm dependencies — pin versions for reproducible builds.
Tests, documentation, and lineage
Untested transforms are liabilities. dbt ships schema tests declared in YAML alongside models:
models:
- name: fct_orders
columns:
- name: order_id
tests:
- unique
- not_null
- name: customer_id
tests:
- relationships:
to: ref('dim_customers')
field: customer_id
dbt test runs these after builds (or in CI on pull requests). Add
custom data tests — SQL files returning failing rows — for business rules:
"refund amount never exceeds order total," "status transitions are valid," "daily revenue
within 5% of source aggregate."
Documentation blocks in YAML feed dbt docs generate, producing
a static site with column descriptions, model owners, and an interactive lineage DAG —
which model feeds which, upstream to raw sources, downstream to exposures. That graph is
often the first artifact executives understand when auditing data quality. Tag models
(tags: ['finance', 'daily']) to run subsets: dbt run --select tag:finance.
Source freshness alerts when ingestion stalls:
loaded_at_field: _fivetran_synced with warn_after: 6h catches
broken connectors before morning dashboards go blank.
dbt Core vs Cloud, adapters, and orchestration
dbt Core is open-source CLI software you run locally, in CI, or on a scheduler. dbt Cloud adds hosted IDE, job scheduling, PR environments, and observability — valuable for teams without appetite to wire GitHub Actions themselves. Both compile the same project; choose based on ops burden, not SQL capability.
Adapters translate dbt's generic SQL to warehouse dialects: Snowflake,
BigQuery, Redshift, Databricks, Postgres, DuckDB, and more. Adapter choice is dictated by
where your data lives — the dbt project structure stays portable. Multi-warehouse orgs
sometimes maintain separate target profiles (dev, prod,
ci) in profiles.yml with role-based access: analysts write in
dev schemas; production runs use a service account with write access only to
analytics_* schemas.
Production schedules almost always invoke dbt from an orchestrator:
dbt run && dbt test as an Airflow KubernetesPodOperator task, a GitHub Action
on merge to main, or a dbt Cloud job webhook after Fivetran sync completes. The orchestrator
handles retries, SLAs, and cross-system dependencies; dbt handles transform correctness.
Worked example: Harbor Supply revenue mart
Harbor Supply sells industrial fasteners online. Raw Shopify and Stripe tables land in
Snowflake schema raw via Fivetran. The analytics team maintains dbt project
harbor_analytics with this nightly path (triggered by Airflow after ingestion
green):
- Sources declared —
shopify.orders,shopify.order_line_items,stripe.chargeswith freshness on_fivetran_synced. - Staging —
stg_shopify__orderscasts timestamps to UTC, normalizes currency codes, filterstest = false.stg_stripe__chargesmaps Stripe cents to decimal dollars. - Intermediate —
int_order_paymentsjoins orders to charges on payment intent ID; flags partial captures and failed refunds. - Mart —
fct_orders(incremental, merge onorder_id) grains one row per order with gross revenue, net revenue, discount rate, and fulfillment SLA hours.dim_customerssnapshot tracks email and segment changes. - Tests —
order_idunique/not null; relationship todim_customers; custom test ensuring sum of line items matches order total within $0.01. - Exposure — documents Looker dashboard "Executive Revenue" depending on
fct_orders.
Runtime: incremental fct_orders processes ~40k changed rows nightly in 90
seconds; full refresh takes 22 minutes and runs only on schema migrations. Failed tests
block the Airflow downstream task that refreshes Looker PDTs — finance never sees untested
numbers. Developers use dbt run --select +fct_orders to rebuild a model and
all upstream dependencies during feature work.
dbt vs Spark vs stored procedures vs hand-rolled SQL decision table
| Need | Best fit | Why |
|---|---|---|
| Version-controlled SQL transforms in a cloud warehouse | dbt | Tests, docs, lineage, and modular layering out of the box |
| Petabyte-scale joins, Python ML, streaming windows | Apache Spark / Flink | Compute engine beyond warehouse SQL; feed results into dbt staging |
| Legacy team with decades of T-SQL stored procs | Incremental migration to dbt | Wrap procs gradually; dbt replaces untested procedural sprawl over quarters |
| One analyst, three tables, no CI | Hand-rolled views + spreadsheet tests | dbt overhead pays off around model count ~15 and second maintainer |
| Real-time metrics on event streams | Kafka + Flink / materialized stream tables | dbt is batch; stream results may land as sources dbt reads hourly |
| Cross-system job scheduling and retries | Airflow / Dagster / dbt Cloud jobs | dbt executes transforms; orchestrator owns clocks and dependencies |
Common pitfalls
- Skipping staging — marts query raw directly; vendor schema changes
break five dashboards simultaneously. Always isolate ingestion shape in
stg_models. - Over-using tables — materializing every model as a table balloons storage and build time; default to views until query performance or downstream isolation demands persistence.
- Incremental models without unique keys — append-only duplicates rows on rerun; define merge keys and late-arrival policies explicitly.
- Circular
ref()dependencies — dbt DAG must be acyclic; use intermediate models instead of mutual references. - Tests only in production — run
dbt testin CI on every PR against a slim warehouse clone or sampled data. - Undocumented models — empty YAML descriptions rot trust; treat docs like code review requirements.
- Running full refreshes casually — blows SLAs and duplicates merge history; require explicit approval for prod full-refresh.
- Macro soup — ten layers of Jinja indirection obscure SQL; macros help repetition, not one-off logic.
Practitioner checklist
- Structure models staging → intermediate → marts with consistent prefixes.
- Declare all raw tables as
sourceswith freshness where ingestion allows. - Use
ref()andsource()— never hard-code production schema names in marts. - Choose materialization per model: view until proven slow, incremental for large facts.
- Test grain keys (
unique,not_null) and critical relationships on every mart. - Add custom SQL tests for business invariants money and compliance care about.
- Generate and publish
dbt docs; link exposures to BI dashboards. - Run
dbt build(run + test) in CI before merging to main. - Orchestrate prod runs after ingestion completes; alert on test failures before BI refresh.
- Pin package versions; review adapter upgrades in staging before prod.
Key takeaways
- dbt is the transformation layer in ELT — SQL models in git, not one-off warehouse scripts.
- Layer staging, intermediate, and mart models so ingestion changes do not cascade silently.
- Materializations and incremental strategies trade build time, storage, and correctness.
- Tests and documentation lineage are first-class — they are how teams trust metrics.
- Pair dbt with an orchestrator for scheduling; pair ingestion tools for the raw landing zone.
Related reading
- ETL and ELT data pipelines explained — where dbt sits in the modern stack
- Apache Airflow explained — scheduling dbt runs in production DAGs
- Data warehouses and lakehouses explained — platforms dbt transforms against
- SQL query optimization explained — tuning the queries dbt compiles