Guide

dbt (Data Build Tool) explained

Your data warehouse has two hundred tables. Finance queries a spreadsheet export because nobody trusts the orders_v3_final_FINAL view. Analysts copy-paste SQL into Slack; each "fix" forks another undocumented branch. dbt (data build tool) is the analytics-engineering layer that turns warehouse SQL into version-controlled, tested, documented pipelines: each transformation is a model file in git, compiled to dialect-specific SQL, executed in dependency order, and validated with assertions before BI tools ever see the data. dbt does not move bytes from SaaS APIs — ingestion stays in Fivetran, Airbyte, or custom loaders — and it is not a distributed compute engine like Spark. It transforms data already inside Snowflake, BigQuery, Postgres, Databricks, or Redshift using the warehouse's own query engine. This guide covers project anatomy, staging-to-mart layering, materializations and incremental models, Jinja and macros, tests and documentation lineage, orchestration with Airflow, a Harbor Supply revenue mart worked example, a tooling decision table, common pitfalls, and a practitioner checklist alongside our ETL and ELT fundamentals, Apache Airflow guide, and data warehouse overview.

What dbt transforms (and what it does not)

Modern analytics stacks split into three layers. Ingestion (EL) copies raw tables from operational databases, SaaS tools, and event streams into a warehouse or lakehouse. Transformation (the T in ELT) cleans, joins, and aggregates those raw tables into business-ready datasets. Consumption (Looker, Tableau, Hex, reverse ETL) reads the curated layer. dbt owns the middle step exclusively — and does it by treating SQL as software.

That scope boundary prevents expensive confusion. You do not replace Apache Spark with dbt when you need petabyte-scale Python UDFs or complex streaming windows — Spark computes; dbt compiles SELECT statements. You do not replace Airflow with dbt either; Airflow schedules dbt run as one task inside a broader DAG that might also wait for landing files, trigger ML training, or notify Slack. The sweet spot is analytics engineering: dimensional models, metric definitions, slowly changing dimensions, and the SQL glue between raw landing zones and executive dashboards.

Core project objects

Models — .sql files that become views or tables via dbt run; the heart of the project.
Sources — declared upstream tables (often raw ingestion schemas) with freshness checks and column documentation.
Seeds — CSV files versioned in git for small reference data (country codes, product taxonomy).
Snapshots — type-2 slowly changing dimension captures using timestamp or check strategies.
Tests — schema tests (unique, not_null, relationships) and custom SQL assertions.
Macros — reusable Jinja/SQL functions (date spines, pivot helpers, surrogate key generators).
Exposures — document downstream dashboards and ML models that depend on specific models.

Staging, intermediate, and mart layers

Mature dbt projects adopt a medallion-style layering adapted for SQL warehouses. Raw ingestion tables land in a raw or landing schema — wide, vendor-shaped, occasionally duplicated. dbt rarely queries raw directly in marts; it builds a ladder of models:

Staging (`stg_`)

One staging model per source table. Responsibilities: rename cryptic columns to snake_case, cast types, parse JSON blobs, filter test rows, deduplicate on primary key. Staging should be thin — no business joins yet. Example: stg_shopify__orders reads {{ source('shopify', 'orders') }} and outputs consistent column names for every downstream consumer.

Intermediate (`int_`)

Reusable logic blocks: order line items enriched with product attributes, sessionized clickstreams, unioned multi-region replicas. Intermediate models prevent copy-pasting the same 40-line join into five marts. They are not always exposed to BI — think private library functions implemented as SQL views.

Marts (`fct_`, `dim_`)

Business-facing facts and dimensions aligned to grain. fct_orders might grain one row per order with revenue, discount, and fulfillment metrics. dim_customers holds slowly changing attributes. Marts follow SQL star-schema or wide-table conventions your organization chooses — dbt enforces naming and tests, not a specific modeling religion.

Folder structure mirrors layers: models/staging/, models/intermediate/, models/marts/. Schema names in the warehouse often map via dbt_project.yml config — e.g. staging models land in analytics_staging, marts in analytics — keeping permissions clean for BI read roles.

Materializations: view, table, incremental, ephemeral

Each model declares how dbt persists SQL results via the materialized config:

view (default) — lightweight; recomputed on every query through the view. Good for staging and thin transforms on small/medium data.
table — full rebuild each dbt run; drops and recreates or uses CREATE TABLE AS SELECT. Simple but expensive on billion-row facts.
incremental — appends or merges only new/changed rows since last run; essential for large fact tables with append-only or upsert patterns.
ephemeral — inlined as CTEs into downstream models; never stored; reduces clutter when intermediate results are single-use.

Incremental strategies

Incremental models need an explicit filter on "what changed." Common patterns:

{{
  config(
    materialized='incremental',
    unique_key='order_id',
    incremental_strategy='merge'
  )
}}

select *
from {{ ref('int_orders_enriched') }}
{% if is_incremental() %}
  where updated_at > (select max(updated_at) from {{ this }})
{% endif %}

Snowflake and BigQuery support merge on a unique_key; Postgres often uses delete+insert or append with deduped staging. Choose strategy based on whether late-arriving updates must overwrite prior rows. Document the watermark column in model YAML so the next engineer knows why a full-refresh is required after schema migrations.

Run dbt run --full-refresh deliberately when logic changes invalidate historical rows — not on every deploy. CI environments typically build incremental models as tables from scratch on sample data to keep pipeline minutes low.

Jinja, `ref()`, and macros

dbt models are SQL files with Jinja templating. The two most important functions:

{{ ref('model_name') }} — references another dbt model; dbt builds the dependency graph and runs upstream models first. Never hard-code analytics.stg_orders when ref gives you environment-aware schema resolution.
{{ source('source_name', 'table_name') }} — references declared raw sources; enables freshness tests and lineage from ingestion boundary.

Macros encapsulate repeated logic. A generate_surrogate_key macro might hash multiple nullable columns into a stable ID. A cents_to_dollars macro standardizes currency conversion. Macros live in macros/ and are unit-testable via dbt run-operation in advanced setups.

Variables (var('start_date')) parameterize runs — useful for backfills orchestrated by Airflow passing --vars on the CLI. Packages (dbt_utils, dbt_expectations) import community macros via packages.yml, similar to npm dependencies — pin versions for reproducible builds.

Tests, documentation, and lineage

Untested transforms are liabilities. dbt ships schema tests declared in YAML alongside models:

models:
  - name: fct_orders
    columns:
      - name: order_id
        tests:
          - unique
          - not_null
      - name: customer_id
        tests:
          - relationships:
              to: ref('dim_customers')
              field: customer_id

dbt test runs these after builds (or in CI on pull requests). Add custom data tests — SQL files returning failing rows — for business rules: "refund amount never exceeds order total," "status transitions are valid," "daily revenue within 5% of source aggregate."

Documentation blocks in YAML feed dbt docs generate, producing a static site with column descriptions, model owners, and an interactive lineage DAG — which model feeds which, upstream to raw sources, downstream to exposures. That graph is often the first artifact executives understand when auditing data quality. Tag models (tags: ['finance', 'daily']) to run subsets: dbt run --select tag:finance.

Source freshness alerts when ingestion stalls: loaded_at_field: _fivetran_synced with warn_after: 6h catches broken connectors before morning dashboards go blank.

dbt Core vs Cloud, adapters, and orchestration

dbt Core is open-source CLI software you run locally, in CI, or on a scheduler. dbt Cloud adds hosted IDE, job scheduling, PR environments, and observability — valuable for teams without appetite to wire GitHub Actions themselves. Both compile the same project; choose based on ops burden, not SQL capability.

Adapters translate dbt's generic SQL to warehouse dialects: Snowflake, BigQuery, Redshift, Databricks, Postgres, DuckDB, and more. Adapter choice is dictated by where your data lives — the dbt project structure stays portable. Multi-warehouse orgs sometimes maintain separate target profiles (dev, prod, ci) in profiles.yml with role-based access: analysts write in dev schemas; production runs use a service account with write access only to analytics_* schemas.

Production schedules almost always invoke dbt from an orchestrator: dbt run && dbt test as an Airflow KubernetesPodOperator task, a GitHub Action on merge to main, or a dbt Cloud job webhook after Fivetran sync completes. The orchestrator handles retries, SLAs, and cross-system dependencies; dbt handles transform correctness.

Worked example: Harbor Supply revenue mart

Harbor Supply sells industrial fasteners online. Raw Shopify and Stripe tables land in Snowflake schema raw via Fivetran. The analytics team maintains dbt project harbor_analytics with this nightly path (triggered by Airflow after ingestion green):

Sources declared — shopify.orders, shopify.order_line_items, stripe.charges with freshness on _fivetran_synced.
Staging — stg_shopify__orders casts timestamps to UTC, normalizes currency codes, filters test = false. stg_stripe__charges maps Stripe cents to decimal dollars.
Intermediate — int_order_payments joins orders to charges on payment intent ID; flags partial captures and failed refunds.
Mart — fct_orders (incremental, merge on order_id) grains one row per order with gross revenue, net revenue, discount rate, and fulfillment SLA hours. dim_customers snapshot tracks email and segment changes.
Tests — order_id unique/not null; relationship to dim_customers; custom test ensuring sum of line items matches order total within $0.01.
Exposure — documents Looker dashboard "Executive Revenue" depending on fct_orders.

Runtime: incremental fct_orders processes ~40k changed rows nightly in 90 seconds; full refresh takes 22 minutes and runs only on schema migrations. Failed tests block the Airflow downstream task that refreshes Looker PDTs — finance never sees untested numbers. Developers use dbt run --select +fct_orders to rebuild a model and all upstream dependencies during feature work.

dbt vs Spark vs stored procedures vs hand-rolled SQL decision table

Need	Best fit	Why
Version-controlled SQL transforms in a cloud warehouse	dbt	Tests, docs, lineage, and modular layering out of the box
Petabyte-scale joins, Python ML, streaming windows	Apache Spark / Flink	Compute engine beyond warehouse SQL; feed results into dbt staging
Legacy team with decades of T-SQL stored procs	Incremental migration to dbt	Wrap procs gradually; dbt replaces untested procedural sprawl over quarters
One analyst, three tables, no CI	Hand-rolled views + spreadsheet tests	dbt overhead pays off around model count ~15 and second maintainer
Real-time metrics on event streams	Kafka + Flink / materialized stream tables	dbt is batch; stream results may land as sources dbt reads hourly
Cross-system job scheduling and retries	Airflow / Dagster / dbt Cloud jobs	dbt executes transforms; orchestrator owns clocks and dependencies

Common pitfalls

Skipping staging — marts query raw directly; vendor schema changes break five dashboards simultaneously. Always isolate ingestion shape in stg_ models.
Over-using tables — materializing every model as a table balloons storage and build time; default to views until query performance or downstream isolation demands persistence.
Incremental models without unique keys — append-only duplicates rows on rerun; define merge keys and late-arrival policies explicitly.
Circular ref() dependencies — dbt DAG must be acyclic; use intermediate models instead of mutual references.
Tests only in production — run dbt test in CI on every PR against a slim warehouse clone or sampled data.
Undocumented models — empty YAML descriptions rot trust; treat docs like code review requirements.
Running full refreshes casually — blows SLAs and duplicates merge history; require explicit approval for prod full-refresh.
Macro soup — ten layers of Jinja indirection obscure SQL; macros help repetition, not one-off logic.

Practitioner checklist

Structure models staging → intermediate → marts with consistent prefixes.
Declare all raw tables as sources with freshness where ingestion allows.
Use ref() and source() — never hard-code production schema names in marts.
Choose materialization per model: view until proven slow, incremental for large facts.
Test grain keys (unique, not_null) and critical relationships on every mart.
Add custom SQL tests for business invariants money and compliance care about.
Generate and publish dbt docs; link exposures to BI dashboards.
Run dbt build (run + test) in CI before merging to main.
Orchestrate prod runs after ingestion completes; alert on test failures before BI refresh.
Pin package versions; review adapter upgrades in staging before prod.

Key takeaways

dbt is the transformation layer in ELT — SQL models in git, not one-off warehouse scripts.
Layer staging, intermediate, and mart models so ingestion changes do not cascade silently.
Materializations and incremental strategies trade build time, storage, and correctness.
Tests and documentation lineage are first-class — they are how teams trust metrics.
Pair dbt with an orchestrator for scheduling; pair ingestion tools for the raw landing zone.