Guide

Data warehouses and lakehouses explained

Your production Postgres or MySQL database is optimized for one thing: fast, correct transactions — insert an order, update inventory, charge a card. Analysts asking “what was revenue by region last quarter across three years of history?” will crush that same database. Data warehouses exist to answer analytical questions at scale. Data lakes store vast raw files cheaply. Lakehouses merge lake economics with warehouse reliability. This guide explains OLTP vs OLAP workloads, dimensional modeling with star schemas, major platforms (Snowflake, BigQuery, Delta Lake, Iceberg), how ETL and ELT pipelines feed them, query performance tactics, governance, and when to pick each architecture instead of querying your app database directly.

OLTP vs OLAP: two different database jobs

Online transaction processing (OLTP) systems handle high-concurrency reads and writes on narrow rows: “fetch user 42,” “deduct one seat from inventory.” Indexes target primary keys; queries touch few rows; latency is measured in milliseconds. Your app database is OLTP.

Online analytical processing (OLAP) systems run aggregations over millions or billions of rows: sums, counts, window functions, cohort retention curves. Queries scan wide time ranges and many columns. Throughput and cost per terabyte scanned matter more than single-digit-millisecond point lookups.

Running heavy OLAP on OLTP creates predictable pain: table scans lock rows, replicas lag, and dashboards time out. The standard fix is physically separate analytical storage — loaded by pipelines from change data capture or nightly batch jobs — so product traffic and BI queries never fight for the same disks.

What a data warehouse is

A data warehouse is a centralized store of curated, query-ready data optimized for analytics. Unlike a normalized OLTP schema designed to prevent update anomalies (see database normalization), warehouse tables are often denormalized on purpose so common reports need fewer joins.

Warehouses are typically column-oriented or hybrid columnar: Snowflake, Google BigQuery, Amazon Redshift, Azure Synapse, and ClickHouse all store data by column so aggregations like SUM(revenue) read only the columns they need. Managed warehouses handle storage scaling, automatic clustering, and separation of compute from storage — you spin up a bigger warehouse for a heavy report, then scale down.

Data lands through ETL or ELT jobs: raw extracts become cleaned fact and dimension tables, tested with row-count and uniqueness checks, then exposed to BI tools (Looker, Tableau, Metabase) or notebooks.

Dimensional modeling: facts, dimensions, and star schemas

Ralph Kimball’s dimensional modeling is the dominant warehouse design pattern. Two table types:

Fact tables — numeric measurements at a grain you can sum: one row per order line, per page view, per ad impression. Columns are metrics (revenue, quantity, session_duration) plus foreign keys to dimensions.
Dimension tables — descriptive attributes: dim_customer (name, segment, signup date), dim_product (category, brand), dim_date (fiscal quarter, holiday flag).

A star schema places one fact table at the center with dimensions radiating outward — simple joins, fast for BI tools. A snowflake schema normalizes dimensions further (category split into sub-tables); saves storage but adds join depth. Most teams start with stars and snowflake only when dimension rows explode in size.

Slowly changing dimensions (SCD) track history when attributes change. Type 1 overwrites (customer moves city — old value lost). Type 2 inserts a new row with effective dates (history preserved for “revenue by region as-of Q1”). Type 3 keeps limited prior values in extra columns. Pick SCD type per dimension based on whether reports must be restated historically.

Data lakes: cheap storage, schema on read

A data lake stores files — Parquet, ORC, JSON, CSV, images, logs — in object storage (S3, GCS, Azure Blob) without forcing upfront schema. Data engineers land raw extracts in a bronze zone, clean them into silver, and publish curated gold tables for consumers. That medallion layering appears in Databricks and many in-house stacks.

Lakes excel when:

Volume is huge (telemetry, clickstreams, IoT) and you cannot afford warehouse ingest pricing on every byte.
Schemas evolve frequently or differ by source — schema-on-read defers commitment.
Data science needs raw features, not just curated aggregates.

The classic lake downside is the data swamp: petabytes of files nobody trusts, no ACID transactions, concurrent writers corrupting Parquet folders, and “which file is the truth?” arguments. That gap motivated the lakehouse.

Lakehouses: warehouse guarantees on lake storage

A lakehouse adds a transactional metadata layer on top of object storage so tables get warehouse-like features: ACID commits, time travel, schema enforcement, and concurrent reads while writes proceed. Open table formats include Delta Lake (Databricks), Apache Iceberg (Netflix, Apple, many vendors), and Apache Hudi.

Iceberg and Delta store a manifest of data files plus snapshot log. Queries prune files with partition and column statistics; compaction merges small files; MERGE statements handle upserts for CDC-fed dimensions. You can query the same table from Spark, Trino, DuckDB, or Snowflake external tables — reducing lock-in versus proprietary warehouse storage.

Practically: startups often begin on BigQuery or Snowflake (managed, fast time-to-dashboard). As data volume and ML workloads grow, teams add a lakehouse tier for raw retention and training features while keeping gold aggregates in the warehouse — or consolidate on one lakehouse if vendor surface area is acceptable.

Query performance: partitioning, clustering, and pruning

Analytical cost scales with bytes scanned, not row count alone. Three levers dominate:

Partitioning — physically group files by a high-cardinality filter column, usually date (event_date). A query for last 7 days skips years of history. Over-partitioning (millions of tiny folders) hurts listing performance — aim for partitions roughly 100 MB–1 GB each.
Clustering / sorting — within partitions, sort or Z-order by columns used together in WHERE and JOIN (customer_id, product_id). Snowflake auto-clustering and BigQuery clustering keys serve the same goal.
Materialized views and aggregates — precompute daily rollups analysts hit constantly; refresh incrementally from CDC streams.

Always inspect query plans. In warehouses, look at partitions scanned vs total; in Spark, check file-skipping metrics. A full-table scan on a billion-row fact table is a budget incident, not a surprise — see SQL query optimization for predicate and join tactics that transfer directly to OLAP engines.

Governance, catalog, and trust

Analytics fails when teams do not agree on definitions. A metrics layer (dbt semantic models, LookML, MetricFlow) encodes “active user” and “net revenue” once. A data catalog (DataHub, Alation, Glue) indexes tables, owners, lineage from pipeline to dashboard, and PII tags.

Row- and column-level security restrict sensitive fields (email, SSN) by role. Audit logs record who queried what — required for SOC 2 and GDPR. Lineage also answers incident questions: “Dashboard X is wrong — which upstream CDC job broke?”

Treat the warehouse as a product with SLAs: freshness (“orders fact updated within 15 minutes”), completeness (row counts vs source), and documented deprecation windows when renaming columns.

Warehouse vs lake vs lakehouse: decision guide

Need	Best fit
Fast BI dashboards, small team, managed ops	Cloud warehouse (BigQuery, Snowflake, Redshift)
Massive raw logs, ML feature store, lowest $/TB storage	Data lake (S3 + Spark) with medallion layers
ACID upserts on lake files, open formats, multi-engine access	Lakehouse (Delta, Iceberg, Hudi)
Sub-second lookups on single keys for apps	OLTP database or Redis — not a warehouse
Real-time metrics on event streams	Stream processor + OLAP (Flink, Materialize) feeding warehouse

Many mature organizations run both: lakehouse for raw and ML, warehouse gold layer for finance and exec dashboards — linked by shared dimensions and a single metrics catalog.

Common anti-patterns

Reporting directly on production OLTP — works until it does not; schedule pipelines instead.
One giant wide table with no grain documented — double-counting revenue when joins fan out.
Lake without ownership — anonymous Parquet dumps become unmaintainable swamps.
Ignoring late-arriving facts — orders backdated to prior months need idempotent reloads or partition merges.
SELECT * in scheduled jobs — scans every column; project only needed fields.
Skipping test queries on staging — a bad join in dbt silently doubles KPIs in production dashboards.

Production checklist

Document grain for every fact table (one row per what?).
Define SCD strategy per dimension before historical reports ship.
Partition large tables by a filter column analysts always use (usually date).
Set freshness SLAs and alert when pipeline lag exceeds them.
Run row-count and key-uniqueness tests on every load.
Register tables in a catalog with owner, description, and PII classification.
Centralize business metrics in a semantic layer — not per-dashboard SQL.
Review top 10 expensive queries monthly; add clustering or rollups.
Keep bronze raw data for reprocessing when business rules change.
Plan retention and legal holds — lakes grow forever unless you prune.

Key takeaways

Warehouses serve curated OLAP; lakes store cheap raw files; lakehouses add ACID and governance on lake storage.
Star schemas with fact and dimension tables remain the default BI-friendly warehouse design.
Separate analytical storage from OLTP — load via ETL/ELT or CDC, never heavy scan on prod.
Cost and speed hinge on partition pruning, clustering, and avoiding full-table scans.
Trust requires catalogs, metric definitions, pipeline tests, and documented table grain.