Guide

DuckDB explained

Your data team exports three months of order logs to Parquet on S3. Marketing wants cohort retention by channel; finance wants margin by SKU. Spinning up a Spark cluster for a one-off question is overkill, loading everything into PostgreSQL will choke on billion-row aggregations, and SQLite is built for row-oriented OLTP, not scanning gigabytes of history. DuckDB is an embedded OLAP (online analytical processing) database that runs in-process like SQLite but executes analytical SQL with a columnar, vectorized engine — often orders of magnitude faster on aggregations, joins across wide fact tables, and queries that read Parquet or CSV files directly without an import step. This guide covers DuckDB's architecture, when it beats client-server warehouses and embedded OLTP engines, file-backed and in-memory modes, Python and notebook integration, extension ecosystem, a Harbor Supply order-analytics worked example, a database decision table, common pitfalls, and a practitioner checklist alongside our data warehouse overview and ETL/ELT pipeline guide.

What DuckDB is — and what it is not

DuckDB is a self-contained analytical SQL engine you link into your application, Jupyter notebook, or CLI. Like SQLite, there is no separate server process: you open a connection, run SQL, and close. Unlike SQLite, DuckDB stores data column-wise and processes queries in vectorized batches (typically 1,024 or 2,048 values at a time), which is how modern OLAP systems achieve scan throughput on analytical workloads.

DuckDB is not a replacement for PostgreSQL as your production OLTP database. It excels at read-heavy analytics — aggregations, window functions, star-schema joins, and ad hoc exploration — not at thousands of concurrent small writes from a web API. It is also not a distributed query engine: a single DuckDB process uses one machine's CPU and RAM. For petabyte-scale lakehouse queries across a cluster, you still need Spark, Trino, or a cloud warehouse — but for laptop-to-single-server analytics on gigabytes to low terabytes, DuckDB often wins on simplicity and latency.

OLTP vs OLAP in one sentence

OLTP (SQLite, Postgres for apps) optimizes many small point lookups and inserts: "fetch order #48291." OLAP (DuckDB, Snowflake, BigQuery) optimizes scanning and summarizing huge datasets: "average order value by cohort for the last 90 days." DuckDB sits firmly in the OLAP camp while remaining embeddable.

Columnar storage and vectorized execution

Row-oriented databases store each record contiguously: row 1's id, name, and timestamp sit together on disk. Analytical queries that sum one column still read every column in the row — wasted I/O. Columnar storage groups all values of a single column together, so SELECT SUM(revenue) touches only the revenue column's pages.

Vectorized execution means operators (filter, aggregate, join) work on arrays of values using CPU-friendly loops and SIMD instructions rather than interpreting one tuple at a time. Combined with late materialization (carry row identifiers through filters, fetch other columns only for surviving rows), DuckDB routinely outperforms row stores by 10–100× on typical BI queries.

Compression and zone maps

Columnar layouts compress well — repeated category codes, sorted timestamps, and low-cardinality enums shrink dramatically with dictionary and run-length encoding. DuckDB also maintains min/max statistics per chunk (zone maps), skipping entire blocks when a WHERE order_date > '2026-01-01' predicate cannot match. These are the same ideas behind Parquet file statistics and warehouse micro-partitions, implemented in an in-process engine.

Querying files in place: Parquet, CSV, JSON

DuckDB's killer feature for data engineers is direct file querying. You do not need a LOAD or COPY step before analysis:

SELECT region, SUM(amount) AS revenue
FROM read_parquet('s3://harbor-exports/orders/2026-03/*.parquet')
WHERE status = 'shipped'
GROUP BY region
ORDER BY revenue DESC;

The engine pushes filters into the Parquet reader, reads only relevant row groups, and streams results. CSV and JSON readers exist with auto-detection of headers and types; for repeated access, materializing into a native DuckDB table or exporting back to Parquet is faster than reparsing CSV every run.

Remote object storage

The httpfs and aws extensions let DuckDB read from S3, GCS, and HTTP URLs with parallel range requests. Credentials follow the usual environment-variable chain (AWS_ACCESS_KEY_ID, instance profiles). For large lakes, partition pruning by path (year=2026/month=03/) combined with Parquet statistics minimizes bytes transferred — critical when egress is billed per gigabyte.

Persistent vs in-memory databases

:memory: is the default for ephemeral notebook sessions. duckdb my_analytics.duckdb creates a single-file database similar to SQLite, persisting tables and indexes across restarts. Many teams use in-memory for exploration and write results to Parquet on disk as the durable artifact — a pattern that pairs well with ELT pipelines where the lake is source of truth.

SQL features that matter for analytics

DuckDB speaks ANSI SQL with a rich analytical surface:

Window functions — running totals, LAG/LEAD, cohort ranks, and NTILE deciles without self-joins.
Advanced aggregations — LIST, ARRAY_AGG, QUANTILE, APPROX_COUNT_DISTINCT for funnel and distribution analysis.
ASOF joins — match each order to the exchange rate effective at order time without a messy inequality join.
Pivot and unpivot — reshape wide marketing spreadsheets into tidy tables in one statement.
Nested types — STRUCT, LIST, and MAP columns for semi-structured event payloads; query with dot notation and UNNEST.

The query planner includes a cost-based optimizer with join reordering and parallel execution across CPU cores. For SQL syntax basics and optimization habits that transfer directly, see those guides — DuckDB rewards the same index-aware thinking when you do persist tables.

Python, R, and the notebook workflow

The duckdb Python package is the primary interface for data scientists. Two patterns dominate:

import duckdb

# Query a pandas DataFrame without copying data
df = duckdb.sql("SELECT category, AVG(price) FROM df GROUP BY 1").df()

# Register a PyArrow table for zero-copy scans
duckdb.register("events", arrow_table)
duckdb.sql("SELECT COUNT(*) FROM events WHERE event = 'purchase'")

The rel API builds lazy query graphs; .df(), .arrow(), and .pl() export to pandas, PyArrow, or Polars. In Jupyter, %sql magics via duckdb-engine let analysts write SQL cells that reference Python variables. R users get similar bindings through duckdb::duckdb() and DBI.

Integration with the modern data stack

DuckDB plugs into dbt via the dbt-duckdb adapter for local model development before promoting to Snowflake or BigQuery. Tools like MotherDuck add a hosted coordination layer for shared DuckDB databases across a team — useful when "works on my laptop" needs a managed sync path without jumping straight to a full cloud warehouse.

Worked example: Harbor Supply order analytics

Harbor Supply exports nightly order Parquet partitions to s3://harbor-analytics/orders/ with schema (order_id, customer_id, sku, quantity, unit_price, channel, order_ts). The BI team needs weekly revenue by channel and a repeat-purchase rate for Q1 without provisioning a warehouse.

INSTALL httpfs; LOAD httpfs;
SET s3_region = 'us-east-1';

CREATE TABLE orders AS
  SELECT * FROM read_parquet('s3://harbor-analytics/orders/2026-0[1-3]/**/*.parquet');

-- Weekly revenue by channel
SELECT date_trunc('week', order_ts) AS week,
       channel,
       SUM(quantity * unit_price) AS revenue
FROM orders
GROUP BY 1, 2
ORDER BY 1, 3 DESC;

-- Repeat purchase rate: customers with 2+ orders in Q1
WITH customer_counts AS (
  SELECT customer_id, COUNT(*) AS n
  FROM orders
  GROUP BY 1
)
SELECT AVG(CASE WHEN n >= 2 THEN 1.0 ELSE 0.0 END) AS repeat_rate
FROM customer_counts;

The entire analysis runs on a 16 GB laptop in under a minute for 40 million rows because DuckDB parallelizes the scan and aggregation. Results export to CSV for the finance team's Excel models. When the dataset grows past single-node RAM, Harbor promotes the dbt models to their Snowflake warehouse — same SQL dialect family, different scale tier.

Decision table: DuckDB vs alternatives

Need	Reach for	Why
Ad hoc analytics on Parquet/CSV locally	DuckDB	In-process, no cluster, file queries without import
Mobile app / edge OLTP, single-writer	SQLite	Row-oriented, battle-tested embedded OLTP, tiny footprint
Production web API with concurrent writes	PostgreSQL	MVCC, replication, mature ecosystem, OLTP-tuned
Petabyte lakehouse, multi-tenant BI at scale	Snowflake / BigQuery / Spark	Distributed storage and compute, governance, concurrency
In-memory caching and session store	Redis	Sub-millisecond key access, not an SQL analytics engine
Stream processing with replayable log	Kafka + Flink/Spark	Continuous ingestion; DuckDB is batch/interactive, not a stream processor

Common pitfalls

Using DuckDB as your app's primary database. High-concurrency small writes and strict durability requirements belong in Postgres. DuckDB handles concurrent reads well but is not designed for thousands of OLTP connections.
Out-of-memory on huge joins. DuckDB uses disk spill for large intermediates, but a cartesian join on two 100 GB tables will still hurt. Filter early, pick sensible join keys, and monitor max_memory settings.
Re-scanning CSV every run. CSV parsing is slow and type-inference is fragile. Convert to Parquet once; query Parquet thereafter.
Ignoring file layout. Unpartitioned billion-row Parquet with tiny row groups defeats predicate pushdown. Partition by date, sort by filter columns, target 128 MB–512 MB row groups.
Assuming full Snowflake compatibility. Most analytical SQL transfers; proprietary UDFs, governance features, and elastic scale do not. Test dbt models locally before promotion.
Storing secrets in SQL scripts. Use environment variables or IAM roles for S3 credentials — same hygiene as secrets management elsewhere.

Practitioner checklist

Classify the workload: OLAP exploration and reporting → DuckDB is a strong candidate; OLTP serving → use Postgres or SQLite.
Prefer Parquet as the interchange format; partition by time or tenant.
Enable httpfs (or cloud extension) for remote files; verify credentials and region.
Start with read_parquet / read_csv_auto; persist to a .duckdb file only when reuse justifies storage.
Use EXPLAIN ANALYZE to confirm filters push down and joins parallelize.
Export results to Parquet or Arrow for downstream dashboards; do not treat the notebook as the system of record.
Plan the graduation path: when data or concurrency exceeds one machine, promote models to a distributed warehouse without rewriting all SQL.