Guide
DuckDB explained
Your data team exports three months of order logs to Parquet on S3. Marketing wants cohort retention by channel; finance wants margin by SKU. Spinning up a Spark cluster for a one-off question is overkill, loading everything into PostgreSQL will choke on billion-row aggregations, and SQLite is built for row-oriented OLTP, not scanning gigabytes of history. DuckDB is an embedded OLAP (online analytical processing) database that runs in-process like SQLite but executes analytical SQL with a columnar, vectorized engine — often orders of magnitude faster on aggregations, joins across wide fact tables, and queries that read Parquet or CSV files directly without an import step. This guide covers DuckDB's architecture, when it beats client-server warehouses and embedded OLTP engines, file-backed and in-memory modes, Python and notebook integration, extension ecosystem, a Harbor Supply order-analytics worked example, a database decision table, common pitfalls, and a practitioner checklist alongside our data warehouse overview and ETL/ELT pipeline guide.
What DuckDB is — and what it is not
DuckDB is a self-contained analytical SQL engine you link into your application, Jupyter notebook, or CLI. Like SQLite, there is no separate server process: you open a connection, run SQL, and close. Unlike SQLite, DuckDB stores data column-wise and processes queries in vectorized batches (typically 1,024 or 2,048 values at a time), which is how modern OLAP systems achieve scan throughput on analytical workloads.
DuckDB is not a replacement for PostgreSQL as your production OLTP database. It excels at read-heavy analytics — aggregations, window functions, star-schema joins, and ad hoc exploration — not at thousands of concurrent small writes from a web API. It is also not a distributed query engine: a single DuckDB process uses one machine's CPU and RAM. For petabyte-scale lakehouse queries across a cluster, you still need Spark, Trino, or a cloud warehouse — but for laptop-to-single-server analytics on gigabytes to low terabytes, DuckDB often wins on simplicity and latency.
OLTP vs OLAP in one sentence
OLTP (SQLite, Postgres for apps) optimizes many small point lookups and inserts: "fetch order #48291." OLAP (DuckDB, Snowflake, BigQuery) optimizes scanning and summarizing huge datasets: "average order value by cohort for the last 90 days." DuckDB sits firmly in the OLAP camp while remaining embeddable.
Columnar storage and vectorized execution
Row-oriented databases store each record contiguously: row 1's id, name, and
timestamp sit together on disk. Analytical queries that sum one column still
read every column in the row — wasted I/O. Columnar storage groups all values
of a single column together, so SELECT SUM(revenue) touches only
the revenue column's pages.
Vectorized execution means operators (filter, aggregate, join) work on arrays of values using CPU-friendly loops and SIMD instructions rather than interpreting one tuple at a time. Combined with late materialization (carry row identifiers through filters, fetch other columns only for surviving rows), DuckDB routinely outperforms row stores by 10–100× on typical BI queries.
Compression and zone maps
Columnar layouts compress well — repeated category codes, sorted timestamps,
and low-cardinality enums shrink dramatically with dictionary and run-length
encoding. DuckDB also maintains min/max statistics per chunk
(zone maps), skipping entire blocks when a WHERE order_date > '2026-01-01'
predicate cannot match. These are the same ideas behind Parquet file statistics
and warehouse micro-partitions, implemented in an in-process engine.
Querying files in place: Parquet, CSV, JSON
DuckDB's killer feature for data engineers is direct file querying.
You do not need a LOAD or COPY step before analysis:
SELECT region, SUM(amount) AS revenue
FROM read_parquet('s3://harbor-exports/orders/2026-03/*.parquet')
WHERE status = 'shipped'
GROUP BY region
ORDER BY revenue DESC;
The engine pushes filters into the Parquet reader, reads only relevant row groups, and streams results. CSV and JSON readers exist with auto-detection of headers and types; for repeated access, materializing into a native DuckDB table or exporting back to Parquet is faster than reparsing CSV every run.
Remote object storage
The httpfs and aws extensions let DuckDB read from
S3, GCS, and HTTP URLs with parallel range requests. Credentials follow the
usual environment-variable chain (AWS_ACCESS_KEY_ID, instance
profiles). For large lakes, partition pruning by path (year=2026/month=03/)
combined with Parquet statistics minimizes bytes transferred — critical when
egress is billed per gigabyte.
Persistent vs in-memory databases
:memory: is the default for ephemeral notebook sessions.
duckdb my_analytics.duckdb creates a single-file database similar
to SQLite, persisting tables and indexes across restarts. Many teams use
in-memory for exploration and write results to Parquet on disk as the
durable artifact — a pattern that pairs well with
ELT pipelines
where the lake is source of truth.
SQL features that matter for analytics
DuckDB speaks ANSI SQL with a rich analytical surface:
- Window functions — running totals,
LAG/LEAD, cohort ranks, andNTILEdeciles without self-joins. - Advanced aggregations —
LIST,ARRAY_AGG,QUANTILE,APPROX_COUNT_DISTINCTfor funnel and distribution analysis. - ASOF joins — match each order to the exchange rate effective at order time without a messy inequality join.
- Pivot and unpivot — reshape wide marketing spreadsheets into tidy tables in one statement.
- Nested types —
STRUCT,LIST, andMAPcolumns for semi-structured event payloads; query with dot notation andUNNEST.
The query planner includes a cost-based optimizer with join reordering and parallel execution across CPU cores. For SQL syntax basics and optimization habits that transfer directly, see those guides — DuckDB rewards the same index-aware thinking when you do persist tables.
Python, R, and the notebook workflow
The duckdb Python package is the primary interface for data
scientists. Two patterns dominate:
import duckdb
# Query a pandas DataFrame without copying data
df = duckdb.sql("SELECT category, AVG(price) FROM df GROUP BY 1").df()
# Register a PyArrow table for zero-copy scans
duckdb.register("events", arrow_table)
duckdb.sql("SELECT COUNT(*) FROM events WHERE event = 'purchase'")
The rel API builds lazy query graphs; .df(),
.arrow(), and .pl() export to pandas, PyArrow, or
Polars. In Jupyter, %sql magics via duckdb-engine
let analysts write SQL cells that reference Python variables. R users get
similar bindings through duckdb::duckdb() and DBI.
Integration with the modern data stack
DuckDB plugs into dbt via the dbt-duckdb adapter for local model
development before promoting to Snowflake or BigQuery. Tools like MotherDuck
add a hosted coordination layer for shared DuckDB databases across a team —
useful when "works on my laptop" needs a managed sync path without jumping
straight to a full cloud warehouse.
Worked example: Harbor Supply order analytics
Harbor Supply exports nightly order Parquet partitions to
s3://harbor-analytics/orders/ with schema
(order_id, customer_id, sku, quantity, unit_price, channel, order_ts).
The BI team needs weekly revenue by channel and a repeat-purchase rate for Q1
without provisioning a warehouse.
INSTALL httpfs; LOAD httpfs;
SET s3_region = 'us-east-1';
CREATE TABLE orders AS
SELECT * FROM read_parquet('s3://harbor-analytics/orders/2026-0[1-3]/**/*.parquet');
-- Weekly revenue by channel
SELECT date_trunc('week', order_ts) AS week,
channel,
SUM(quantity * unit_price) AS revenue
FROM orders
GROUP BY 1, 2
ORDER BY 1, 3 DESC;
-- Repeat purchase rate: customers with 2+ orders in Q1
WITH customer_counts AS (
SELECT customer_id, COUNT(*) AS n
FROM orders
GROUP BY 1
)
SELECT AVG(CASE WHEN n >= 2 THEN 1.0 ELSE 0.0 END) AS repeat_rate
FROM customer_counts;
The entire analysis runs on a 16 GB laptop in under a minute for 40 million rows because DuckDB parallelizes the scan and aggregation. Results export to CSV for the finance team's Excel models. When the dataset grows past single-node RAM, Harbor promotes the dbt models to their Snowflake warehouse — same SQL dialect family, different scale tier.
Decision table: DuckDB vs alternatives
| Need | Reach for | Why |
|---|---|---|
| Ad hoc analytics on Parquet/CSV locally | DuckDB | In-process, no cluster, file queries without import |
| Mobile app / edge OLTP, single-writer | SQLite | Row-oriented, battle-tested embedded OLTP, tiny footprint |
| Production web API with concurrent writes | PostgreSQL | MVCC, replication, mature ecosystem, OLTP-tuned |
| Petabyte lakehouse, multi-tenant BI at scale | Snowflake / BigQuery / Spark | Distributed storage and compute, governance, concurrency |
| In-memory caching and session store | Redis | Sub-millisecond key access, not an SQL analytics engine |
| Stream processing with replayable log | Kafka + Flink/Spark | Continuous ingestion; DuckDB is batch/interactive, not a stream processor |
Common pitfalls
- Using DuckDB as your app's primary database. High-concurrency small writes and strict durability requirements belong in Postgres. DuckDB handles concurrent reads well but is not designed for thousands of OLTP connections.
- Out-of-memory on huge joins. DuckDB uses disk spill for
large intermediates, but a cartesian join on two 100 GB tables will still
hurt. Filter early, pick sensible join keys, and monitor
max_memorysettings. - Re-scanning CSV every run. CSV parsing is slow and type-inference is fragile. Convert to Parquet once; query Parquet thereafter.
- Ignoring file layout. Unpartitioned billion-row Parquet with tiny row groups defeats predicate pushdown. Partition by date, sort by filter columns, target 128 MB–512 MB row groups.
- Assuming full Snowflake compatibility. Most analytical SQL transfers; proprietary UDFs, governance features, and elastic scale do not. Test dbt models locally before promotion.
- Storing secrets in SQL scripts. Use environment variables or IAM roles for S3 credentials — same hygiene as secrets management elsewhere.
Practitioner checklist
- Classify the workload: OLAP exploration and reporting → DuckDB is a strong candidate; OLTP serving → use Postgres or SQLite.
- Prefer Parquet as the interchange format; partition by time or tenant.
- Enable
httpfs(or cloud extension) for remote files; verify credentials and region. - Start with
read_parquet/read_csv_auto; persist to a.duckdbfile only when reuse justifies storage. - Use
EXPLAIN ANALYZEto confirm filters push down and joins parallelize. - Export results to Parquet or Arrow for downstream dashboards; do not treat the notebook as the system of record.
- Plan the graduation path: when data or concurrency exceeds one machine, promote models to a distributed warehouse without rewriting all SQL.
Related reading
- SQLite fundamentals explained — embedded OLTP patterns and when row storage wins
- PostgreSQL fundamentals explained — production OLTP, MVCC, and indexing
- Data warehouses and lakehouses explained — star schema, Delta Lake, and cloud OLAP
- ETL and ELT data pipelines explained — ingestion, transformation, and orchestration