Explainer · 7 June 2026
How database query execution plans work
You write SELECT in SQL; the database does not execute it literally
line by line. A query optimizer rewrites your request into a
physical execution plan — a tree of operators like sequential
scans, index seeks, hash joins, and sorts — chosen to minimize estimated cost.
When a query suddenly runs ten times slower after a data growth spurt, the plan
almost always changed. Learning to read EXPLAIN output is how you
stop guessing and start fixing.
From SQL text to operator tree
Processing happens in stages. The parser checks syntax and builds a parse tree. The analyzer resolves table and column names against the catalog, infers types, and expands views. The rewriter applies rules — folding constants, pushing predicates into subqueries, replacing views with base-table equivalents.
The optimizer then explores equivalent logical plans: which
join order, which access path per table, whether to sort early or late. Each
logical choice maps to physical operators with measurable
costs. PostgreSQL calls the result a plan node tree; MySQL's
EXPLAIN shows a flatter row-per-table view; SQLite prints an
indented tree. Different shapes, same idea: a recipe the executor follows
row batch by row batch.
Indexes are not magic speed buttons — they are alternate access paths the optimizer prices against sequential scans. Our database indexing guide covers B-tree structure and when indexes help; this article covers how the engine chooses among those paths once indexes exist.
Table access: sequential scan vs index seek
A sequential scan (Seq Scan, Table Scan) reads every row in
heap order. Cost grows linearly with table size. It wins when you need most of
the table anyway — SELECT * FROM orders WHERE status = 'shipped'
on a table where 80% of rows are shipped — or when the table is tiny enough
that random I/O for an index lookup costs more than one sequential pass.
An index scan walks a B-tree (or other index structure) to find
matching keys, then fetches heap tuples by row ID. An index-only
scan is cheaper still: if all requested columns live in the index
leaf pages and the visibility map says heap fetches are unnecessary, the engine
never touches the main table. That is why covering indexes — indexes that
include every column in the SELECT and WHERE — can
collapse latency by orders of magnitude.
Bitmap index scans (PostgreSQL) combine multiple index conditions: probe each index, bitmap-AND the row sets, then fetch matching heap pages in physical order to reduce random I/O. They appear when selective predicates exist but no single index covers the filter well.
Columnar and LSM-tree engines (ClickHouse, RocksDB-backed stores) use different access patterns — zone maps, bloom filters, compaction tiers — but the optimizer still asks the same question: how many rows will this operator emit, and at what I/O cost?
Join algorithms and why order matters
Joining tables A and B requires matching rows on a key. Three algorithms dominate relational engines:
- Nested-loop join — for each row in the outer table, scan the inner for matches. Simple; catastrophic if the inner side lacks an index and both tables are large. Fine when the outer is tiny (five rows) or the inner is indexed.
- Hash join — build an in-memory hash table on the smaller
relation, probe with the larger. Excellent for equality joins at scale when the
hash fits in
work_mem. Spills to disk if not — watch forHash Joinfollowed byDiskin plans. - Merge join — both inputs sorted on the join key, merge like zipper teeth. Needs presorted inputs or explicit sort nodes upfront; shines for range joins and already-ordered clustered indexes.
With three tables, join order explodes combinatorially. A query joining
users, orders, and payments has six
permutations; ten tables have millions. Optimizers use dynamic programming
(PostgreSQL) or greedy heuristics (MySQL) to cap search time. A bad join order
— putting the largest unfiltered table in the middle — can multiply intermediate
row counts and swamp memory.
Predicate pushdown is the optimizer's best friend: filter each
table before joining so intermediate results stay small. If your plan shows a
join producing millions of rows followed by a filter, the predicate was not
pushed — often a symptom of a function on an indexed column
(WHERE DATE(created_at) = ...) that prevents index use.
Cost models and cardinality estimates
Cost-based optimizers assign each operator a numeric cost — usually abstract
units blending CPU and I/O, not milliseconds. PostgreSQL's
seq_page_cost and random_page_cost GUC parameters
tune how harshly random index lookups are penalized relative to sequential
reads. SSD deployments often lower random_page_cost because
random I/O is cheaper than on spinning disks.
Costs multiply by cardinality estimates: how many rows each operator outputs. Estimates come from table statistics — row counts, distinct value counts, histograms of value frequency, correlation between columns. When statistics lie, plans lie. Classic failure modes:
- Stale stats after a bulk load — run
ANALYZE. - Uniform distribution assumed where data is skewed — one
user_idowns 40% of events; the optimizer expects 0.01%. - Correlated predicates treated as independent —
city = 'Paris' AND country = 'France'estimated as product of separate selectivities. - Parameterized queries with peeking disabled — first-seen parameter values poison plan cache (less common on modern PostgreSQL with adaptive plans).
EXPLAIN (ANALYZE, BUFFERS) runs the query and compares estimated
rows to actual rows at each node. A tenfold underestimate on a nested-loop
inner side is a smoking gun. Extended statistics (multivariate histograms,
expression indexes) exist precisely to fix these gaps.
Sorting, aggregation, and parallelism
ORDER BY, GROUP BY, and DISTINCT often
add Sort or HashAggregate nodes. Sorts spill
to disk when work_mem is exceeded — Sort Method: external
merge in PostgreSQL output. Raising work_mem globally is
dangerous (each sort node can allocate that much); per-session tuning for heavy
reports is safer.
Hash aggregate builds a hash table keyed by group columns — O(n) memory if groups fit, otherwise spills. Sort + GroupAggregate sorts first then scans for group boundaries — better when groups are few and ordering is already required.
Parallel query plans add Gather nodes: workers scan or join
partitions concurrently, coordinator merges. Parallelism helps large sequential
scans and hash joins; it hurts tiny queries where process startup dominates.
Index probes are often serial because random access does not shard cleanly.
Transactions, locks, and plan stability
Execution plans interact with concurrency. A plan that uses an index may still
block on row-level locks held by another transaction's update. Long-running
scans under REPEATABLE READ or SERIALIZABLE isolation
can trigger serialization failures or hold snapshots that prevent vacuum cleanup
— bloat that eventually slows every scan.
ACID transactions guarantee correctness; they do not guarantee performance. An index-friendly point lookup inside a short transaction beats a perfect plan running inside a transaction that holds locks for thirty seconds.
Prepared statements and plan caching reuse compiled plans across
executions. Usually beneficial, but a plan compiled for an empty table may
choose nested loops that fail at production scale. Some teams use
pg_stat_statements to find regressions after deploys; others force
plan invalidation on major migrations.
A practical tuning workflow
- Identify the slow query from logs or
pg_stat_statements/ Performance Schema — median and p99, not one-off dev runs. - Run
EXPLAIN (ANALYZE, BUFFERS)with production-like data volume and parameters. - Compare estimated vs actual row counts; fix statistics before adding indexes.
- Look for sequential scans on large tables with selective predicates — candidate for indexes or rewrite.
- Check join types and order; verify the smallest filtered table drives the join.
- Confirm sorts and hash builds stay in memory; tune
work_memper session if needed. - Re-measure under concurrent load — a plan that wins in isolation may contend on hot pages.
Blockchain indexers face the same pattern at scale: ingesting millions of account updates into Postgres or ClickHouse, then serving wallet history queries. The on-chain program is deterministic; the off-chain database is where latency hides. Execution-plan literacy separates indexers that feel instant from those that time out on popular wallets.
Related on Solana Garden: Database indexing explained, ACID database transactions explained, LSM trees explained, Application caching with Redis explained, Explainers hub.