Explainer · 7 June 2026

How database query execution plans work

You write SELECT in SQL; the database does not execute it literally line by line. A query optimizer rewrites your request into a physical execution plan — a tree of operators like sequential scans, index seeks, hash joins, and sorts — chosen to minimize estimated cost. When a query suddenly runs ten times slower after a data growth spurt, the plan almost always changed. Learning to read EXPLAIN output is how you stop guessing and start fixing.

From SQL text to operator tree

Processing happens in stages. The parser checks syntax and builds a parse tree. The analyzer resolves table and column names against the catalog, infers types, and expands views. The rewriter applies rules — folding constants, pushing predicates into subqueries, replacing views with base-table equivalents.

The optimizer then explores equivalent logical plans: which join order, which access path per table, whether to sort early or late. Each logical choice maps to physical operators with measurable costs. PostgreSQL calls the result a plan node tree; MySQL's EXPLAIN shows a flatter row-per-table view; SQLite prints an indented tree. Different shapes, same idea: a recipe the executor follows row batch by row batch.

Indexes are not magic speed buttons — they are alternate access paths the optimizer prices against sequential scans. Our database indexing guide covers B-tree structure and when indexes help; this article covers how the engine chooses among those paths once indexes exist.

Table access: sequential scan vs index seek

A sequential scan (Seq Scan, Table Scan) reads every row in heap order. Cost grows linearly with table size. It wins when you need most of the table anyway — SELECT * FROM orders WHERE status = 'shipped' on a table where 80% of rows are shipped — or when the table is tiny enough that random I/O for an index lookup costs more than one sequential pass.

An index scan walks a B-tree (or other index structure) to find matching keys, then fetches heap tuples by row ID. An index-only scan is cheaper still: if all requested columns live in the index leaf pages and the visibility map says heap fetches are unnecessary, the engine never touches the main table. That is why covering indexes — indexes that include every column in the SELECT and WHERE — can collapse latency by orders of magnitude.

Bitmap index scans (PostgreSQL) combine multiple index conditions: probe each index, bitmap-AND the row sets, then fetch matching heap pages in physical order to reduce random I/O. They appear when selective predicates exist but no single index covers the filter well.

Columnar and LSM-tree engines (ClickHouse, RocksDB-backed stores) use different access patterns — zone maps, bloom filters, compaction tiers — but the optimizer still asks the same question: how many rows will this operator emit, and at what I/O cost?

Join algorithms and why order matters

Joining tables A and B requires matching rows on a key. Three algorithms dominate relational engines:

Nested-loop join — for each row in the outer table, scan the inner for matches. Simple; catastrophic if the inner side lacks an index and both tables are large. Fine when the outer is tiny (five rows) or the inner is indexed.
Hash join — build an in-memory hash table on the smaller relation, probe with the larger. Excellent for equality joins at scale when the hash fits in work_mem. Spills to disk if not — watch for Hash Join followed by Disk in plans.
Merge join — both inputs sorted on the join key, merge like zipper teeth. Needs presorted inputs or explicit sort nodes upfront; shines for range joins and already-ordered clustered indexes.

With three tables, join order explodes combinatorially. A query joining users, orders, and payments has six permutations; ten tables have millions. Optimizers use dynamic programming (PostgreSQL) or greedy heuristics (MySQL) to cap search time. A bad join order — putting the largest unfiltered table in the middle — can multiply intermediate row counts and swamp memory.

Predicate pushdown is the optimizer's best friend: filter each table before joining so intermediate results stay small. If your plan shows a join producing millions of rows followed by a filter, the predicate was not pushed — often a symptom of a function on an indexed column (WHERE DATE(created_at) = ...) that prevents index use.

Cost models and cardinality estimates

Cost-based optimizers assign each operator a numeric cost — usually abstract units blending CPU and I/O, not milliseconds. PostgreSQL's seq_page_cost and random_page_cost GUC parameters tune how harshly random index lookups are penalized relative to sequential reads. SSD deployments often lower random_page_cost because random I/O is cheaper than on spinning disks.

Costs multiply by cardinality estimates: how many rows each operator outputs. Estimates come from table statistics — row counts, distinct value counts, histograms of value frequency, correlation between columns. When statistics lie, plans lie. Classic failure modes:

Stale stats after a bulk load — run ANALYZE.
Uniform distribution assumed where data is skewed — one user_id owns 40% of events; the optimizer expects 0.01%.
Correlated predicates treated as independent — city = 'Paris' AND country = 'France' estimated as product of separate selectivities.
Parameterized queries with peeking disabled — first-seen parameter values poison plan cache (less common on modern PostgreSQL with adaptive plans).

EXPLAIN (ANALYZE, BUFFERS) runs the query and compares estimated rows to actual rows at each node. A tenfold underestimate on a nested-loop inner side is a smoking gun. Extended statistics (multivariate histograms, expression indexes) exist precisely to fix these gaps.

Sorting, aggregation, and parallelism

ORDER BY, GROUP BY, and DISTINCT often add Sort or HashAggregate nodes. Sorts spill to disk when work_mem is exceeded — Sort Method: external merge in PostgreSQL output. Raising work_mem globally is dangerous (each sort node can allocate that much); per-session tuning for heavy reports is safer.

Hash aggregate builds a hash table keyed by group columns — O(n) memory if groups fit, otherwise spills. Sort + GroupAggregate sorts first then scans for group boundaries — better when groups are few and ordering is already required.

Parallel query plans add Gather nodes: workers scan or join partitions concurrently, coordinator merges. Parallelism helps large sequential scans and hash joins; it hurts tiny queries where process startup dominates. Index probes are often serial because random access does not shard cleanly.

Transactions, locks, and plan stability

Execution plans interact with concurrency. A plan that uses an index may still block on row-level locks held by another transaction's update. Long-running scans under REPEATABLE READ or SERIALIZABLE isolation can trigger serialization failures or hold snapshots that prevent vacuum cleanup — bloat that eventually slows every scan.

ACID transactions guarantee correctness; they do not guarantee performance. An index-friendly point lookup inside a short transaction beats a perfect plan running inside a transaction that holds locks for thirty seconds.

Prepared statements and plan caching reuse compiled plans across executions. Usually beneficial, but a plan compiled for an empty table may choose nested loops that fail at production scale. Some teams use pg_stat_statements to find regressions after deploys; others force plan invalidation on major migrations.

A practical tuning workflow

Identify the slow query from logs or pg_stat_statements / Performance Schema — median and p99, not one-off dev runs.
Run EXPLAIN (ANALYZE, BUFFERS) with production-like data volume and parameters.
Compare estimated vs actual row counts; fix statistics before adding indexes.
Look for sequential scans on large tables with selective predicates — candidate for indexes or rewrite.
Check join types and order; verify the smallest filtered table drives the join.
Confirm sorts and hash builds stay in memory; tune work_mem per session if needed.
Re-measure under concurrent load — a plan that wins in isolation may contend on hot pages.

Blockchain indexers face the same pattern at scale: ingesting millions of account updates into Postgres or ClickHouse, then serving wallet history queries. The on-chain program is deterministic; the off-chain database is where latency hides. Execution-plan literacy separates indexers that feel instant from those that time out on popular wallets.