Explainer · 7 June 2026
How B-trees and B+ trees work
When a relational database promises fast lookups on a sorted key — user ID,
timestamp, wallet address — the index underneath is almost always a
B+ tree. Not a binary search tree that grows tall and cache-hostile,
not a hash table that forgets order, but a wide, shallow tree whose
nodes are sized to match a disk page. Understanding that structure explains why
WHERE id = 7 is cheap, why WHERE created_at BETWEEN ...
works at all, and why the wrong index shape can slow writes without helping reads.
Why binary trees fail on disk
A classic binary search tree stores one key per node with left and right child pointers. In memory, that is fine — each comparison is nanoseconds. On spinning disks or even NVMe accessed through a buffer pool, the dominant cost is fetching a page (typically 4–16 KB), not comparing integers inside it.
A tall skinny tree means many page faults per lookup. A million keys in a balanced binary tree needs about 20 levels — up to 20 random reads. Random I/O is what databases spend their lives avoiding. The B-tree family (Bayer and McCreight, 1972) fixes this by packing hundreds of keys into each node so the tree stays very shallow: a billion-row index might be only three or four levels deep.
The "B" originally stood for Bayer; practitioners often say "balanced" because every leaf sits at the same depth and the tree self-balances through splits and merges rather than explicit rotations like red-black trees. Linux's Completely Fair Scheduler uses a red-black tree in RAM; your Postgres primary index does not — different hardware assumptions, different structure.
Node layout and the order parameter
A B-tree node has up to m - 1 keys and m child
pointers, where m is the order (fanout). Nodes
are kept between half-full and completely full (except the root). Invariant:
all keys in the left subtree are smaller than the separator, all keys in the
right subtree are larger.
On disk, m is not chosen for algorithmic elegance — it is chosen so
a node fits exactly in one page. InnoDB's default 16 KB page
might hold a few hundred 8-byte integers plus child pointers; Postgres B-tree
pages are 8 KB by default. Wider nodes mean fewer levels but more work per
node (binary search within the page is still cheap compared to I/O).
Search walks from root to leaf: at each node, binary-search the
key array to pick the child pointer, repeat until you hit a leaf. Cost is
O(logm n) comparisons, but only O(tree height) page
reads — and height is tiny.
B-tree vs B+ tree: where the data lives
In a classic B-tree, keys and their associated values (row pointers, payloads) can appear in internal nodes as well as leaves. In a B+ tree, internal nodes carry only keys for routing; all actual records or row pointers sit in the leaves. Internal nodes are pure signposts.
That split matters for two reasons:
- Higher fanout — Internal nodes do not waste space on values, so each page holds more separators and the tree is even shallower.
- Leaf linked lists — B+ tree leaves link to their neighbor at the same level. Once you find the first key in a range, you scan forward along the leaf chain without climbing back up the tree.
Range queries — price > 100 ORDER BY price LIMIT 50, time-series
windows, pagination by cursor — are why SQL defaults to B+ trees. Hash indexes
answer equality only; they cannot walk sorted order. For how optimizers choose
between scan types, see
query execution plans;
for practical index design, see
database indexing.
Insert, split, delete, merge
Insert finds the target leaf (same search path), appends the key in sorted order. If the leaf is full, it splits: the node divides into two half-full siblings, and the middle key promotes to the parent. If the parent fills up, it splits too — possibly propagating to the root. A root split grows the tree by one level. Splits are local; you never rewrite the whole tree.
Delete removes a key from a leaf. If the node drops below half full, the tree rebalances: borrow a key from a sibling, or merge two underfull siblings and remove the separator from the parent. Merges can cascade upward just like splits.
Write amplification is real: inserting one row may touch the leaf, a parent, and sometimes ancestors — each touch is a dirty page eventually flushed to disk. That is why heavy indexing slows INSERT/UPDATE throughput and why LSM trees (append-only memtables + background compaction) win on write-heavy workloads even though B+ trees still dominate OLTP defaults.
Clustered vs secondary indexes
InnoDB's clustered index is the table: leaf pages store the full row keyed by primary key. Secondary indexes store (secondary_key → primary_key) in their own B+ trees; fetching columns not in the index requires a bookmark lookup back to the clustered tree — two tree walks.
Postgres treats the heap and indexes separately: indexes store tuple identifiers (block, offset) pointing into table pages. A "covering index" includes all columns the query needs so the planner can satisfy the query from the index alone — an index-only scan avoids heap fetches when visibility map bits allow it.
Both engines still use B+ tree pages; the difference is what sits in the leaf. Understanding that distinction prevents surprises when EXPLAIN shows "Index Scan" followed by "Heap Fetch" on millions of rows.
B+ trees vs other ordered structures
Hash tables (hash maps)
give O(1) expected equality but no ordering. Use them when the query is only
= and the engine offers a hash index (Postgres hash indexes are
rare in practice; in-memory hash joins are a different story).
Skip lists provide O(log n) sorted access with probabilistic towers instead of splits — Redis sorted sets use them in memory. On disk, pointer-chasing through skip-list towers is less page-friendly than wide B+ nodes.
Tries excel at prefix keys (autocomplete, IP routing) but use more memory per key unless compressed — prefix trees complement rather than replace B+ trees for general relational keys.
LSM trees batch writes into sequential logs and merge sorted runs later — better write throughput, more complex read path (check memtable, then multiple SSTable levels). RocksDB, LevelDB, and many time-series stores use LSM; traditional RDBMS row stores stick with B+ trees for predictable read latency.
What operators actually store in pages
Real implementations add decades of engineering:
- Key compression — Postgres can deduplicate prefix bytes between adjacent keys on a page; InnoDB has similar tricks.
- Fillfactor — Leaves created only 90% full leave room for hot inserts without immediate splits (trade: slightly larger index).
- Version columns / MVCC — Postgres index entries point to row versions; vacuum reclaims dead tuples. B+ structure unchanged; churn affects bloat.
- Partial and expression indexes — Still B+ trees, just over a subset or computed key — the structure is the same, the key extractor differs.
When a query is slow, the fix is rarely "understand B-trees better" alone — you
need EXPLAIN, cardinality estimates, and whether the index matches the predicate.
But when the planner says Index Scan using idx_users_email, this is
the machinery underneath: shallow tree, sorted leaves, a handful of page reads.
Operational takeaways
- Match index key order to query predicates — Compound indexes are left-prefix;
(a, b)helpsWHERE a = ?but notWHERE b = ?alone. - Expect write cost per index — Each extra B+ tree must update on insert/delete; batch writes when possible.
- Monitor bloat and page splits — Autovacuum (Postgres) or optimize table (InnoDB) matter because logical deletes leave empty slots until maintenance.
- Choose LSM consciously for write-heavy stores — Do not assume every database should mimic RocksDB; OLTP read latency often favors B+ trees.
Related on Solana Garden: Database indexing explained, Query execution plans, LSM trees, Skip lists, Explainers hub.