Explainer · 7 June 2026
How LSM trees work
A relational database with a classic B-tree index must find the right leaf page and rewrite it on every insert — random I/O on spinning disks was the bottleneck for decades. A log-structured merge tree (LSM tree) flips the model: writes append sequentially to an in-memory buffer and later merge into immutable on-disk files. Reads get more expensive (you may consult several files), but ingest throughput soars. RocksDB, LevelDB, Cassandra, ScyllaDB, and the storage layer behind many time-series databases all share this architecture. This explainer walks through memtables, SSTables, compaction, and the amplification metrics that decide whether an LSM fits your workload.
The core idea: append writes, merge later
An LSM tree is not a single tree structure in memory like a B-tree. It is a pipeline of components:
- Write-ahead log (WAL) — every mutation is appended to a durable log file first. If the process crashes, replay the WAL to recover the memtable.
- Memtable — an in-memory sorted structure (often a skip list or red-black tree) holding recent writes. Lookups check here first.
- SSTables (sorted string tables) — when the memtable reaches a size threshold, it is frozen, flushed to disk as an immutable sorted file, and a fresh memtable starts.
- Compaction — background jobs merge overlapping SSTables, discard tombstones, and enforce size tiers so reads do not scan dozens of files forever.
The write path is therefore mostly sequential: append to WAL, insert into memtable, occasionally flush a large sorted run. Compare that to a B-tree update that may touch a random 4 KB page buried deep in the tree. On SSDs the gap narrows — random writes are cheaper — but LSM engines still win when insert rate dominates and you can tolerate read amplification. For B-tree mechanics and when traditional indexes help SELECT queries, see our database indexing guide.
SSTable anatomy
Each on-disk SSTable is immutable once written. Typical layout:
- Data blocks — sorted key-value pairs, often compressed (Snappy, LZ4, or Zstd).
- Index block — sparse keys pointing into data blocks so a lookup binary-searches the index, then one block.
- Bloom filter — a compact probabilistic structure (see our Bloom filters explainer) that lets the engine skip entire SSTables when a key is definitely absent.
- Footer — offsets to index, filter, and metadata.
Because files are immutable, concurrent reads need no locking — a property that simplifies replication and caching. Updates are modeled as new entries with the same key; deletes write a tombstone marker. Only compaction removes obsolete versions.
Read path: layering and amplification
To read key K, the engine:
- Searches the active memtable, then any frozen memtables awaiting flush.
- Walks SSTable levels from newest to oldest, using Bloom filters to skip
files that cannot contain
K. - Returns the first live value found (newest wins).
Read amplification is the average number of disk pages (or SSTables) touched per logical read. Without compaction, every flush adds another file to scan — reads degrade linearly with ingest volume. Compaction exists to buy back read performance at the cost of background I/O (write amplification: rewriting the same key multiple times as it migrates through levels).
Range scans and SELECT * WHERE ts BETWEEN queries are especially
sensitive: they may merge iterators across many SSTables. Point lookups with
good Bloom filters stay relatively cheap.
Compaction strategies
Compaction is the engineering heart of any LSM. Two families dominate:
Leveled compaction (LevelDB, RocksDB default)
SSTables are organized into levels L0 … Ln. L0 files
may overlap in key range; deeper levels are non-overlapping and exponentially
larger. When L0 accumulates too many files, a job picks one and
merges it into L1, which may cascade into L2, and so
on. Leveled compaction bounds read amplification (fewer files per level) but
rewrites more data — higher write amplification and compaction CPU.
Tiered / size-tiered compaction (Cassandra, older HBase)
SSTables of similar size are merged into the next tier (e.g. four 10 MB files become one 40 MB file). Write amplification is lower because data is not repeatedly pushed through many levels, but read amplification grows — more overlapping files per tier. Good for write-heavy, append-mostly workloads where reads are rare or cached.
Universal / FIFO variants
RocksDB also supports universal compaction (merge by size ratio, fewer rules) and FIFO (drop oldest files — useful for TTL caches). Choice depends on whether you optimize for p99 read latency, write throughput, or disk space.
Amplification metrics in practice
| Metric | Meaning | Typical tension |
|---|---|---|
| Write amplification (WA) | Bytes written to disk per byte ingested | Leveled > tiered; high WA wears SSDs faster |
| Read amplification (RA) | Disk reads per logical read | Tiered > leveled; hurts point lookups |
| Space amplification (SA) | Disk used vs live data (old versions + tombstones) | Compaction lag inflates SA until tombstones purge |
Tuning knobs — memtable size, level count, compaction threads, Bloom bits per key — are all levers on this triangle. There is no free lunch; you pick which amplification your SLA can absorb.
LSM vs B-tree: when to choose which
Prefer LSM when:
- Write rate is high and mostly inserts or updates to recent keys (metrics, logs, IoT, blockchain indexes).
- You can absorb compaction spikes in the background or isolate them on separate disks.
- Immutable files simplify replication (ship SSTables instead of page-level diffs).
Prefer B-tree (Postgres, MySQL InnoDB) when:
- Reads dominate and must stay predictable with complex range queries and secondary indexes on many columns.
- Update-in-place semantics and strong transactional isolation on a single node matter more than raw ingest (see ACID transactions explained).
- Write volume is moderate — B-tree random writes on NVMe are often fine.
Hybrid systems exist: MyRocks puts RocksDB under MySQL; CockroachDB uses Pebble (LSM) with distributed consensus on top. The storage engine and the query layer are separable concerns.
Operational pitfalls
- Compaction debt — if ingest outpaces compaction, L0 file count explodes and read latency spikes. Watch pending compaction bytes and L0 stall triggers.
- Tombstone accumulation — heavy deletes without compaction leave ghost keys that slow scans until merged away.
- Hot keys — all writes to one key still serialize through the memtable; LSM does not fix application-level hotspots.
- Space spikes during compaction — merging two large SSTables temporarily needs disk for inputs plus output before old files delete.
- Cache sensitivity — block cache and OS page cache matter
enormously; cold reads on deep levels hurt. Pair with the
file systems explainer
to understand why
fsyncon the WAL dominates write latency.
Practical checklist
- Model write rate, delete rate, and read pattern before picking LSM vs B-tree.
- Monitor L0 file count, compaction pending bytes, and stall counters.
- Size Bloom filters and block cache for your key distribution.
- Isolate WAL and compaction I/O on fast separate volumes when possible.
- Load-test compaction under sustained ingest — not just steady-state reads.
- Document tombstone TTL and compaction strategy for operators.
Related on Solana Garden: Database indexing guide (B-trees and EXPLAIN plans), ACID transactions explained, Inverted indexes explained, Bloom filters explained, Explainers hub.