Explainer · 7 June 2026

How CPU caches and the memory hierarchy work

A modern CPU core can retire an instruction every clock cycle at 4 GHz — roughly four billion operations per second. Main memory (DRAM) sits on a different planet: a single random load from RAM might cost 200–300 cycles while the core stalls waiting for data. Multiply that across every array walk, pointer chase, and hash-table probe in your program and the gap explains why two algorithms with identical Big-O complexity can differ by an order of magnitude in wall-clock time. CPU caches are small, fast SRAM buffers that sit between the core and DRAM, automatically keeping recently used data close so the pipeline does not starve.

The memory hierarchy pyramid

Hardware is organized as a ladder of speed, size, and cost per bit. At the top:

Registers — dozens of operands the ALU can touch in one cycle; the compiler (or you, in assembly) names them explicitly.
L1 cache — typically 32–64 KiB per core, split into instruction (L1i) and data (L1d). Hit latency is on the order of 4 cycles.
L2 cache — hundreds of KiB to a few MiB per core; ~12 cycles on many x86 designs.
L3 cache — often shared across cores, tens of MiB; ~40 cycles but still far faster than DRAM.
Main memory (DRAM) — gigabytes, ~200+ cycles per miss.
Storage (SSD/NVMe) — milliseconds unless data was prefetched into RAM by the OS page cache.

Each level is larger and slower than the one above. The hardware's job is to make the common case look like everything lives in L1. Your job as a programmer is to make the common case actually reuse nearby data — that property is called locality.

Temporal and spatial locality

Caches exploit two predictable patterns in real programs:

Temporal locality — if you read variable x now, you will probably read it again soon (loop counters, hot object fields).
Spatial locality — if you read byte at address A, you will probably read A+1, A+2, … soon (scanning arrays, sequential file I/O).

A row-major matrix multiply enjoys spatial locality along rows; column-major access on a row-major layout jumps strides and thrashes cache lines. Linked lists have poor spatial locality because nodes scatter across the heap — one reason a dense hash table with open addressing can beat a tree on modern hardware despite worse asymptotics for some operations.

Cache lines: the unit of transfer

Caches do not fetch individual bytes. They move fixed-size blocks — almost always 64 bytes on x86-64 and ARM64 — called cache lines. When you load one integer from an array, the hardware pulls the entire line containing that integer into L1. Subsequent loads to neighbors in the same line are cheap hits.

This has two consequences developers feel directly:

Alignment and padding — structures that cross line boundaries may require two loads instead of one.
False sharing — two threads updating different variables that happen to live in the same cache line force the cores to bounce exclusive ownership back and forth, serializing what looks like independent work. Fix by padding hot counters to cache-line boundaries (often 64 bytes).

Profilers like perf on Linux expose cache-misses and cache-references events; a high miss rate on an apparently linear scan is a signal to revisit layout or access order before rewriting the algorithm.

How a lookup works: tags, sets, and associativity

Conceptually each cache line slot stores:

Tag — which region of main memory this line came from.
Data — the 64 bytes of payload.
Status bits — valid, dirty (modified), shared/exclusive in multi-core coherence protocols like MESI.

Physical memory addresses map into cache sets. In a direct-mapped cache each address has exactly one home slot — simple hardware, but two hot lines fighting for the same slot cause conflict misses even when plenty of empty capacity exists elsewhere. Real CPUs use set-associative caches: each set holds N ways (4-way, 8-way, …) and the controller picks among them on a miss. Higher associativity reduces conflict misses at the cost of lookup latency and silicon area.

On a cache hit, the core gets data in a handful of cycles. On a miss, the line is fetched from a deeper level (L2, then L3, then DRAM) and the instruction that triggered the miss stalls — or, on out-of-order CPUs, later independent instructions may continue while the load retires when data arrives.

Write policies and eviction

Stores complicate caching because memory must eventually reflect updates:

Write-through — every store updates cache and DRAM immediately. Simple coherence, but DRAM traffic on every write.
Write-back — stores hit the cache line and mark it dirty; DRAM is updated only when the line is evicted. Most L1/L2 designs use write-back for bandwidth efficiency.

When a set is full and a new line must enter, hardware evicts an existing line. The replacement policy is usually an approximation of LRU (least recently used) — track which way in the set was touched longest ago. Streaming through an array larger than the cache evicts old lines before you reuse them: one pass over a 1 GiB buffer on a 32 MiB L3 is effectively uncached from the CPU's perspective even though "the data is in RAM."

Prefetching and TLBs

Hardware prefetchers watch access patterns and issue speculative loads ahead of explicit instructions — stride prefetchers excel on sequential scans; pointer-chasing defeats them. Software can hint with intrinsics (_mm_prefetch on x86) when you know the next touch but the compiler does not.

Before a cache lookup, the CPU translates virtual addresses to physical frames via the translation lookaside buffer (TLB) — a cache for page table entries. A TLB miss adds dozens of cycles on top of data cache misses. Huge pages (2 MiB / 1 GiB on Linux) reduce TLB pressure for large, long-lived allocations such as database buffer pools — the same workloads where B-tree index pages benefit from fitting more keys per cache line.

Layout choices that actually matter

You rarely control cache associativity, but you control data shape:

Array of structs vs struct of arrays — processing one field across millions of rows (SoA) keeps each inner loop sequential in memory; AoS scatters fields and wastes line bandwidth when you only need .x.
Hot/cold splitting — separate rarely touched metadata from fields updated every iteration so each cache line carries useful bytes.
Blocking / tiling — matrix kernels partition work into tiles that fit in L2/L3 so reused values stay resident across the inner loops.
Pointer chasing depth — each hop is a potential miss; flattening trees into vectors or using probabilistic filters to skip cold storage cuts DRAM round trips.

Managed runtimes add another layer: the garbage collector moves objects and compacts heaps, which can help or hurt locality depending on generational placement. A GC pause is not just "stop the world" — it is also a moment where cache hot sets go cold.

Multi-core coherence in one paragraph

Each core has private L1 (and usually L2) caches but shares L3 and DRAM. When core A writes a line core B holds, the coherence protocol invalidates or updates B's copy — invisible to single-threaded code, painful when threads share writable state. Read-mostly data scales; write-heavy shared counters do not unless you shard, batch, or use per-thread accumulators merged once. This is the hardware reason lock-free queues still care about line alignment.

Practical checklist

Measure with cache-miss counters before micro-optimizing Big-O.
Prefer sequential access patterns; avoid strided walks through large structures.
Pad per-thread counters to cache-line size to prevent false sharing.
Size working sets to fit L3 when possible; tile algorithms that do not.
Consider SoA layout for numeric kernels; keep hot fields dense.
Watch TLB misses on multi-GB heaps; huge pages help steady-state servers.