Explainer · 7 June 2026
How virtual memory and paging work
Your laptop might have 16 GB of RAM, yet a dozen browser tabs, a code editor,
and a game each believe they own gigabytes of contiguous address space — and
none of them can read another program's memory by accident. That illusion is
virtual memory: the operating system and CPU hardware cooperate
to map every process's virtual addresses to a smaller pool of
physical frames in DRAM. The core mechanism is
paging: splitting memory into fixed-size pages (typically
4 KiB on x86-64, sometimes 2 MiB or 1 GiB "huge pages") and translating
each access through multi-level page tables, with recent translations cached so that most accesses skip the walk. Understanding paging
explains mysterious slowdowns (page faults), why fork() can be
cheap, why swapping kills performance, and how a managed runtime's
garbage collector
sits on top of a layer that already reclaims physical RAM.
Virtual addresses vs physical frames
When your C program, Rust binary, or JVM reads *(ptr + 8), the
pointer is a virtual address — a number in the process's
private namespace, often 48 bits wide on modern 64-bit CPUs. The memory
management unit (MMU) inside the CPU translates that virtual
address to a physical frame number plus an offset within the
4 KiB page. Physical addresses refer to actual DRAM rows (or, in some cases,
memory-mapped device registers).
Isolation follows naturally: process A's virtual page 0x7fff_1000 might map
to physical frame 42, while process B's virtual page at the same numeric
address maps to frame 900 — or is marked not present until
the OS allocates backing storage. The kernel maintains a separate page table
root per process (stored in a CPU register like CR3 on x86).
Kernel memory typically lives in the upper half of the address space with
supervisor-only permissions, so user code cannot access it directly and must enter the kernel through a system call.
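To make the translation and isolation concrete, here is a minimal sketch of a toy single-level page table. The names `translate`, `PageFault`, `process_a`, and `process_b` are hypothetical, and a real MMU does this in hardware with multi-level tables. Only the frame numbers 42 and 900 come from the example above.

```python
PAGE_SIZE = 4096  # 4 KiB pages, as in the text


class PageFault(Exception):
    """Raised when a virtual page has no present mapping (toy model)."""


def translate(page_table: dict, vaddr: int) -> int:
    """Translate a virtual address with a toy single-level page table.

    page_table maps virtual page number -> physical frame number.
    """
    vpn, offset = divmod(vaddr, PAGE_SIZE)
    if vpn not in page_table:
        raise PageFault(f"virtual page {vpn:#x} is not present")
    return page_table[vpn] * PAGE_SIZE + offset


# Two hypothetical processes map the same virtual page to different frames.
VPN = 0x7FFF_1000 // PAGE_SIZE
process_a = {VPN: 42}
process_b = {VPN: 900}

vaddr = 0x7FFF_1000 + 8
print(hex(translate(process_a, vaddr)))  # prints 0x2a008  (frame 42, offset 8)
print(hex(translate(process_b, vaddr)))  # prints 0x384008 (frame 900, offset 8)
```

The same numeric pointer lands in different physical memory depending on which table is active, which is the whole isolation story in miniature.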
Page tables and the translation lookaside buffer
A naive design — one giant table indexing every virtual page — would be impossibly large. Instead, CPUs use multi-level page tables: on x86-64, four levels (PML4 → PDPT → PD → PT) form a sparse tree. A virtual address splits into index bits for each level plus a 12-bit page offset. If any level's entry is missing or marked not present, the MMU raises a page fault and the OS decides what to do.
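The address split can be sketched in a few lines, assuming the common 48-bit, four-level x86-64 layout with 9 index bits per level and a 12-bit offset. The function name `split_virtual_address` is made up for illustration.

```python
def split_virtual_address(vaddr: int) -> dict:
    """Split a 48-bit x86-64 virtual address into four 9-bit indexes and an offset."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,   # bits 39..47 index the top-level table
        "pdpt": (vaddr >> 30) & 0x1FF,   # bits 30..38
        "pd": (vaddr >> 21) & 0x1FF,     # bits 21..29
        "pt": (vaddr >> 12) & 0x1FF,     # bits 12..20 select the final entry
        "offset": vaddr & 0xFFF,         # bits 0..11 are the byte within the 4 KiB page
    }


# Build an address from known indexes and check that the split recovers them.
example = (1 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 5
print(split_virtual_address(example))
# prints {'pml4': 1, 'pdpt': 2, 'pd': 3, 'pt': 4, 'offset': 5}
```

Four 9-bit indexes plus the 12-bit offset account for all 48 bits, and each table holds 512 entries.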
Performing up to four extra memory reads on every access would be slow, so the MMU caches recent
translations in the translation lookaside buffer
(TLB), a small, highly associative cache of (virtual page → physical
frame, permissions) entries. TLB misses add latency; TLB shootdowns
happen when the kernel changes mappings on one CPU and must invalidate entries
on others — a hidden cost of munmap, large heap growth, or
aggressive JIT code generation. Huge pages (2 MiB / 1 GiB)
reduce TLB pressure for big contiguous allocations like database buffer pools,
at the cost of internal fragmentation.
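A rough way to see why huge pages relieve TLB pressure is to simulate a tiny TLB. This is a sketch under simplifying assumptions: a single fully associative LRU cache with 64 entries, which is not how any particular CPU is organized. The name `count_tlb_misses` is hypothetical.

```python
from collections import OrderedDict


def count_tlb_misses(addresses, page_size, tlb_entries=64):
    """Count misses in a toy fully associative LRU TLB."""
    tlb = OrderedDict()  # virtual page number -> True, least recently used first
    misses = 0
    for addr in addresses:
        vpn = addr // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)          # refresh LRU position on a hit
        else:
            misses += 1                   # a real MMU would walk the page tables here
            tlb[vpn] = True
            if len(tlb) > tlb_entries:
                tlb.popitem(last=False)   # evict the least recently used entry
    return misses


# Touch one byte per 4 KiB across a 64 MiB buffer, in two passes.
BUFFER = 64 * 1024 * 1024
accesses = [a for _ in range(2) for a in range(0, BUFFER, 4096)]

small = count_tlb_misses(accesses, page_size=4096)
huge = count_tlb_misses(accesses, page_size=2 * 1024 * 1024)
print(small, huge)  # prints 32768 32
```

With 4 KiB pages the buffer spans 16,384 pages, far more than the toy TLB holds, so every access misses. With 2 MiB pages it spans 32, which all fit after the first pass.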
Each page-table entry carries permission bits: present, read, write, execute, user vs supervisor, and sometimes "accessed" / "dirty" bits the OS uses for eviction policies. NX (no-execute) bits stop data pages from running as code — a mitigation against shellcode, though return-oriented programming sidesteps it. ASLR (address space layout randomization) places stacks, heaps, and libraries at unpredictable virtual bases so attackers cannot hard-code offsets — security layered on top of paging.
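As an illustration of those bits, the sketch below decodes a 64-bit x86-64 page-table entry using the architectural bit positions (present = bit 0, writable = bit 1, user = bit 2, accessed = bit 5, dirty = bit 6, no-execute = bit 63). The helper `decode_pte` and the sample entry are hypothetical.

```python
PTE_PRESENT = 1 << 0
PTE_WRITABLE = 1 << 1
PTE_USER = 1 << 2
PTE_ACCESSED = 1 << 5
PTE_DIRTY = 1 << 6
PTE_NX = 1 << 63
FRAME_MASK = 0x000F_FFFF_FFFF_F000  # bits 12..51 hold the physical frame number


def decode_pte(pte: int) -> dict:
    """Decode the flag bits and frame number of a 4 KiB page-table entry."""
    return {
        "present": bool(pte & PTE_PRESENT),
        "writable": bool(pte & PTE_WRITABLE),
        "user": bool(pte & PTE_USER),
        "accessed": bool(pte & PTE_ACCESSED),
        "dirty": bool(pte & PTE_DIRTY),
        "executable": not (pte & PTE_NX),  # NX set means data only
        "frame": (pte & FRAME_MASK) >> 12,
    }


# A user-mode, read-only, non-executable data page backed by frame 42.
sample = (42 << 12) | PTE_PRESENT | PTE_USER | PTE_NX
print(decode_pte(sample))
```

Note that x86-64 has no separate "read" bit: a present page is readable, and the write and no-execute bits narrow what else is allowed.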
Page faults: minor, major, and COW
A page fault is not always an error — it is the OS's hook into the MMU. Classify them by cost:
Minor (soft) faults — the page is already in RAM (perhaps zero-filled on first touch, or shared read-only code from a mapped library) but the page table entry was not wired yet. The kernel updates the PTE and resumes — microseconds.
Major (hard) faults — the page is not in RAM. The kernel blocks the thread, reads from disk (executable text from a binary, data from a memory-mapped file, or a page evicted earlier to swap), installs the frame, and resumes. Milliseconds — orders of magnitude slower than a cache hit.
Copy-on-write (COW) faults arise after fork(), when parent and
child share physical pages marked read-only. The first write by either process faults; the
kernel copies the page into a new frame and points the writer's page table entry at that private copy, while the other process keeps the original. Forking a large process
stays cheap until someone mutates, a classic lazy-copy pattern also
used by some string implementations and snapshotting databases.
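The copy-on-write case can be modelled in user space. The following is a toy simulation, not kernel code: `Frame`, `Process`, and the per-frame reference count are illustrative names, and a "write fault" is just a branch taken when the entry is read-only.

```python
class Frame:
    """A toy physical frame with a count of page-table entries pointing at it."""

    def __init__(self, data: bytes):
        self.data = bytearray(data)
        self.refs = 1


class Process:
    """A toy process whose page table maps vpn -> [frame, writable]."""

    def __init__(self):
        self.pages = {}

    def map_page(self, vpn: int, data: bytes = b"") -> None:
        self.pages[vpn] = [Frame(data), True]

    def fork(self) -> "Process":
        child = Process()
        for vpn, entry in self.pages.items():
            entry[1] = False                        # parent mapping becomes read-only
            entry[0].refs += 1
            child.pages[vpn] = [entry[0], False]    # child shares the same frame
        return child

    def read(self, vpn: int, index: int) -> int:
        return self.pages[vpn][0].data[index]

    def write(self, vpn: int, index: int, value: int) -> None:
        entry = self.pages[vpn]
        if not entry[1]:                            # simulated write fault
            if entry[0].refs > 1:                   # still shared: copy first
                entry[0].refs -= 1
                entry[0] = Frame(bytes(entry[0].data))
            entry[1] = True                         # sole owner: just allow writes
        entry[0].data[index] = value


parent = Process()
parent.map_page(0, b"hello")
child = parent.fork()
child.write(0, 0, ord("J"))
print(chr(parent.read(0, 0)), chr(child.read(0, 0)))  # prints h J
```

No bytes are copied at fork time. The copy happens only inside `write`, and only while the frame is still shared.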
Demand paging means the OS does not read your entire executable
into RAM at startup. It maps virtual pages as "not present" and faults them
in when the instruction pointer or a load/store first touches them. That is
why cold-start latency includes fault storms, and why madvise(MADV_WILLNEED)
or prefaulting can help latency-sensitive services.
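First-touch faulting can be observed from user space. The sketch below assumes a Unix-like system where Python's `resource` module reports minor faults. The exact count varies with kernel, page size, and huge-page settings, and some sandboxes do not report it at all, so treat the printed number as indicative only. The helper names are hypothetical.

```python
import mmap
import resource

PAGE = mmap.PAGESIZE
SIZE = 16 * 1024 * 1024  # reserve 16 MiB of anonymous memory


def minor_faults() -> int:
    """Minor (soft) page faults taken so far by this process."""
    return resource.getrusage(resource.RUSAGE_SELF).ru_minflt


def touch_every_page(buf) -> int:
    """Write one byte per page so each page must be backed by a frame."""
    pages = 0
    for offset in range(0, len(buf), PAGE):
        buf[offset] = 1
        pages += 1
    return pages


buf = mmap.mmap(-1, SIZE)   # the mapping exists, but pages are typically not backed yet
before = minor_faults()
touched = touch_every_page(buf)
after = minor_faults()
buf.close()

print(f"touched {touched} pages, minor faults grew by {after - before}")
```

Prefaulting is the same loop run deliberately at startup, so the faults are paid before latency-sensitive work begins.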
Swapping, eviction, and thrashing
When physical memory fills, the kernel evicts cold pages to swap space on disk (a partition or swap file). Evicted pages are marked not present; a later access triggers a major fault that reads the page back. When the swap device is slow or saturated (HDD seeks, a busy SSD), those faults stall everything. Thrashing is the catastrophic regime where the system spends more time moving pages than running useful work; the practical fixes are shrinking the working set, adding RAM, or killing memory hogs.
The kernel's page replacement policy (approximations of LRU
using accessed/dirty bits, clock algorithms, or working-set tuning) decides
which frames to reclaim. Anonymous heap pages (your malloc data)
differ from file-backed pages (shared libraries, mmap
of databases): clean file-backed pages can simply be dropped and re-read from disk without swap (dirty ones are written back to their file first),
while anonymous pages need swap backing if evicted.
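One of the policies named above, the clock (second-chance) algorithm, fits in a short simulation. This is a textbook-style sketch rather than any kernel's implementation, and `clock_replace` is a hypothetical name.

```python
def clock_replace(references, num_frames):
    """Simulate the clock (second-chance) policy and return the fault count."""
    frames = [None] * num_frames      # page currently held by each frame
    accessed = [False] * num_frames   # per-frame "accessed" bit
    hand = 0
    faults = 0
    for page in references:
        if page in frames:
            accessed[frames.index(page)] = True  # hardware would set this bit
            continue
        faults += 1
        # Sweep the hand, clearing accessed bits, until a frame with a clear bit is found.
        while accessed[hand]:
            accessed[hand] = False
            hand = (hand + 1) % num_frames
        frames[hand] = page           # evict whatever was here and load the new page
        accessed[hand] = True
        hand = (hand + 1) % num_frames
    return faults


print(clock_replace([1, 2, 3, 1, 4, 1, 2], num_frames=3))  # prints 6
```

The accessed bit gives each recently used page one extra lap before eviction, which approximates LRU without keeping exact timestamps.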
Container and cloud limits add another layer: cgroups on Linux cap memory per group, and when a group hits its limit and the kernel cannot reclaim enough pages, it OOM-kills a process in that group rather than swapping without bound. That is often preferable for multi-tenant hosts even if it looks harsh to the killed pod.
mmap, brk, and how runtimes use paging
Processes grow heap via brk/sbrk (contiguous arena above the
program break) or more commonly mmap for anonymous mappings and
file I/O. mmap maps a file's bytes directly into virtual address
space — reads and writes fault pages in from the filesystem cache, and the
kernel's page cache deduplicates RAM across processes mapping the same file.
Databases and search engines lean on this for index files larger than RAM.
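A minimal file-mapping sketch using Python's standard `mmap` module follows. The temporary file and its contents are made up for the example. Real index files would be far larger and are usually mapped read-only.

```python
import mmap
import os
import tempfile

# Create a small scratch file to stand in for an index file.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"index-data" * 1000)
    path = f.name

with open(path, "r+b") as f:
    with mmap.mmap(f.fileno(), 0) as view:   # length 0 maps the whole file
        header = bytes(view[:10])            # a plain memory read, served via the page cache
        view[0:5] = b"INDEX"                 # a store that dirties the mapped page
        view.flush()                         # ask the kernel to write dirty pages back

with open(path, "rb") as f:
    on_disk = f.read(10)

os.remove(path)
print(header, on_disk)  # prints b'index-data' b'INDEX-data'
```

There is no explicit read or write call on the mapped region: slicing the mapping is ordinary memory access, and the kernel moves pages between the file and RAM behind it.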
Managed runtimes stack another allocator on top: the JVM, Go runtime, or
JavaScript engine request large virtual arenas, commit pages on first write,
and recycle objects within user space. Physical RAM returns to the OS only when
the runtime returns pages (malloc_trim, GC heap shrinking) — a
common reason "RSS" stays high after a traffic spike even though your language
heap looks empty. The GC frees virtual objects; paging decides which
physical frames stay resident.
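To compare virtual size with resident size for a running process, one option on Linux is to read the `VmSize` and `VmRSS` fields of `/proc/self/status`. The sketch below assumes that file format; the function names are hypothetical, and `self_memory` returns `None` where `/proc` is unavailable.

```python
from pathlib import Path


def parse_status_kib(text: str) -> dict:
    """Extract VmSize and VmRSS (in KiB) from /proc/<pid>/status text."""
    wanted = {"VmSize", "VmRSS"}
    result = {}
    for line in text.splitlines():
        key, _, rest = line.partition(":")
        if key in wanted:
            result[key] = int(rest.split()[0])  # the value looks like "  123456 kB"
    return result


def self_memory():
    """Return this process's figures, or None where /proc is unavailable."""
    status = Path("/proc/self/status")
    if not status.exists():
        return None
    return parse_status_kib(status.read_text())


print(self_memory())
```

Sampling this before a traffic spike, at its peak, and after a GC cycle shows whether the runtime actually handed pages back or is merely holding empty arenas.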
Memory overcommit (common on Linux) lets processes reserve
more virtual memory than physical RAM + swap, betting not everyone touches
every page. fork() plus COW makes overcommit attractive but
dangerous: an allocation can succeed and still have no frame behind it, so under pressure the OOM killer may terminate a process
when pages are first touched, long after the allocation call returned. Understand this when sizing
servers.
How this connects to other systems ideas
Page tables are tree-structured indexes (radix trees keyed by address bits), loosely analogous to B-tree database indexes: sparse, hierarchical, traversed on access. The TLB is a cache in front of that index, analogous to a CPU L1 cache in front of DRAM. Hash tables solve a different problem (key lookup in software), but they make the same trade: extra memory and a layer of indirection in exchange for speed.
In blockchain nodes and indexers, large state databases are often memory-mapped; page faults become the dominant cold-start cost when replaying history. In browser engines, sandboxed tabs are separate processes precisely because paging gives hardware-enforced isolation per tab — a security boundary cheaper than pure software checks alone.
Practical checklist
- Watch major page fault rates (ps, /proc/vmstat, perf) when latency spikes; soft faults are normal, hard faults on hot paths are not.
- Size RAM for working set, not peak virtual size; VIRT in top is misleading, and RES/RSS is closer to physical footprint.
- Consider transparent huge pages for analytics workloads with big contiguous arrays; disable or tune if they cause latency jitter on fragmented heaps.
- After memory spikes, verify your runtime actually returns pages to the OS — paging will not shrink RSS if the allocator hoards arenas.
- In containers, set memory limits with headroom for page cache and stack; rely on OOM behavior you have tested, not hope.
- Pair OS-level paging awareness with language-level GC tuning — they reclaim at different layers.
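For the first checklist item, a system-wide major-fault rate can be derived from two snapshots of `/proc/vmstat`, which exposes cumulative `pgfault` and `pgmajfault` counters on Linux. The sketch below is one possible approach; `parse_vmstat`, `major_fault_rate`, and `sample_rate` are hypothetical helper names.

```python
import time
from pathlib import Path


def parse_vmstat(text: str) -> dict:
    """Parse 'name value' lines from /proc/vmstat into integer counters."""
    counters = {}
    for line in text.splitlines():
        name, _, value = line.partition(" ")
        if value.strip().isdigit():
            counters[name] = int(value)
    return counters


def major_fault_rate(before: str, after: str, seconds: float) -> float:
    """Major faults per second between two /proc/vmstat snapshots."""
    delta = parse_vmstat(after)["pgmajfault"] - parse_vmstat(before)["pgmajfault"]
    return delta / seconds


def sample_rate(interval: float = 1.0) -> float:
    """Read /proc/vmstat twice, `interval` seconds apart (Linux only)."""
    vmstat = Path("/proc/vmstat")
    first = vmstat.read_text()
    time.sleep(interval)
    return major_fault_rate(first, vmstat.read_text(), interval)


# Worked example on two made-up snapshots taken two seconds apart.
print(major_fault_rate("pgfault 1000\npgmajfault 10\n",
                       "pgfault 5000\npgmajfault 40\n", 2.0))  # prints 15.0
```

A sustained non-zero rate during a latency spike points at paging rather than CPU or lock contention.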