Explainer · 7 June 2026

How virtual memory and paging work

Your laptop might have 16 GB of RAM, yet a dozen browser tabs, a code editor, and a game each believe they own gigabytes of contiguous address space, and none of them can read another program's memory by accident. That illusion is virtual memory: the operating system and CPU hardware cooperate to map every process's virtual addresses onto a shared pool of physical frames in DRAM. The core mechanism is paging: memory is split into fixed-size pages (typically 4 KiB on x86-64, sometimes 2 MiB or 1 GiB "huge pages"), and each access is translated through multi-level page tables, with recent translations cached so that most accesses skip the table walk. Understanding paging explains mysterious slowdowns (page faults), why fork() can be cheap, why swapping kills performance, and how a managed runtime's garbage collector sits on top of a layer that already reclaims physical RAM.

Virtual addresses vs physical frames

When your C program, Rust binary, or JVM reads *(ptr + 8), the pointer is a virtual address — a number in the process's private namespace, often 48 bits wide on modern 64-bit CPUs. The memory management unit (MMU) inside the CPU translates that virtual address to a physical frame number plus an offset within the 4 KiB page. Physical addresses refer to actual DRAM rows (or, in some cases, memory-mapped device registers).
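The split into page number and offset can be sketched in a few lines. This is a minimal illustration, assuming 4 KiB pages and a flat dictionary standing in for the page table; the function names are hypothetical, and the frame numbers 42 and 900 are the ones used as examples in this article.

```python
PAGE_SIZE = 4096   # 4 KiB pages, as in the text
OFFSET_BITS = 12   # log2(PAGE_SIZE)


def split_virtual_address(vaddr: int) -> tuple[int, int]:
    """Return (virtual page number, offset within the page)."""
    return vaddr >> OFFSET_BITS, vaddr & (PAGE_SIZE - 1)


def translate(vaddr: int, page_table: dict[int, int]) -> int:
    """Translate a virtual address with a toy flat page table.

    page_table maps virtual page number -> physical frame number.
    A missing key raises KeyError, standing in for a page fault.
    """
    vpn, offset = split_virtual_address(vaddr)
    frame = page_table[vpn]
    return (frame << OFFSET_BITS) | offset


# Two processes use the same virtual address but different frames.
process_a = {0x7FFF1: 42}
process_b = {0x7FFF1: 900}

print(hex(translate(0x7FFF1008, process_a)))  # 0x2a008
print(hex(translate(0x7FFF1008, process_b)))  # 0x384008
```

The offset (8 here) passes through unchanged; only the page number is remapped, which is why the same pointer value can land in different DRAM locations for different processes.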

Isolation follows naturally: process A's virtual page 0x7fff_1000 might map to physical frame 42, while process B's virtual page at the same numeric address maps to frame 900, or is marked not present until the OS allocates backing storage. The kernel maintains a separate page table root per process and loads its physical address into a CPU register (CR3 on x86) when it switches to that process. Kernel memory typically lives in the upper half of the address space with supervisor-only permissions, so user code cannot touch it directly and must request kernel services through a syscall trap.

Page tables and the translation lookaside buffer

A naive design — one giant table indexing every virtual page — would be impossibly large. Instead, CPUs use multi-level page tables: on x86-64, four levels (PML4 → PDPT → PD → PT) form a sparse tree. A virtual address splits into index bits for each level plus a 12-bit page offset. If any level's entry is missing or marked not present, the MMU raises a page fault and the OS decides what to do.
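As a rough sketch of the four-level split, the code below slices a 48-bit x86-64 virtual address into four 9-bit indexes plus the 12-bit offset, then walks a toy tree of nested dictionaries. The nested-dict layout, the PageFault class, and the function names are illustrative assumptions; real page tables are arrays of 512 eight-byte entries in physical memory.

```python
class PageFault(Exception):
    """Raised by the toy walker when a level has no entry."""


def split_x86_64(vaddr: int) -> dict[str, int]:
    """Slice a virtual address into 9+9+9+9 index bits and a 12-bit offset."""
    return {
        "pml4": (vaddr >> 39) & 0x1FF,
        "pdpt": (vaddr >> 30) & 0x1FF,
        "pd": (vaddr >> 21) & 0x1FF,
        "pt": (vaddr >> 12) & 0x1FF,
        "offset": vaddr & 0xFFF,
    }


def walk(root: dict, vaddr: int) -> int:
    """Walk the toy tree; the value stored at the PT level is a frame number."""
    idx = split_x86_64(vaddr)
    node = root
    for level in ("pml4", "pdpt", "pd", "pt"):
        if idx[level] not in node:
            raise PageFault(f"no {level} entry for {vaddr:#x}")
        node = node[idx[level]]
    return (node << 12) | idx["offset"]


# Only the path that is actually mapped needs to exist in the tree.
root = {0: {1: {511: {497: 42}}}}
print(split_x86_64(0x7FFF1008))
print(hex(walk(root, 0x7FFF1008)))  # 0x2a008
```

The sparseness is visible in `root`: one mapped page costs four small nodes, and every unmapped region costs nothing.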

Performing up to four extra memory reads per access would be slow, so the MMU caches recent translations in the translation lookaside buffer (TLB), a small, highly associative cache of (virtual page → physical frame, permissions) entries. TLB misses add latency. TLB shootdowns happen when the kernel changes mappings on one CPU and must invalidate stale entries on others, a hidden cost of munmap, large heap growth, or aggressive JIT code generation. Huge pages (2 MiB / 1 GiB) reduce TLB pressure for big contiguous allocations like database buffer pools, at the cost of internal fragmentation.
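To make the huge-page effect concrete, here is a small simulation that counts TLB misses for the same access trace under two page sizes. It assumes a 64-entry TLB with LRU replacement; both the entry count and the replacement policy are simplifying assumptions, since real TLBs vary by CPU and are often multi-level.

```python
from collections import OrderedDict

KIB = 1024
MIB = 1024 * KIB


def count_tlb_misses(addresses, page_size: int, entries: int) -> int:
    """Count misses in a toy LRU TLB keyed by virtual page number."""
    tlb = OrderedDict()
    misses = 0
    for addr in addresses:
        vpn = addr // page_size
        if vpn in tlb:
            tlb.move_to_end(vpn)         # mark as most recently used
        else:
            misses += 1
            tlb[vpn] = True
            if len(tlb) > entries:
                tlb.popitem(last=False)  # evict the least recently used entry
    return misses


# Scan an 8 MiB buffer twice, touching one byte every 4 KiB.
trace = [off for _ in range(2) for off in range(0, 8 * MIB, 4 * KIB)]

small_page_misses = count_tlb_misses(trace, 4 * KIB, entries=64)
huge_page_misses = count_tlb_misses(trace, 2 * MIB, entries=64)
print(small_page_misses, huge_page_misses)  # 4096 4
```

With 4 KiB pages the scan touches 2048 distinct pages, far more than the toy TLB holds, so every access misses. With 2 MiB pages the whole buffer is covered by 4 entries.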

Each page-table entry carries permission bits: present, read, write, execute, user vs supervisor, and sometimes "accessed" / "dirty" bits the OS uses for eviction policies. NX (no-execute) bits stop data pages from running as code — a mitigation against shellcode, though return-oriented programming sidesteps it. ASLR (address space layout randomization) places stacks, heaps, and libraries at unpredictable virtual bases so attackers cannot hard-code offsets — security layered on top of paging.
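The permission bits are easy to picture as a bit field. The sketch below decodes a 64-bit x86-64 page-table entry for a 4 KiB page using the architectural bit positions (present = bit 0, writable = bit 1, user = bit 2, accessed = bit 5, dirty = bit 6, no-execute = bit 63). The function name and the sample entry are made up for illustration, and other flag bits are ignored.

```python
FLAG_BITS = {
    "present": 0,
    "writable": 1,
    "user": 2,
    "accessed": 5,
    "dirty": 6,
    "no_execute": 63,
}


def decode_pte(pte: int) -> tuple[int, dict[str, bool]]:
    """Return (frame number, flags) for a 4 KiB-page PTE."""
    flags = {name: bool((pte >> bit) & 1) for name, bit in FLAG_BITS.items()}
    frame = (pte >> 12) & ((1 << 40) - 1)  # bits 12..51 hold the frame number
    return frame, flags


# A user-accessible, writable data page in frame 42 that must not be executed.
sample_pte = (1 << 63) | (42 << 12) | 0b111
frame, flags = decode_pte(sample_pte)
print(frame, flags)
```

A data page like this one has no_execute set, so a jump into it faults instead of running attacker-supplied bytes.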

Page faults: minor, major, and COW

A page fault is not always an error — it is the OS's hook into the MMU. Classify them by cost:

Minor (soft) faults: no disk I/O is needed. Either the data is already in RAM (for example, shared read-only code from a mapped library) or the kernel only has to supply a zero-filled frame on first touch, but the page table entry was not set up yet. The kernel updates the PTE and resumes the thread, typically within microseconds.

Major (hard) faults: the page is not in RAM. The kernel blocks the thread, reads from disk (executable text from a binary, data from a memory-mapped file, or a page evicted earlier to swap), installs the frame, and resumes. The cost depends on the storage device and can reach milliseconds on a hard disk, orders of magnitude slower than an access that hits in RAM.

Copy-on-write (COW): after fork(), parent and child share physical pages marked read-only. The first write by either process faults; the kernel copies the page into a fresh frame and points the writer's page table entry at that private, writable copy, while the other process keeps the original. Forking a large process stays cheap until someone mutates, a classic lazy allocation pattern also used by some string implementations and snapshotting databases.
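The copy-on-write bookkeeping can be modeled with reference counts on frames. This is a toy model, not how any particular kernel structures its code: the class names, the dictionary-based page table, and the policy of skipping the copy when the writer is the last owner are all illustrative assumptions.

```python
class PhysicalMemory:
    """Toy frame allocator with a reference count per frame."""

    def __init__(self):
        self.frames = {}    # frame number -> bytearray contents
        self.refcount = {}  # frame number -> how many page tables point at it
        self._next = 0

    def alloc(self, data=b"") -> int:
        frame = self._next
        self._next += 1
        self.frames[frame] = bytearray(data)
        self.refcount[frame] = 1
        return frame


class Process:
    def __init__(self, mem: PhysicalMemory):
        self.mem = mem
        self.table = {}  # virtual page number -> [frame, writable]

    def map_page(self, vpn: int, data: bytes) -> None:
        self.table[vpn] = [self.mem.alloc(data), True]

    def fork(self) -> "Process":
        """Share every frame with the child and mark both sides read-only."""
        child = Process(self.mem)
        for vpn, entry in self.table.items():
            entry[1] = False
            child.table[vpn] = [entry[0], False]
            self.mem.refcount[entry[0]] += 1
        return child

    def read(self, vpn: int, offset: int) -> int:
        return self.mem.frames[self.table[vpn][0]][offset]

    def write(self, vpn: int, offset: int, value: int) -> None:
        entry = self.table[vpn]
        if not entry[1]:  # simulated write fault on a read-only page
            old = entry[0]
            if self.mem.refcount[old] > 1:
                # Still shared: copy into a fresh frame for the writer only.
                entry[0] = self.mem.alloc(self.mem.frames[old])
                self.mem.refcount[old] -= 1
            entry[1] = True  # the writer now owns a private, writable frame
        self.mem.frames[entry[0]][offset] = value


mem = PhysicalMemory()
parent = Process(mem)
parent.map_page(0, b"hello")
child = parent.fork()
print(len(mem.frames))                 # 1: fork copied nothing
child.write(0, 0, ord("j"))
print(len(mem.frames))                 # 2: the first write made a copy
print(chr(parent.read(0, 0)), chr(child.read(0, 0)))  # h j
```

Note that fork touches only page table entries and counters; the page contents are copied later, one page at a time, and only for pages that are actually written.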

Demand paging means the OS does not read your entire executable into RAM at startup. It maps virtual pages as "not present" and faults them in when the instruction pointer or a load/store first touches them. That is why cold-start latency includes fault storms, and why madvise(MADV_WILLNEED) or prefaulting can help latency-sensitive services.
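Demand paging can be observed from user space. The sketch below creates an anonymous mapping, touches one byte per page, and reports how much the process's minor-fault counter moved. It assumes a Unix-like system (the resource module is not available on Windows); the function name is hypothetical, and the exact count depends on the kernel, for example transparent huge pages can make it much smaller than the number of pages touched.

```python
import mmap
import resource


def touch_and_count_faults(n_pages: int) -> int:
    """Touch n_pages of a fresh anonymous mapping; return the minor-fault delta."""
    page = mmap.PAGESIZE
    region = mmap.mmap(-1, n_pages * page)  # reserved, but no page touched yet
    before = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    for i in range(n_pages):
        region[i * page] = 1                # first touch of each page
    after = resource.getrusage(resource.RUSAGE_SELF).ru_minflt
    region.close()
    return after - before


print("minor faults while touching 1024 pages:", touch_and_count_faults(1024))
```

Creating the mapping is nearly free; the cost shows up at first touch, which is the fault storm that prefaulting tries to move out of the latency-sensitive path.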

Swapping, eviction, and thrashing

When physical memory fills, the kernel evicts cold pages to swap space on disk (a partition or swap file). Evicted pages are marked not present; a later access triggers a major fault that reads the page back, and if the swap device is slow (HDD seeks) or already saturated with I/O, every faulting thread stalls behind it. Thrashing is the catastrophic regime where the system spends more time moving pages than running useful work; the practical fixes are to shrink the combined working set (kill or throttle memory hogs, run fewer jobs at once) or to add RAM.

The kernel's page replacement policy (approximations of LRU using accessed/dirty bits, clock algorithms, or working-set tuning) decides which frames to reclaim. Anonymous heap pages (your malloc data) differ from file-backed pages (shared libraries, mmap of databases) — the latter can be dropped and re-read from disk without swap, while anonymous pages need swap backing if evicted.
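A clock (second-chance) policy is short enough to simulate. The sketch below is a textbook-style version, assuming a single referenced bit per frame and one global hand; production kernels use more elaborate variants, and the function name is made up for this example.

```python
def clock_replace(reference_string, n_frames: int) -> int:
    """Simulate clock page replacement and return the number of page faults."""
    frames = [None] * n_frames      # which page each frame holds
    referenced = [False] * n_frames
    hand = 0
    faults = 0
    for page in reference_string:
        if page in frames:
            referenced[frames.index(page)] = True  # hit: set the accessed bit
            continue
        faults += 1
        # Sweep: clear accessed bits until a frame without one is found.
        while referenced[hand]:
            referenced[hand] = False
            hand = (hand + 1) % n_frames
        frames[hand] = page
        referenced[hand] = True
        hand = (hand + 1) % n_frames
    return faults


loop = [0, 1, 2, 3] * 3          # working set of 4 pages, visited 3 times
print(clock_replace(loop, 4))    # 4: everything fits after the first pass
print(clock_replace(loop, 3))    # 12: one frame short, every access faults
```

The second result is thrashing in miniature: being one frame short of the working set turns every access into a fault.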

Container and cloud limits add another layer: cgroups on Linux cap memory; exceeding the limit triggers OOM kill of a process rather than unbounded swapping — often preferable for multi-tenant hosts even if it looks harsh to the killed pod.

mmap, brk, and how runtimes use paging

Processes grow their heap via brk/sbrk (a contiguous arena above the program break) or, more commonly, via mmap, which also serves anonymous mappings and file I/O. mmap maps a file's bytes directly into virtual address space: loads and stores fault pages in through the kernel's page cache, and processes mapping the same file share those cached frames instead of each holding a private copy. Databases and search engines lean on this for index files larger than RAM.
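A file-backed mapping looks like this from Python's standard mmap module. It is a minimal sketch: the temporary file, the payload, and the function name are illustrative, and the mapping uses the module's default shared, read-write access.

```python
import mmap
import os
import tempfile


def mmap_roundtrip(payload: bytes) -> tuple[bytes, bytes]:
    """Write payload to a temp file, edit it through a mapping, return both views."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name
    try:
        with open(path, "r+b") as f, mmap.mmap(f.fileno(), 0) as view:
            view[0:5] = b"HELLO"        # a plain store into mapped memory
            view.flush()                # ask the kernel to write dirty pages back
            through_mapping = bytes(view[:])
        with open(path, "rb") as f:
            through_read = f.read()     # ordinary read() sees the same bytes
        return through_mapping, through_read
    finally:
        os.unlink(path)


print(mmap_roundtrip(b"hello, paging"))
```

No read() or write() call moves the edited bytes; the store goes to a page-cache frame that the kernel later writes back to the file.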

Managed runtimes stack another allocator on top: the JVM, Go runtime, or JavaScript engine request large virtual arenas, commit pages on first write, and recycle objects within user space. Physical RAM returns to the OS only when the runtime returns pages (malloc_trim, GC heap shrinking) — a common reason "RSS" stays high after a traffic spike even though your language heap looks empty. The GC frees virtual objects; paging decides which physical frames stay resident.

Memory overcommit (common on Linux) lets processes reserve more virtual memory than physical RAM + swap, betting not everyone touches every page. fork() plus COW makes overcommit attractive but dangerous: an allocation that succeeded is only a promise, and when processes later touch more pages than the machine can back, the OOM killer terminates a process chosen by the kernel's heuristics, not necessarily the one that caused the shortage. Understand this when sizing servers.

How this connects to other systems ideas

Page tables are sparse, hierarchical indexes traversed on translation. Structurally they are radix trees keyed by slices of the virtual address, and they play a role similar to B-tree indexes in a database. The TLB is a cache in front of that index, analogous to a CPU L1 cache in front of DRAM. Hash tables solve a different problem (key lookup in software), but like page tables they spend memory on an indirection layer to buy lookup speed.

In blockchain nodes and indexers, large state databases are often memory-mapped; page faults become the dominant cold-start cost when replaying history. In browser engines, sandboxed tabs are separate processes precisely because paging gives hardware-enforced isolation per tab — a security boundary cheaper than pure software checks alone.

Practical checklist

Watch major page fault rates (ps, /proc/vmstat, perf) when latency spikes — soft faults are normal; hard faults on hot paths are not.
Size RAM for working set, not peak virtual size — VIRT in top is misleading; RES/RSS is closer to physical footprint.
Consider transparent huge pages for analytics workloads with big contiguous arrays; disable or tune if they cause latency jitter on fragmented heaps.
After memory spikes, verify your runtime actually returns pages to the OS — paging will not shrink RSS if the allocator hoards arenas.
In containers, set memory limits with headroom for page cache and stack; rely on OOM behavior you have tested, not hope.
Pair OS-level paging awareness with language-level GC tuning — they reclaim at different layers.
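For the first checklist item, a major-fault rate can be derived from two samples of /proc/vmstat on Linux. The sketch below assumes the file's "name value" line format and its pgmajfault counter; the function names are hypothetical, and the sample text in the usage lines is synthetic, not output from a real machine.

```python
import time


def parse_vmstat(text: str) -> dict[str, int]:
    """Parse 'name value' lines as found in /proc/vmstat."""
    counters = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            counters[parts[0]] = int(parts[1])
    return counters


def major_fault_rate(before: dict, after: dict, seconds: float) -> float:
    """Major faults per second between two samples."""
    return (after["pgmajfault"] - before["pgmajfault"]) / seconds


def sample_major_fault_rate(interval: float = 1.0, path: str = "/proc/vmstat") -> float:
    """Read the counters twice, interval seconds apart (Linux only)."""
    with open(path) as f:
        first = parse_vmstat(f.read())
    time.sleep(interval)
    with open(path) as f:
        second = parse_vmstat(f.read())
    return major_fault_rate(first, second, interval)


# Synthetic samples to show the arithmetic.
t0 = parse_vmstat("pgfault 1000\npgmajfault 10\n")
t1 = parse_vmstat("pgfault 5000\npgmajfault 40\n")
print(major_fault_rate(t0, t1, seconds=2.0))  # 15.0
```

A sustained nonzero rate during a latency spike points at disk-backed faults on the hot path rather than at CPU or lock contention.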