Explainer · 7 June 2026

How memory allocators work: malloc, free lists, and jemalloc

When a C program calls malloc(64), Rust's allocator API requests a buffer, or a JVM allocates a new object, something must carve usable blocks out of the process's address space. That job belongs to a memory allocator — user-space code that sits between your language runtime and the operating system's virtual memory layer. The allocator's goals are deceptively simple: hand out memory quickly, reuse freed blocks instead of asking the kernel every time, and keep fragmentation low enough that a long-running server does not slowly bloat. In practice, allocators are among the most performance-critical libraries on the machine — a bad one shows up as lock contention, cache misses, and mysterious RSS that never drops after a traffic spike.

Stack, heap, and when the OS gets involved

Not every byte your program touches goes through malloc. Function locals and call frames live on the stack — a contiguous region that grows and shrinks with calls, freed automatically when a function returns. The stack is fast (just bump a pointer) but small and LIFO-only; you cannot return a pointer to a stack variable and expect it to survive.

Dynamic lifetimes — strings that outlive a function, graph nodes, request buffers — need the heap. On Linux, growing the heap historically used the brk syscall to extend the data segment; modern allocators also call mmap to map anonymous pages for large allocations or separate arenas. Those syscalls acquire virtual pages from the kernel; physical RAM is charged later on first write (demand paging). Very large allocations (often 128 KiB or more, depending on the libc) bypass the main heap entirely and get their own mmap region, returned to the OS on free.

Inside glibc malloc: chunks, bins, and free lists

GNU libc's malloc (derived from ptmalloc) organizes the heap as a sequence of chunks. Each chunk has metadata — size, flags for whether it is in use, and pointers linking it into free lists when available. Small requests are rounded up to size classes (8-byte steps for tiny sizes, larger steps for bigger blocks) so that a freed 40-byte object can satisfy a future 36-byte request without splitting.

Freed small chunks go into fastbins (single-linked LIFO lists for very small sizes) or unsorted / small / large bins sorted by size for coalescing neighbors. When you free adjacent free chunks, the allocator coalesces them into one larger block — critical for fighting external fragmentation, where plenty of free bytes exist but none in a contiguous block large enough for the next request. If no bin fits, malloc may split a larger free chunk or extend the heap via brk/mmap.

Internal fragmentation is different: rounding a 17-byte request to a 32-byte size class wastes 15 bytes inside the allocated block. Allocators accept some internal waste to keep bookkeeping simple and lookups fast.

Thread contention and why jemalloc exists

A single global heap lock was fine in 1995. On a 64-core server running thousands of concurrent requests, every malloc fighting for one mutex becomes a bottleneck. Modern allocators partition memory into arenas — each thread prefers its own arena or rotates among a few, so most allocations avoid cross-core locking.

jemalloc (used by Firefox, Rust's default on many platforms, and optional in Redis/MariaDB) names regions explicitly: arena → chunk → page run → region. Thread-local caches (tcache) satisfy hot allocate/free pairs without touching arena locks at all; refills happen in batches. tcmalloc (Google's allocator) uses a similar central heap plus per-thread cache design. Both target multi-threaded servers where allocation rate, not peak size, dominates.

Choosing an allocator is a deployment decision: link LD_PRELOAD=libjemalloc.so on a C++ service, set Rust's global allocator, or compile Node with --with-jemalloc. Profiling with malloc_stats, jemalloc's prof mode, or heaptrack reveals whether contention or fragmentation is the real enemy.

Performance: what allocators optimize for

Allocator design is a bundle of trade-offs. Throughput favors thread caches and size-class segregation. Memory efficiency favors aggressive coalescing and returning pages to the OS — often slower. Latency predictability matters for games and trading systems: avoiding pathological splits and huge-arena growth spikes can matter more than average speed.

Data layout interacts with hardware: objects allocated together often land in adjacent addresses, which can improve spatial locality — or cause false sharing if two hot counters sit in the same cache line. Pool and slab allocators (fixed-size object pools used inside kernels and databases) eliminate per-object size lookup entirely: every slot is the same width, so free lists are a simple pop/push.

Languages with garbage collection still depend on allocators underneath — the GC allocates new objects via the same heap (or a dedicated nursery region). A generational collector's nursery is essentially a bump-pointer arena reset each young collection; see how GC algorithms work for when collection runs vs when raw malloc pressure returns pages.

Why RSS stays high after you free

This surprises operators constantly: your service processes a spike, frees everything, yet RES in top barely moves. Frees return chunks to allocator free lists, not immediately to the kernel. Holding cached free memory makes the next spike cheaper. Allocators eventually call madvise(MADV_DONTNEED) or similar to drop physical pages, but thresholds are conservative — and fragmented heaps may have no whole page that is entirely empty.

Container memory limits (cgroup v2) count RSS toward the limit; an allocator hoarding arenas can trigger OOM kills even when your application logic believes it released objects. Tools like jemalloc's background_thread for purging, or explicit malloc_trim on glibc (use sparingly), can help — but the durable fix is usually reducing allocation churn (reuse buffers, object pools) rather than fighting the allocator after the fact.

Practical checklist

Profile before swapping allocators — measure lock time, allocation rate, and peak RSS, not folklore.
Watch allocation churn in hot paths; fewer allocations beat a faster allocator.
For long-lived servers, test under sustained load: fragmentation shows up over hours, not in a 30-second benchmark.
Large transient buffers (>128 KiB) often use direct mmap; ensure you actually free them or leak whole mappings.
In containers, set memory limits with headroom for allocator caches and page cache; see cgroups and namespaces.
Pair allocator tuning with OS-level awareness from virtual memory and paging — they reclaim at different layers.