
LLM PagedAttention explained

Harbor Analytics rolled out continuous batching on vLLM and immediately hit a different wall: not compute, but memory. At month-end close, forty concurrent analysts mixed 400-token SQL questions with 18,000-token RAG packets. The scheduler accepted every request, then CUDA OOM killed 12% of sessions because each chat reserved a contiguous KV cache buffer sized for max_model_len, even when most sequences used a fraction of that ceiling. Free VRAM sat in unusable holes between finished and active requests. Enabling PagedAttention (block-based KV storage with a logical-to-physical block table, the same idea as operating-system virtual memory pages) dropped the OOM rate below 0.3% and let the same A100 run 2.4× more concurrent sequences without adding GPUs.

PagedAttention is the memory-management layer that makes modern LLM serving practical. While continuous batching decides which sequences share a forward pass, PagedAttention decides where each sequence’s K and V tensors live in GPU memory. This guide explains the fragmentation problem, how fixed blocks and block tables work, pool sizing and vLLM knobs, pairing with batching and chunked prefill, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

Why contiguous KV allocation fails

Autoregressive transformers cache key and value tensors for every token already processed. Memory grows linearly with sequence length and with the number of concurrent requests on a GPU. A naive server pre-allocates one contiguous tensor per request sized for the model’s maximum context — say 32,768 tokens — even if the user sends 200 tokens and generates 50.

Three problems follow:

  • Internal fragmentation — reserved but unused slots inside each buffer cannot be given to another chat.
  • External fragmentation — when a request finishes, it frees a hole that may be too small for the next long request unless tensors are moved (expensive on GPU).
  • Admission pessimism — schedulers reject new work when the sum of reserved max lengths exceeds VRAM, even when actual usage would fit.

The result is GPUs that show 60–70% memory utilization in monitoring tools but cannot accept another 2K-token chat because no single contiguous slab remains. Compute sits idle while the API returns 503 or OOM errors.
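To put rough numbers on the internal fragmentation described above, the following minimal sketch compares reserved and used token slots under per-request max-length reservation. The request mix is a made-up example, and the 32,768-token ceiling is the figure used earlier in this section.

```python
# Hypothetical request mix: (prompt_tokens, generated_tokens) per chat.
MAX_MODEL_LEN = 32_768  # per-request reservation in the naive contiguous scheme


def reservation_waste(requests, max_model_len=MAX_MODEL_LEN):
    """Return (reserved_tokens, used_tokens, wasted_fraction) for contiguous buffers."""
    reserved = len(requests) * max_model_len
    used = sum(prompt + generated for prompt, generated in requests)
    return reserved, used, 1 - used / reserved


requests = [(200, 50), (400, 120), (18_000, 600)]
reserved, used, wasted = reservation_waste(requests)
print(f"reserved={reserved} used={used} wasted={wasted:.1%}")
```

Even with one long RAG request in the mix, most of the reserved slots in this example are never written.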

Blocks, block tables, and the memory pool

PagedAttention divides KV cache into fixed-size blocks — commonly 16 tokens per block in vLLM, though engines expose this as a tunable parameter. All blocks are the same byte size, determined by layer count, KV head count, head dimension, and precision (FP16, FP8, etc.). The GPU holds a single block pool: a pre-allocated arena of N physical blocks shared by every active sequence.

Each request maintains a block table: an array mapping logical block index (0, 1, 2, … for token ranges 0–15, 16–31, …) to a physical block ID in the pool. Attention kernels index through the table so tokens in non-contiguous physical blocks still attend correctly. When a sequence grows past its current allocation, the scheduler allocates one more free physical block and appends an entry to the table — no copying of prior KV data.
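The pool and block table can be sketched in a few lines of plain Python. This is a toy model under stated assumptions, not vLLM's implementation: the class names `BlockPool` and `Sequence` are hypothetical, and integer IDs stand in for slabs of GPU memory.

```python
class BlockPool:
    """Toy pool of fixed-size physical blocks; IDs stand in for GPU memory slots."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free_ids = list(range(num_blocks))  # free list of physical block IDs

    def allocate(self):
        if not self.free_ids:
            raise MemoryError("block pool exhausted")
        return self.free_ids.pop()

    def release(self, block_ids):
        self.free_ids.extend(block_ids)


class Sequence:
    """Tracks token count and the logical-to-physical block table."""

    def __init__(self, pool):
        self.pool = pool
        self.num_tokens = 0
        self.block_table = []  # index = logical block, value = physical block ID

    def append_token(self):
        # A new physical block is needed only when every current block is full.
        if self.num_tokens == len(self.block_table) * self.pool.block_size:
            self.block_table.append(self.pool.allocate())
        self.num_tokens += 1

    def locate(self, token_index):
        """Map a token position to (physical_block_id, offset_in_block)."""
        logical, offset = divmod(token_index, self.pool.block_size)
        return self.block_table[logical], offset

    def release_all(self):
        """Return every block to the pool; no data is copied or compacted."""
        self.pool.release(self.block_table)
        self.block_table = []
        self.num_tokens = 0


pool = BlockPool(num_blocks=8)
seq = Sequence(pool)
for _ in range(40):
    seq.append_token()
print(seq.block_table, seq.locate(17))
```

Growing the sequence only appends to `block_table`; earlier entries never move, which is the property that avoids copying prior KV data.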

Lifecycle of a sequence

  1. Admission — scheduler checks free block count against estimated blocks needed (prompt length plus generation budget).
  2. Prefill — blocks fill as prompt tokens are processed; a partially filled last block belongs to that sequence alone, so its free slots can only be used by that sequence's later tokens.
  3. Decode — each new token may extend into a new block when the current block fills.
  4. Completion or cancel — all physical blocks in the table return to the free list immediately; no defragmentation pass required.

Because blocks are uniform, a finished 500-token chat releases perhaps 32 blocks that can serve portions of a new 20,000-token agent job within the same pool. Utilization tracks actual tokens stored, not worst-case reservations.
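The block counts in that example follow from a ceiling division. A small sketch, assuming the 16-token block size mentioned earlier:

```python
import math

BLOCK_SIZE = 16  # tokens per block, the common value cited above


def blocks_needed(num_tokens, block_size=BLOCK_SIZE):
    """Number of whole blocks required to hold num_tokens."""
    return math.ceil(num_tokens / block_size)


released = blocks_needed(500)        # blocks freed by the finished chat
agent_job = blocks_needed(20_000)    # blocks the long job needs in total
unused_slots = released * BLOCK_SIZE - 500  # waste confined to the last block
print(released, agent_job, unused_slots)
```

The only waste for the 500-token chat is the handful of empty slots in its final block, in contrast to tens of thousands of reserved slots under max-length allocation.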

How PagedAttention pairs with continuous batching

Continuous batching inserts and removes sequences every forward pass. Without paged KV storage, removing one sequence from a batched contiguous buffer requires either padding waste or expensive compaction. PagedAttention lets each sequence in a batch point to its own scatter of physical blocks; the batch kernel gathers KV via per-sequence block tables.

The pairing is why vLLM, SGLang, and TensorRT-LLM ship both features together:

  • Scheduler picks which sequence IDs run this iteration.
  • Block manager ensures each ID has enough free blocks for the next token(s).
  • Attention backend reads block tables to fetch K/V for all IDs in one kernel launch.
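The gather step in the last bullet can be illustrated with a pure-Python stand-in. Real engines do this inside a GPU attention kernel; here strings play the role of per-token K/V tensors, and the pool layout, sequence IDs, and block size of 4 are all made up for readability.

```python
BLOCK_SIZE = 4  # small for readability

# Physical pool: block ID -> per-token KV entries (strings stand in for tensors).
pool = {
    0: ["b0", "b1", "b2", "b3"],
    1: ["a4", "a5"],
    2: ["a0", "a1", "a2", "a3"],
    3: ["b4"],
}

# Per-sequence block tables: logical block order -> physical block IDs.
block_tables = {"A": [2, 1], "B": [0, 3]}
seq_lens = {"A": 6, "B": 5}


def gather_kv(seq_id):
    """Walk the block table in logical order and collect that sequence's KV entries."""
    out = []
    for physical in block_tables[seq_id]:
        out.extend(pool[physical])
    return out[: seq_lens[seq_id]]


batch = ["A", "B"]  # sequence IDs the scheduler picked for this iteration
gathered = {seq_id: gather_kv(seq_id) for seq_id in batch}
print(gathered)
```

Although the two sequences are interleaved in physical memory, each one reads back its tokens in the correct logical order.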

When chunked prefill processes only part of a long prompt per step, blocks accumulate incrementally — the sequence does not need its full prompt length reserved upfront. That further improves admission accuracy under mixed short-and-long traffic.
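A rough sketch of how block holdings grow under chunked prefill follows. The 4,096-token chunk size is an assumed value for illustration, and the function name is hypothetical.

```python
import math


def blocks_after_each_chunk(prompt_tokens, chunk_size, block_size=16):
    """Yield the cumulative blocks held after each prefill chunk is processed."""
    processed = 0
    while processed < prompt_tokens:
        processed = min(processed + chunk_size, prompt_tokens)
        yield math.ceil(processed / block_size)


steps = list(blocks_after_each_chunk(prompt_tokens=18_000, chunk_size=4_096))
print(steps)
```

The sequence reaches its full footprint only on the final chunk, so blocks it has not yet claimed remain available to other requests in the meantime.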

Sizing the block pool and vLLM knobs

Total KV capacity is approximately:

num_blocks × block_size × bytes_per_token_per_layer × num_layers

vLLM derives num_blocks from gpu_memory_utilization (the fraction of device VRAM the engine may use for weights, activations, and the KV pool combined) and from the per-block byte cost: whatever remains of that budget after the model loads becomes KV blocks. Raising utilization grows the pool but leaves less headroom for other CUDA allocations and activation spikes.
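As a back-of-the-envelope version of that calculation, the sketch below expands bytes_per_token_per_layer as 2 (K and V) × KV heads × head dimension × bytes per element. The model shape, the weight size, and the activation reserve are hypothetical numbers, and real engines measure these at startup instead of taking them as inputs.

```python
GIB = 1024**3


def kv_bytes_per_block(num_layers, num_kv_heads, head_dim, dtype_bytes, block_size=16):
    """Bytes one physical block occupies across all layers."""
    # Factor 2 covers the separate K and V tensors stored for every token.
    bytes_per_token_per_layer = 2 * num_kv_heads * head_dim * dtype_bytes
    return block_size * bytes_per_token_per_layer * num_layers


def estimate_num_blocks(total_bytes, gpu_memory_utilization, weight_bytes,
                        activation_bytes, bytes_per_block):
    """Simplified estimate: budget minus weights and activations, divided into blocks."""
    budget = int(total_bytes * gpu_memory_utilization)
    kv_budget = budget - weight_bytes - activation_bytes
    return max(kv_budget, 0) // bytes_per_block


# Hypothetical model: 32 layers, 8 KV heads (GQA), head_dim 128, FP16 KV cache.
per_block = kv_bytes_per_block(num_layers=32, num_kv_heads=8, head_dim=128, dtype_bytes=2)
# Hypothetical 80 GiB device with 16 GiB of weights and 4 GiB reserved for activations.
num_blocks = estimate_num_blocks(80 * GIB, 0.90, 16 * GIB, 4 * GIB, per_block)
print(per_block, num_blocks, num_blocks * 16)
```

Halving dtype_bytes (for example with an FP8 KV cache) halves the per-block cost and roughly doubles the block count for the same budget.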

| Knob | Effect when raised | Risk |
| --- | --- | --- |
| gpu_memory_utilization | More physical blocks, higher concurrency | CUDA OOM on spikes; less room for graphs and temp buffers |
| block_size | Fewer table entries per long sequence | Internal waste for very short chats (whole block for 3 tokens) |
| max_num_seqs | More parallel sequences scheduled | Exhausts block pool even when compute headroom remains |
| KV cache dtype (FP8) | Smaller blocks, more sequences per GB | Quality regression on long contexts; validate on eval sets |
| max_model_len | Allows longer single requests | Admission checks assume longer worst case per sequence |
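For orientation, these knobs correspond to vLLM engine arguments of the same names. The values below are illustrative assumptions, not recommendations, the model name is a placeholder, and accepted values can differ between vLLM versions.

```python
# Illustrative values only; tune against your own traffic and eval sets.
engine_kwargs = {
    "model": "your-org/your-model",   # placeholder, not a real checkpoint
    "gpu_memory_utilization": 0.90,   # fraction of VRAM the engine may use
    "block_size": 16,                 # tokens per KV block
    "max_num_seqs": 64,               # cap on sequences scheduled per iteration
    "kv_cache_dtype": "fp8",          # validate quality before enabling
    "max_model_len": 32_768,          # longest single request allowed
}

# With vLLM installed, these would typically be passed as keyword arguments:
#   from vllm import LLM
#   llm = LLM(**engine_kwargs)
print(engine_kwargs)
```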

Monitor block utilization (allocated blocks / total blocks) alongside GPU memory percent. A pool at 95% block utilization with 50% compute utilization means you are memory-bound on KV — tune pool size, precision, or max_num_seqs before buying another GPU. See inference SLO and capacity planning for tying these metrics to TTFT and throughput targets.
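A minimal sketch of that diagnosis is shown here. The function names and both thresholds are assumptions chosen for illustration; how the block counts and compute utilization are collected depends on your engine and monitoring stack.

```python
def block_utilization(allocated_blocks, total_blocks):
    """Fraction of the KV block pool currently holding tokens."""
    return allocated_blocks / total_blocks


def is_kv_memory_bound(allocated_blocks, total_blocks, compute_util,
                       block_high=0.90, compute_low=0.60):
    """True when the pool is nearly full while compute still has headroom.

    block_high and compute_low are hypothetical thresholds, not engine defaults.
    """
    return (block_utilization(allocated_blocks, total_blocks) >= block_high
            and compute_util <= compute_low)


# The case from the text: 95% block utilization with 50% compute utilization.
print(is_kv_memory_bound(9_500, 10_000, compute_util=0.50))
```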

Alternatives and when they still matter

Prefix caching and block sharing

When many requests share an identical prompt prefix — system instructions, tool schemas, a fixed RAG document — engines can reference-count physical blocks across sequences instead of duplicating KV. This is distinct from API-level prompt caching but solves the same economics at the serving layer. Block-level sharing requires hash equality on prefix tokens and compatible block boundaries.
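One way to picture reference-counted sharing is the toy cache below. It is a sketch under simplifying assumptions, not any engine's actual scheme: the class name is hypothetical, hash collisions are ignored, blocks are never evicted, and only full blocks are shared while a partial tail block stays private.

```python
from collections import defaultdict

BLOCK_SIZE = 4  # small for readability


class SharedBlockCache:
    """Reference-counted physical blocks keyed by the hash of the prefix they complete."""

    def __init__(self):
        self.block_by_hash = {}            # prefix hash -> physical block ID
        self.ref_count = defaultdict(int)  # physical block ID -> sequences using it
        self.next_block_id = 0

    def _new_block(self):
        block = self.next_block_id
        self.next_block_id += 1
        return block

    def blocks_for(self, tokens):
        """Return a block table, reusing blocks whose entire prefix matches."""
        table = []
        full = len(tokens) - len(tokens) % BLOCK_SIZE  # tokens covered by full blocks
        for end in range(BLOCK_SIZE, full + 1, BLOCK_SIZE):
            key = hash(tuple(tokens[:end]))  # hash covers every token up to this block
            if key not in self.block_by_hash:
                self.block_by_hash[key] = self._new_block()
            block = self.block_by_hash[key]
            self.ref_count[block] += 1
            table.append(block)
        if len(tokens) > full:
            tail = self._new_block()  # partial tail block is never shared
            self.ref_count[tail] += 1
            table.append(tail)
        return table


cache = SharedBlockCache()
system_prompt = list(range(8))  # two full blocks of shared prefix
table_a = cache.blocks_for(system_prompt + [100, 101])
table_b = cache.blocks_for(system_prompt + [200])
print(table_a, table_b, dict(cache.ref_count))
```

Both requests point at the same two physical blocks for the shared prefix, so the prefix KV is stored once; a block would return to the free list only when its reference count drops to zero.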

CPU or disk offload

PagedAttention optimizes on-GPU fragmentation. When total working set exceeds VRAM, frameworks may swap blocks to CPU RAM or NVMe — higher latency, but higher effective capacity. Prefer pool tuning and GQA/MQA architectures before offload unless context length is the product requirement.

Multi-GPU tensor parallel

Tensor parallelism shards each layer's weights and attention heads across devices; each rank holds the KV for its own heads in every block. Block tables are coordinated across ranks so attention remains consistent. Pool sizing must then account for every rank in the group, not a single GPU.

Harbor Analytics OOM refactor

Harbor’s fix was not “buy more VRAM” but to align scheduler admission with block accounting:

  1. Baseline — logged OOM rate, block pool exhaustion events, and reserved-vs-used token histograms at month-end peak.
  2. Block pool expansion — raised gpu_memory_utilization from 0.85 to 0.92 after FP8 weight loading freed headroom; validated quality on 50 golden RAG answers.
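  3. Admission formula — replaced the max-length sum check with ceil(prompt_tokens / block_size) + ceil(max_new_tokens / block_size) per request against free block count; this overestimates ceil((prompt_tokens + max_new_tokens) / block_size) by at most one block, which errs on the safe side.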
  4. Cancel propagation — SSE disconnect returns blocks within one decode step instead of holding until generation timeout.
  5. Prefix block sharing — enabled for shared month-end report template (4,200 tokens); cut duplicate KV by ~38% for that cohort.
  6. Alerts — page when block utilization > 90% for five minutes or OOM rate > 1% over fifteen minutes.

OOM rate fell from 12% to 0.28%; median concurrent sequences rose from 28 to 67 on the same A100 80GB. TTFT improved slightly because fewer requests failed and retried.
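The admission formula from step 3 can be written out as follows. The function names, the free-block count, and the example requests are hypothetical; only the formula itself comes from the refactor described above.

```python
import math


def blocks_required(prompt_tokens, max_new_tokens, block_size=16):
    """Per-request estimate from step 3: prompt blocks plus generation-budget blocks."""
    return (math.ceil(prompt_tokens / block_size)
            + math.ceil(max_new_tokens / block_size))


def admit(free_blocks, prompt_tokens, max_new_tokens, block_size=16):
    """Admit only if the estimated blocks fit in the current free list."""
    return blocks_required(prompt_tokens, max_new_tokens, block_size) <= free_blocks


free_blocks = 1_000  # assumed snapshot of the pool's free list
short_sql = blocks_required(prompt_tokens=400, max_new_tokens=512)
long_rag = blocks_required(prompt_tokens=18_000, max_new_tokens=1_024)
print(short_sql, admit(free_blocks, 400, 512))
print(long_rag, admit(free_blocks, 18_000, 1_024))
```

For comparison, a max-length check with a 32,768-token ceiling would charge every request 2,048 blocks and would have rejected the short SQL question as well.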

Technique decision table

Your situation Prefer Avoid
Single-user offline script, one sequence at a time Simple contiguous KV or framework default Block pool tuning overhead
Multi-tenant chat API with variable lengths PagedAttention (vLLM, SGLang, TRT-LLM) Per-request max-length contiguous buffers
Many requests share long system prompt Prefix block sharing plus API prompt caching Duplicating full KV per session
Extreme context (128K+) on limited VRAM FP8 KV, GQA models, offload tier FP16 KV with default pool on 24GB cards
Memory errors despite low compute util Increase block pool, lower max_num_seqs, fix admission math Adding GPUs without fixing fragmentation
Uniform batch jobs (same length) Static batching; paging optional Over-engineering block tables for one-shot jobs
Provider API only Gateway queueing; trust provider paging Reimplementing PagedAttention client-side

Common pitfalls

  • Tuning max_num_seqs without block pool headroom. Scheduler accepts work the pool cannot store.
  • Ignoring block internal waste for chatbots. Tiny block_size increases table overhead; huge blocks waste VRAM on short queries.
  • Assuming GPU memory % equals KV efficiency. Contiguous reservations and pre-allocated pools count as used memory whether or not they hold tokens.
  • Skipping cancel cleanup. Ghost sequences hold blocks for hundreds of decode steps.
  • FP8 KV without quality regression tests. Long-context faithfulness can degrade silently.
  • Prefix sharing across prompt versions. Stale blocks if system prompt hash gate is wrong.
  • Admission based on max_model_len for all users. Use actual prompt length plus generation budget.
  • No metrics on block utilization. You cannot right-size what you only measure as CUDA OOM stack traces.

Production checklist

  • Run concurrent serving through a PagedAttention-capable engine (vLLM, SGLang, TRT-LLM).
  • Size block pool via gpu_memory_utilization after model weights load.
  • Pick block_size from traffic mix (short chat vs long RAG).
  • Implement admission on required blocks, not sum of max context lengths.
  • Instrument allocated blocks, free blocks, and OOM rate per GPU.
  • Enable prefix block sharing for identical system or document prefixes.
  • Propagate client disconnect to free blocks within one decode step.
  • Pair paging with continuous batching and chunked prefill for mixed workloads.
  • Validate FP8 or INT8 KV on long-context golden sets before production.
  • Alert on sustained block utilization above 90%.
  • Re-tune after model swaps (layers, heads, and context length change block bytes).
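The sustained-utilization alert in the checklist could be evaluated along these lines. This is a sketch: the sample format and function name are assumptions, and the 90% threshold and five-minute window mirror the alert described in the Harbor refactor.

```python
def sustained_breach(samples, threshold=0.90, window_seconds=300):
    """Return True when utilization stayed above threshold for the whole window.

    samples: list of (timestamp_seconds, block_utilization), oldest first.
    """
    if not samples:
        return False
    latest = samples[-1][0]
    window_start = latest - window_seconds
    # Require data reaching back to the start of the window, so a short
    # burst right after startup does not page anyone.
    covers_window = samples[0][0] <= window_start
    recent = [util for ts, util in samples if ts >= window_start]
    return covers_window and all(util > threshold for util in recent)


steady_high = [(t, 0.95) for t in range(0, 301, 60)]
with_dip = [(t, 0.95 if t != 180 else 0.70) for t in range(0, 301, 60)]
too_short = [(t, 0.95) for t in range(0, 121, 60)]
print(sustained_breach(steady_high), sustained_breach(with_dip), sustained_breach(too_short))
```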

Key takeaways

  • PagedAttention stores KV cache in fixed physical blocks mapped by per-sequence block tables, eliminating contiguous-allocation fragmentation.
  • Block pools track actual token usage, so GPUs accept more concurrent chats than max-length reservation math allows.
  • Continuous batching and chunked prefill depend on paged KV to add and remove sequences cheaply each iteration.
  • Harbor Analytics cut OOM from 12% to 0.28% by fixing block admission and pool sizing, not by adding hardware.
  • Monitor block utilization, not just GPU memory percentage, when tuning inference capacity.
