LLM KV cache offloading explained

Harbor Analytics' compliance RAG fleet ran PagedAttention on eight A100 80GB workers with FP8 KV and H2O eviction already enabled. Month-end still rejected 22% of 28K-token agent sessions because the GPU block pool could not hold every concurrent sequence — even though host DRAM sat 60% idle. The platform team added tiered KV offloading: cold prefix blocks migrate to pinned CPU memory and stream back only when attention needs them. Concurrent long-context jobs per GPU rose from 2.8 to 6.7; decode P99 latency stayed under 180 ms once migration policy matched real access patterns instead of naive LRU.

KV cache offloading extends effective context capacity by placing some key-value blocks outside GPU HBM — typically CPU DRAM, rarely NVMe for archival prefixes. Unlike compression, offloading preserves full per-token KV tensors; unlike truncation, it keeps the entire logical sequence addressable. The cost is PCIe bandwidth and migration latency on every decode step that touches offloaded blocks. This guide covers the memory hierarchy, block migration mechanics in paged engines, vLLM swap_space sizing, prefill versus decode offload tradeoffs, pairing with prefill/decode disaggregation, the Harbor Analytics refactor, a technique decision table versus compression-only and disaggregation-only stacks, pitfalls, and a production checklist — building on KV cache fundamentals.

Why offload instead of compress, truncate, or add GPUs

Teams hit the KV wall when seq_len × concurrent_sessions exceeds GPU block pool capacity. Four responses compete:

| Approach | What changes | Quality impact | Latency impact |
|---|---|---|---|
| Buy more VRAM / GPUs | Hardware scale-out | None | Low if batching improves |
| KV compression / eviction | Fewer bytes or tokens stored | Bounded approximation risk | Often faster (less KV read) |
| Context truncation / RAG | Shorter working window | May drop facts | Retrieval adds TTFT |
| KV offloading | Blocks live off-GPU until needed | Lossless if migration correct | PCIe-bound on cold access |

Offloading shines when you must retain full attention history for audit, legal replay, or needle-in-haystack QA — cases where eviction is unacceptable but DRAM on the host is plentiful. It is a poor default for latency-sensitive chat at short context; compression or shorter windows win there.
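To make the KV wall concrete, here is a rough sizing sketch rather than any engine's real accounting. The model shape (80 layers, 8 KV heads, head dimension 128), the 16-token block size, the reading of "28K" as 28,672 tokens, and the 20 GiB pool are illustrative assumptions, not Harbor's measured configuration; the function names are hypothetical.

```python
import math

GIB = 1024 ** 3

def kv_bytes_per_token(num_layers, num_kv_heads, head_dim, dtype_bytes):
    """Bytes of KV stored per token: keys plus values, for every layer."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes

def blocks_needed(seq_len, sessions, block_size=16):
    """Each sequence rounds up to whole blocks; no prefix sharing is assumed."""
    return math.ceil(seq_len / block_size) * sessions

def max_sessions(pool_bytes, seq_len, bytes_per_token, block_size=16):
    """How many full-length sequences fit in a GPU block pool of pool_bytes."""
    bytes_per_seq = blocks_needed(seq_len, 1, block_size) * block_size * bytes_per_token
    return pool_bytes // bytes_per_seq

# Assumed shape: a hypothetical 70B-class GQA model with FP8 KV (1 byte per element).
per_token = kv_bytes_per_token(num_layers=80, num_kv_heads=8, head_dim=128, dtype_bytes=1)
seq_len = 28 * 1024  # assumption: "28K tokens" taken as 28,672

print(per_token)                                   # 163840 bytes per token
print(blocks_needed(seq_len, sessions=1))          # 1792 blocks per session
print(max_sessions(20 * GIB, seq_len, per_token))  # 4 sessions in the assumed 20 GiB pool
```

Under these assumptions a single 28K-token session needs about 4.4 GiB of KV, which is why a handful of concurrent long sessions can exhaust the GPU pool while host DRAM sits idle.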

Memory tiers and block migration

Modern serving stacks treat KV as fixed-size blocks (often 16 tokens) mapped through logical block tables. Offloading adds tiers:

  • GPU resident — hot blocks in HBM; full bandwidth for attention reads during decode.
  • CPU pinned — blocks in page-locked host DRAM; copied over PCIe on demand. Typical pool: 4–64 GiB per GPU, set via vLLM swap_space (GiB per GPU).
  • NVMe / disk — rare in online serving; used for checkpointing very long agent traces or warm-restart prefixes. Millisecond to second latency per block — not decode-path friendly.
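The tiers above can be pictured as a logical block table whose entries record where each block currently lives. This is a minimal sketch for illustration only; `Tier`, `BlockRef`, and `BlockTable` are hypothetical names and do not correspond to any engine's internal classes.

```python
from dataclasses import dataclass, field
from enum import Enum

class Tier(Enum):
    GPU = "gpu"    # resident in HBM
    CPU = "cpu"    # page-locked host DRAM
    DISK = "disk"  # NVMe or other archival storage

@dataclass
class BlockRef:
    tier: Tier
    physical_id: int  # slot index inside that tier's pool

@dataclass
class BlockTable:
    block_size: int = 16
    entries: list = field(default_factory=list)  # list index = logical block number

    def locate(self, token_pos):
        """Return the BlockRef holding the KV for a given token position."""
        return self.entries[token_pos // self.block_size]

    def cold_blocks(self):
        """Logical block numbers that must be migrated before a GPU read."""
        return [i for i, ref in enumerate(self.entries) if ref.tier is not Tier.GPU]

table = BlockTable(entries=[
    BlockRef(Tier.GPU, 7),
    BlockRef(Tier.CPU, 3),
    BlockRef(Tier.GPU, 9),
])
print(table.locate(20).tier)  # Tier.CPU: token 20 falls in logical block 1
print(table.cold_blocks())    # [1]
```

Keeping the tier in the table entry means the logical sequence stays fully addressable no matter where a block physically sits, which is the property that distinguishes offloading from truncation.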

Migration triggers

Engines evict GPU blocks when the pool crosses a watermark (e.g. 90% allocated). Policies differ:

  • LRU on physical blocks — simple; evicts recent tail if mis-tuned, destroying decode performance.
  • Prefix-prefer-resident — keep first P blocks on GPU (shared system prompts); offload middle document blocks.
  • Recency tail pin — last T blocks always GPU-resident because every decode step appends to the newest block and typically attends most heavily to recent tokens.
  • Per-session budget — cap GPU blocks per request; spill remainder to CPU deterministically at admission.

Decode reads the entire KV history each step, so every block left on the CPU must cross PCIe whenever attention touches it. Offloading early prefix blocks usually hurts less than offloading the trailing window: attention to distant tokens is sparse in many models, although the read still traverses the full block table. Profile with Nsight or engine counters before picking a policy.
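A combined policy (prefix-resident, tail-pinned, LRU in the middle) can be sketched as a victim selector. This is an illustrative sketch under simple assumptions, not how any particular engine chooses blocks: `pick_victims` and its parameters are hypothetical, and `last_access` is assumed to hold the decode step at which each logical block was last read.

```python
def pinned_blocks(num_blocks, prefix_pin, tail_pin):
    """Logical blocks that must stay on GPU: the first prefix_pin and last tail_pin."""
    prefix = range(min(prefix_pin, num_blocks))
    tail = range(max(0, num_blocks - tail_pin), num_blocks)
    return set(prefix) | set(tail)

def pick_victims(num_blocks, gpu_resident, last_access, prefix_pin, tail_pin, need):
    """Choose up to `need` GPU-resident blocks to offload, least recently used first,
    never touching the pinned prefix or tail."""
    pinned = pinned_blocks(num_blocks, prefix_pin, tail_pin)
    candidates = [b for b in sorted(gpu_resident) if b not in pinned]
    candidates.sort(key=lambda b: last_access[b])  # oldest access first
    return candidates[:need]

# Ten blocks, all on GPU; last_access[b] is the step when block b was last read.
last_access = [5, 4, 9, 1, 8, 2, 7, 3, 6, 0]
victims = pick_victims(10, set(range(10)), last_access,
                       prefix_pin=2, tail_pin=2, need=3)
print(victims)  # [3, 5, 7]: blocks 0, 1, 8, 9 are pinned and never considered
```

Note that block 9 has the oldest access stamp in this example yet is never evicted, which is the behavior plain LRU gets wrong when it offloads the decode tail.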

PCIe bandwidth math

A 70B-class model can read hundreds of MB of KV per decode step if many blocks are cold on the CPU. PCIe Gen4 x16 delivers roughly 25 GB/s in practice, so even a few cold blocks can add milliseconds per token. Batching amortizes weight reads but not always per-sequence KV scatter. If cold blocks read per step × bytes per block × decode steps per second exceeds the PCIe budget, P99 decode latency explodes.
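The arithmetic can be sketched as below. It is a bandwidth-only estimate that ignores per-transfer launch latency and any overlap with compute, so real numbers will differ. The 2.5 MiB block size is an assumption (16 tokens × 160 KiB per token for a hypothetical 70B-class GQA model with FP8 KV), and the function names are hypothetical.

```python
BYTES_PER_BLOCK = 16 * 163840  # assumed: 16 tokens x 160 KiB/token = 2.5 MiB

def cold_read_ms(cold_blocks, bytes_per_block, pcie_gb_per_s=25.0):
    """Milliseconds to stream cold blocks over PCIe for one decode step."""
    return cold_blocks * bytes_per_block / (pcie_gb_per_s * 1e9) * 1e3

def max_cold_blocks(budget_ms, bytes_per_block, pcie_gb_per_s=25.0):
    """Largest number of cold blocks per step that fits in a latency budget."""
    return int(budget_ms / 1e3 * pcie_gb_per_s * 1e9 // bytes_per_block)

print(round(cold_read_ms(8, BYTES_PER_BLOCK), 2))    # 0.84 ms for 8 cold blocks
print(round(cold_read_ms(100, BYTES_PER_BLOCK), 2))  # 10.49 ms for 100 cold blocks
print(max_cold_blocks(5.0, BYTES_PER_BLOCK))         # 47 blocks fit in a 5 ms budget
```

Even under this optimistic model, a hundred cold blocks per step costs about 10 ms per token, so the number of cold reads per step is the quantity to bound.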

Prefill versus decode offload

| Phase | Offload behavior | Design note |
|---|---|---|
| Prefill | Writes KV for full prompt; may spill immediately if pool tight | Peak PCIe writes; consider chunked prefill to smooth allocation |
| Decode | Appends one block per N tokens (N = block size); may evict coldest GPU block | Steady-state migration; tail-pin policy critical |
| Prefix cache hit | Reuses GPU or CPU-resident shared prefix blocks | Pair with prefix caching to avoid re-migrating hot templates |

Some fleets disaggregate prefill and decode onto different pools: prefill workers write KV directly into CPU-backed pools; decode workers pull blocks into GPU as needed. That separates bursty prefill PCIe traffic from latency-sensitive decode, provided the network between pools is fast enough.

vLLM and production configuration

In vLLM, --swap-space (GiB per GPU) reserves CPU memory for evicted blocks. Rules of thumb:

  • Size swap ≥ expected spill per GPU at peak concurrency, not merely “a few GB.”
  • Leave headroom on host RAM for OS page cache, tokenizer, and LoRA staging — swapping KV into contended DRAM triggers kernel thrash.
  • Enable block pool metrics: gpu_cache_usage_perc, cpu_cache_usage_perc, blocks swapped per second.
  • Admission control should estimate GPU and CPU block demand; admitting sessions that fit only on CPU still consumes swap budget.

Combine with FP8 KV to halve bytes per block relative to FP16; offloading the smaller blocks reduces PCIe volume. Do not stack aggressive eviction and offloading without a clear rule for which mechanism owns each block: tokens that have been evicted should not also occupy CPU swap slots.
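The sizing rules above can be turned into a quick pre-deployment check. This is a sketch under stated assumptions: the 2.5 MiB block size, the 6,000-block peak spill, the 20% safety margin, the eight GPUs per host, and the 64 GiB reserved for the OS and other processes are all illustrative values, and the helper names are hypothetical. The result is what one might pass to --swap-space.

```python
import math

GIB = 1024 ** 3
BYTES_PER_BLOCK = 16 * 163840  # assumed 2.5 MiB per block (FP8 KV, 16-token blocks)

def swap_gib_per_gpu(spill_blocks_peak, bytes_per_block, safety=1.2):
    """Swap size in whole GiB covering peak spilled blocks plus a safety margin."""
    return math.ceil(spill_blocks_peak * bytes_per_block * safety / GIB)

def host_fits(swap_gib, gpus_per_host, host_ram_gib, reserved_gib):
    """True if total swap plus reserved headroom fits in physical host RAM."""
    return swap_gib * gpus_per_host + reserved_gib <= host_ram_gib

print(swap_gib_per_gpu(6000, BYTES_PER_BLOCK))  # 18 GiB for 6,000 spilled blocks
print(host_fits(48, 8, 512, 64))                # True: 384 + 64 <= 512
print(host_fits(64, 8, 512, 64))                # False: 512 + 64 > 512
```

Running the second check before raising swap is a cheap guard against the worker being killed for exhausting host memory.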

Harbor Analytics tiered-memory refactor

Harbor's compliance agents required verbatim 28K-token replay for regulators — H2O middle-token drops were disallowed. The refactor:

  1. Baseline — measured GPU block exhaustion vs host DRAM idle; traced decode steps with >8 cold CPU blocks per token.
  2. Swap sizing — raised swap_space from 4 GiB to 48 GiB per GPU on 512 GiB hosts.
  3. Migration policy — prefix-prefer-resident for the first 2,048 tokens; LRU only among middle document blocks; the last 512 tokens pinned to GPU.
  4. Admission — reject only when gpu_blocks + cpu_blocks > fleet_cap, instead of rejecting on GPU block exhaustion alone.
  5. Prefix cache — shared 3,800-token compliance template pinned to GPU across sessions; middle PDF blocks tiered to CPU.
  6. Alerts — P99 decode >250 ms or cpu_blocks_read_per_token >12 triggers policy review.

Session rejection fell from 22% to 4% at month-end peak. Concurrent 28K jobs per GPU rose 2.4×. Needle recall stayed at 100% (lossless KV). Decode P50 rose 11 ms; P99 rose 34 ms — acceptable versus hard rejections.
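Steps 4 and 6 of the refactor can be sketched as two small predicates. This is one reading of the rules as stated, not Harbor's actual code: the function names, the interpretation of gpu_blocks and cpu_blocks as blocks already in use, and the example capacities are assumptions. The 250 ms and 12-block thresholds come from the alert rule above.

```python
def admit(gpu_blocks, cpu_blocks, new_blocks, fleet_cap):
    """Admit a session when blocks in use across both tiers, plus the new
    session's estimated demand, stay within the combined capacity."""
    return gpu_blocks + cpu_blocks + new_blocks <= fleet_cap

def needs_policy_review(p99_decode_ms, cpu_blocks_read_per_token):
    """Alert rule: either threshold being exceeded triggers a review."""
    return p99_decode_ms > 250 or cpu_blocks_read_per_token > 12

# Hypothetical worker: 8,000 GPU blocks + 19,000 CPU blocks = 27,000 total.
print(admit(7900, 10000, 1792, 27000))  # True: fits across tiers
print(admit(7900, 19000, 1792, 27000))  # False: both tiers are full
print(needs_policy_review(180, 8))      # False
print(needs_policy_review(180, 13))     # True: too many cold reads per token
```

In the first call a GPU-only check would have rejected the session, since 7,900 + 1,792 exceeds the 8,000 GPU blocks, which illustrates how counting both tiers reduces hard rejections.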

Technique decision table

| Your situation | Prefer | Avoid |
|---|---|---|
| Contexts <8K, VRAM sufficient | GPU-only paging | CPU swap complexity |
| Lossless long context, host DRAM available | Tiered offload + tail-pin policy | Aggressive token eviction |
| Latency-critical short chat | Compression or shorter window | Per-step CPU KV reads |
| Shared long prefixes across tenants | Prefix cache on GPU + offload unique suffix | Offloading hot shared blocks |
| PCIe-saturated decode P99 | More VRAM, disaggregated pools, or compression | Larger swap without policy change |
| Multi-GPU tensor parallel | Coordinate swap per rank; watch NUMA | Independent swap sizes per rank |

Pitfalls

  • Host OOM — swap_space × GPU count can exceed physical RAM; the OOM killer takes the worker.
  • NUMA blindness — on dual-socket servers, pinning CPU blocks on the wrong socket forces every transfer across the inter-socket link before it reaches the GPU's PCIe root.
  • LRU on decode tail — evicting recent blocks multiplies migrations every token.
  • Ignoring prefill spikes — admitting ten 32K prefills simultaneously fills swap before decode starts.
  • CUDA graph capture — dynamic migration can break graph replay; validate mixed resident/offload batches.
  • Silent quality drift — buggy block tables cause wrong attention, not obvious crashes; keep golden long-context tests.

Production checklist

  • Measure GPU block pool exhaustion rate before enabling swap.
  • Size swap_space from peak concurrent long sessions, not defaults.
  • Implement tail-pin and prefix-resident migration policy.
  • Expose GPU and CPU cache usage metrics per worker.
  • Admission estimates total blocks across tiers.
  • Pair with FP8 KV or compression to cut PCIe bytes where the accuracy loss is acceptable.
  • Profile decode with cold-block read counts per token.
  • Validate NUMA pinning on multi-socket hosts.
  • Regression-test prefix cache hits with offloaded suffix blocks.
  • Document when offload is disabled for latency SLO tiers.

Key takeaways

  • KV offloading spills paged blocks to CPU DRAM (or beyond) while preserving full sequences — lossless but PCIe-sensitive.
  • Migration policy matters more than swap size: pin decode tails and hot prefixes, offload cold middle blocks.
  • Offloading complements compression and disaggregation; it does not replace them when latency or bytes are the binding constraint.
  • Harbor Analytics cut long-context rejections 22%→4% with 48 GiB swap and prefix/tail pinning.
  • Monitor cold-block reads per decode step — the early signal that P99 decode is about to break.
