
LLM serving prefix cache and RadixAttention explained

Harbor Analytics runs a self-hosted vLLM cluster for SQL copilot and document Q&A. Every analyst session starts with the same 9,200-token stack: system instructions, tool schemas, and a frozen quarterly metrics glossary. User questions vary — 80 to 400 tokens — but prefill still recomputed the entire prefix on every HTTP request. Median time-to-first-token (TTFT) sat at 1.9 s even after PagedAttention eliminated OOMs. Provider prompt caching does not apply to their own GPUs. Enabling automatic prefix caching (RadixAttention-style KV block sharing across requests) cut median TTFT to 0.61 s and freed 34% of prefill FLOPs for longer RAG retrievals.

Serving prefix cache stores computed KV tensors for token prefixes that repeat across separate requests. Unlike per-request KV cache (freed when generation ends) and unlike API-level prompt cache (provider-managed, billed per cached token), self-hosted prefix cache lives inside the inference engine and reuses physical GPU blocks from PagedAttention pools. This guide covers RadixAttention prefix trees, vLLM and SGLang implementations, hit-rate measurement, eviction and memory tradeoffs, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

Three layers of “caching” in LLM inference

Teams conflate three mechanisms that all reduce redundant work but operate at different scopes:

Layer | Scope | Typical owner | What is reused
--- | --- | --- | ---
Per-request KV cache | Single generation | Any transformer server | K/V for tokens already processed in this request
Serving prefix cache | Across requests on same GPU pool | vLLM, SGLang, TensorRT-LLM | Physical KV blocks for identical token prefixes
API prompt cache | Across calls to cloud provider | OpenAI, Anthropic, etc. | Provider-internal state; discounted input pricing

Serving prefix cache matters when you self-host, run multi-tenant RAG with shared system prompts, or batch many short queries against the same document header. Semantic caching is yet another layer — paraphrase-tolerant answer reuse at the application level, not KV block sharing.

How RadixAttention prefix trees work

RadixAttention (popularized by SGLang) organizes cached prefixes as a radix tree keyed by token IDs. Each node holds a sequence segment and points to allocated KV blocks in GPU memory. When a new request arrives:

  1. Tokenize the prompt and walk the tree to find the longest matching prefix.
  2. Reuse cached KV blocks for the matched segment — skip prefill for those tokens.
  3. Run prefill only on the unmatched suffix (user question, retrieved chunks).
  4. Insert new nodes for novel suffixes so future requests can branch from them.

Two requests sharing a 9,000-token system prompt but differing at token 9,001 traverse the same path until the branch point, then allocate separate child nodes. This is more memory-efficient than hashing entire prompts: partial overlaps (same system prompt, different tool subsets) still share trunk blocks.
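The walk, reuse, and insert steps above can be sketched as a small radix tree over token IDs. This is a minimal illustration, not SGLang's implementation: the names Node and match_and_insert are hypothetical, nodes hold only token runs (no KV block pointers), and real engines typically match at block granularity rather than per token.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """One radix-tree node: a run of token IDs plus children keyed by first token."""
    tokens: tuple = ()
    children: dict = field(default_factory=dict)


def _common_len(a, b):
    """Length of the shared leading run of two token sequences."""
    n = 0
    while n < len(a) and n < len(b) and a[n] == b[n]:
        n += 1
    return n


def match_and_insert(root, prompt):
    """Return how many leading tokens were already cached, then cache the rest."""
    node, pos, matched = root, 0, 0
    while pos < len(prompt):
        child = node.children.get(prompt[pos])
        if child is None:
            # Novel suffix: add one node holding everything that is left.
            node.children[prompt[pos]] = Node(tuple(prompt[pos:]))
            return matched
        shared = _common_len(child.tokens, prompt[pos:])
        matched += shared
        pos += shared
        if shared < len(child.tokens):
            # Diverged mid-segment: split the child at the branch point.
            tail = Node(child.tokens[shared:], child.children)
            child.tokens = child.tokens[:shared]
            child.children = {tail.tokens[0]: tail}
        node = child
    return matched


root = Node()
header = list(range(9000))                            # stand-in for a shared system prompt
first = match_and_insert(root, header + [1, 2, 3])    # cold tree, nothing to reuse
second = match_and_insert(root, header + [7, 8])      # reuses the 9,000-token trunk
third = match_and_insert(root, header + [7, 8, 9])    # reuses trunk plus the [7, 8] branch
```

In this sketch the second request triggers a split at token 9,000, so the trunk becomes one shared node with two children. Only the tokens past the returned match length would need prefill.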

vLLM’s automatic prefix caching (v0.4+) implements a similar idea atop its block manager: hashed block chains reference-count physical pages. When reference counts hit zero, blocks return to the free pool. Both designs depend on PagedAttention — without fixed-size pages, sharing contiguous buffers across requests is impractical.
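A rough sketch of the hashed-block-chain idea follows. It is not vLLM's code: BLOCK_SIZE, block_hashes, and BlockPool are hypothetical names, the block size of 16 tokens is an assumption, and the pool tracks only reference counts rather than real GPU pages. Each block's hash includes its parent's hash, so a matching hash implies the whole prefix up to that block matches.

```python
import hashlib

BLOCK_SIZE = 16  # tokens per KV block; an assumed value for this sketch


def block_hashes(token_ids, block_size=BLOCK_SIZE):
    """Hash each full block together with its parent hash, forming a chain."""
    hashes, parent = [], b""
    full = len(token_ids) - len(token_ids) % block_size
    for start in range(0, full, block_size):
        chunk = token_ids[start:start + block_size]
        payload = parent + b"|" + ",".join(map(str, chunk)).encode()
        parent = hashlib.sha256(payload).digest()
        hashes.append(parent)
    return hashes


class BlockPool:
    """Toy stand-in for a block manager: maps block hash -> reference count."""

    def __init__(self):
        self.refcount = {}

    def acquire(self, token_ids):
        """Take a reference on every full block; return (cached tokens, hashes held)."""
        cached_tokens, still_matching = 0, True
        hashes = block_hashes(token_ids)
        for h in hashes:
            if still_matching and h in self.refcount:
                cached_tokens += BLOCK_SIZE
            else:
                still_matching = False
            self.refcount[h] = self.refcount.get(h, 0) + 1
        return cached_tokens, hashes

    def release(self, hashes):
        """Drop one reference per block; blocks at zero go back to the free pool."""
        for h in hashes:
            self.refcount[h] -= 1
            if self.refcount[h] == 0:
                del self.refcount[h]


pool = BlockPool()
header = list(range(64))                              # four full blocks of shared prefix
hit_a, held_a = pool.acquire(header + [900, 901])     # cold pool
hit_b, held_b = pool.acquire(header + [700])          # shares all four header blocks
shared_refs = sorted(pool.refcount.values())
pool.release(held_a)
pool.release(held_b)
```

Trailing tokens that do not fill a block are never hashed here, which is one reason a prefix shorter than a block cannot produce a hit in block-based designs.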

Cache hit rate and what actually moves latency

Define prefix hit ratio as cached prefix tokens divided by total prompt tokens per request. A 9,200-token shared header plus a 300-token user query yields a hit ratio of 9,200 / 9,500 ≈ 96.8%, but the TTFT improvement depends on how much of the request's latency is prefill. Short, decode-heavy chats see smaller gains; long, prefill-heavy RAG workloads see dramatic drops.
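The ratio is simple enough to check by hand, but a small helper makes it easy to see how it moves across the 80 to 400 token query range mentioned earlier. The function name prefix_hit_ratio is illustrative, and the sketch assumes the whole 9,200-token header is served from cache.

```python
def prefix_hit_ratio(cached_prefix_tokens, total_prompt_tokens):
    """Cached prefix tokens divided by total prompt tokens for one request."""
    if total_prompt_tokens <= 0:
        raise ValueError("prompt must contain at least one token")
    return cached_prefix_tokens / total_prompt_tokens


SHARED_HEADER = 9_200  # tokens in the shared preamble, from the example above

for query_tokens in (80, 300, 400):
    ratio = prefix_hit_ratio(SHARED_HEADER, SHARED_HEADER + query_tokens)
    print(f"{query_tokens:>3}-token query: {ratio:.1%}")
```

For these three query lengths the ratio stays between roughly 95.8% and 99.1%, so on a route with a stable header the hit ratio itself barely moves; what varies is how much of TTFT the remaining suffix prefill accounts for.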

Track these metrics per model replica:

  • Prefix hit ratio (tokens) — rolling p50/p95 per route.
  • TTFT with vs without cache — label misses when hash diverges (tokenizer drift, whitespace).
  • Cache residency bytes — pinned prefixes competing with active sequences for block pool.
  • Eviction rate — how often hot prefixes are dropped under memory pressure.

Harbor Analytics tagged routes (sql-copilot vs doc-qa) and discovered doc-qa hit 91% while ad-hoc SQL hit 12% — the latter had no stable prefix. They split block pools so hot glossary prefixes were not evicted by bursty one-off prompts.
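One way to get route-level numbers like these is to keep a rolling window of per-request hit ratios keyed by route label. The sketch below is an assumption about how such a tracker might look, not Harbor Analytics' code; RouteHitTracker and percentile are hypothetical names, the percentile uses the nearest-rank method, and the sample values are made up for illustration.

```python
import math
from collections import defaultdict, deque


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty collection; pct in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


class RouteHitTracker:
    """Keeps the most recent hit ratios per route and summarizes them."""

    def __init__(self, window=1000):
        self._ratios = defaultdict(lambda: deque(maxlen=window))

    def record(self, route, cached_tokens, prompt_tokens):
        self._ratios[route].append(cached_tokens / prompt_tokens)

    def summary(self, route):
        ratios = self._ratios.get(route)
        if not ratios:
            return None  # no traffic seen on this route yet
        return {"p50": percentile(ratios, 50), "p95": percentile(ratios, 95)}


tracker = RouteHitTracker()
# Made-up samples for illustration only, not measurements from this guide.
for cached, total in [(9200, 9300), (9200, 9500), (9200, 9600)]:
    tracker.record("stable-header", cached, total)
for cached, total in [(0, 400), (0, 250), (160, 320)]:
    tracker.record("one-off", cached, total)

stable = tracker.summary("stable-header")
one_off = tracker.summary("one-off")
```

Blending both routes into a single number would hide that one of them almost never hits, which is the failure mode route labels are meant to expose.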

Memory tradeoffs and eviction

Prefix cache is not free. Cached blocks consume the same GPU page pool as live requests. Aggressive caching without eviction caps can starve admission for new long-context sessions — the OOM problem returns as cache pressure instead of fragmentation.

Common policies:

  • LRU eviction — drop least-recently-used prefix blocks when pool utilization exceeds a threshold (typical default).
  • Pinning hot prefixes — mark stable system prompts as non-evictable; reserve a dedicated cache budget.
  • TTL by content version — invalidate when glossary or tool schema version bumps, even if tokens match.
  • Per-tenant isolation — separate radix subtrees so one tenant’s prompts do not pollute another’s cache keys.
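The LRU and pinning policies above can be combined in a few lines. This is a toy model under stated assumptions: PrefixCache is a hypothetical class, sizes are counted in abstract blocks, and a real engine would evict at block granularity inside its own allocator rather than per named prefix.

```python
from collections import OrderedDict


class PrefixCache:
    """Toy block-budget cache: LRU eviction for unpinned prefixes, capped pinning."""

    def __init__(self, capacity_blocks, pinned_budget_blocks):
        self.capacity = capacity_blocks
        self.pinned_budget = pinned_budget_blocks
        self._entries = OrderedDict()  # key -> (blocks, pinned), oldest first

    def _blocks(self, pinned_only=False):
        """Total blocks held, optionally counting pinned entries only."""
        return sum(b for b, p in self._entries.values() if p or not pinned_only)

    def touch(self, key):
        """Mark a prefix as recently used; returns False on a cache miss."""
        if key not in self._entries:
            return False
        self._entries.move_to_end(key)
        return True

    def put(self, key, blocks, pinned=False):
        """Insert a prefix, evicting unpinned LRU entries; returns evicted keys."""
        self._entries.pop(key, None)
        pinned_total = self._blocks(pinned_only=True)
        if pinned and pinned_total + blocks > self.pinned_budget:
            raise ValueError("pinned budget exceeded")
        if pinned_total + blocks > self.capacity:
            raise MemoryError("cannot fit even after evicting every unpinned prefix")
        evicted = []
        while self._blocks() + blocks > self.capacity:
            # Oldest unpinned entry goes first; the checks above guarantee one exists.
            victim = next(k for k, (_, p) in self._entries.items() if not p)
            del self._entries[victim]
            evicted.append(victim)
        self._entries[key] = (blocks, pinned)
        return evicted


cache = PrefixCache(capacity_blocks=10, pinned_budget_blocks=4)
cache.put("system-prompt", 4, pinned=True)
cache.put("one-off-a", 3)
cache.put("one-off-b", 3)
cache.touch("one-off-a")               # "one-off-b" is now least recently used
evicted = cache.put("one-off-c", 3)    # pool is full, so something must go
```

Here the pinned system prompt survives and the least recently used unpinned prefix is dropped. Rejecting a pin that exceeds the reserved budget, rather than silently shrinking space for live traffic, is the design choice the pinning bullet argues for.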

Pair prefix cache with chunked prefill on cache misses so a long suffix does not block the entire batch scheduler.

Harbor Analytics refactor (worked example)

Problem: 400 concurrent analysts, shared 9,200-token preamble, self-hosted Llama-3.1-70B on 4× A100. PagedAttention fixed OOM; TTFT still unacceptable for interactive SQL.

Changes:

  1. Enabled vLLM enable_prefix_caching=True with 12% of block pool reserved for pinned system prompts.
  2. Normalized preamble bytes (stable JSON key order in tool schemas) so hashes matched across clients.
  3. Moved retrieval chunks after the cached glossary block: static content first, dynamic content last (mirrors API cache breakpoint layout).
  4. Added Prometheus gauges for prefix hit ratio and cache eviction counters.
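Changes 2 and 3 amount to making the static part of the prompt byte-identical across clients and keeping it at the front. A minimal sketch, assuming tool schemas arrive as Python dicts: canonical_json and build_prompt are hypothetical helpers, and the example strings are invented. Note that sort_keys normalizes key order at every nesting level but leaves list order alone, so the order of tools still has to be fixed by the caller.

```python
import json
import os


def canonical_json(obj):
    """Serialize with sorted keys and fixed separators so equal data gives equal bytes."""
    return json.dumps(obj, sort_keys=True, separators=(",", ":"), ensure_ascii=False)


def build_prompt(system, tools, glossary, retrieved_chunks, question):
    """Static-first, dynamic-last layout: the cacheable part never moves."""
    static_part = "\n".join([system.strip(), canonical_json(tools), glossary.strip()])
    dynamic_part = "\n".join(list(retrieved_chunks) + [question.strip()])
    return static_part + "\n" + dynamic_part


# Two hypothetical clients send the same tool schema with different key order.
tools_a = [{"name": "run_sql", "parameters": {"type": "object", "required": ["query"]}}]
tools_b = [{"parameters": {"required": ["query"], "type": "object"}, "name": "run_sql"}]

prompt_a = build_prompt("You are a SQL copilot.", tools_a, "ARR: annual recurring revenue",
                        ["chunk about Q3"], "What was ARR in Q3?")
prompt_b = build_prompt("You are a SQL copilot. ", tools_b, "ARR: annual recurring revenue",
                        ["chunk about churn"], "How did churn move?")

# Character-level shared prefix of the two prompts.
shared = os.path.commonprefix([prompt_a, prompt_b])
```

Identical text is necessary but not sufficient: the cache is keyed on token IDs, so both clients also need the same tokenizer and chat template.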

Results: Median TTFT 1.9 s → 0.61 s on sql-copilot; p95 4.2 s → 1.4 s. GPU prefill utilization fell 34%, allowing 18% more concurrent seats without new hardware. Doc-qa routes with unique 15K-token retrieval packs saw only 8% TTFT improvement — expected low prefix overlap.
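The relative improvements follow directly from the before and after figures. The helper below is just the arithmetic, with pct_reduction as an illustrative name.

```python
def pct_reduction(before, after):
    """Percentage drop from before to after."""
    return (before - after) / before * 100


median_drop = pct_reduction(1.9, 0.61)   # median TTFT on sql-copilot
p95_drop = pct_reduction(4.2, 1.4)       # p95 TTFT on sql-copilot

print(f"median: {median_drop:.1f}%  p95: {p95_drop:.1f}%")
```

Both land at roughly two thirds, so the tail improved about as much as the median on this route.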

Technique decision table

Scenario | Prefer | Avoid
--- | --- | ---
Self-hosted vLLM/SGLang, shared system prompt | Serving prefix cache + pinned hot prefixes | Re-prefill full prompt every request
Cloud API only, static RAG header | API prompt cache breakpoints | Self-hosted radix tuning
Unique prompts every call (no overlap) | Focus on batching and quantization | Large pinned cache pools
Multi-tenant SaaS | Tenant-scoped cache keys + LRU caps | Global shared tree without isolation
Frequent schema/tool updates | Versioned cache keys + explicit invalidation | Assuming byte-identical prompts forever
Memory already saturated | Shrink cache budget, raise eviction | Pinning more prefixes
Paraphrased duplicate questions | Semantic cache at app layer | Expecting radix hits on different tokens
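Tenant-scoped and versioned cache keys from the table can both be expressed as a namespace mixed into the prefix hash. The sketch below is one possible scheme, not an engine API: cache_namespace and prefix_key are hypothetical names, and the tenant, model revision, and schema version strings are invented examples.

```python
import hashlib


def cache_namespace(tenant_id, model_revision, schema_version):
    """Short digest that changes whenever tenant, model, or schema version changes."""
    raw = "\x1f".join([tenant_id, model_revision, schema_version])
    return hashlib.sha256(raw.encode()).hexdigest()[:16]


def prefix_key(namespace, token_ids):
    """Cache key for a token prefix, scoped to one namespace."""
    h = hashlib.sha256(namespace.encode())
    for token in token_ids:
        h.update(token.to_bytes(4, "little"))
    return h.hexdigest()


tokens = [101, 2023, 2003, 1037, 3231]

ns_acme_v1 = cache_namespace("acme", "model-rev-a", "glossary-v1")
ns_acme_v2 = cache_namespace("acme", "model-rev-a", "glossary-v2")
ns_other = cache_namespace("globex", "model-rev-a", "glossary-v1")

key_acme_v1 = prefix_key(ns_acme_v1, tokens)
key_acme_v2 = prefix_key(ns_acme_v2, tokens)
key_other = prefix_key(ns_other, tokens)
```

With this layout a version bump or a different tenant produces a different key even for identical tokens, so stale or cross-tenant entries are simply never matched and age out through normal eviction.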

Common pitfalls

  • Whitespace and JSON key order drift. Identical semantics, different bytes — cache miss on every call.
  • Tokenizer mismatch between clients. Tree keys are token IDs; different chat templates break sharing.
  • Pinning without a budget. Hot prefixes evict active sequences or cause admission rejects.
  • Ignoring cache on multi-GPU tensor parallel. Prefix blocks must replicate or shard consistently across ranks.
  • Confusing prefix cache with speculative decoding. Speculative decoding speeds decode; prefix cache speeds prefill.
  • No invalidation on model swap. KV from quantized vs FP16 weights must not mix.
  • Measuring hit rate without route labels. Aggregate 40% hit can hide a critical path at 5%.
  • Dynamic content before static header. Putting timestamps in the first 100 tokens kills trunk sharing.
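Several of these pitfalls show up the same way: two prompts that should share a trunk diverge early. A small helper that reports the first differing position, on serialized strings or on token ID lists, can make that visible. first_divergence is a hypothetical debugging aid, and the example inputs are invented.

```python
def first_divergence(a, b):
    """Index of the first position where two sequences differ, or None if identical."""
    for i, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return i
    return None if len(a) == len(b) else min(len(a), len(b))


# Same tool schema, different key order: the serialized bytes split almost immediately.
schema_a = '{"name": "run_sql", "strict": true}'
schema_b = '{"strict": true, "name": "run_sql"}'
json_split = first_divergence(schema_a, schema_b)

# Token ID lists that share a trunk and branch at the user question.
trunk = list(range(500))
token_split = first_divergence(trunk + [11, 12], trunk + [42])

# One prompt is a strict prefix of the other.
prefix_split = first_divergence([1, 2], [1, 2, 3])
```

A divergence index near zero on a route that is supposed to share a long header usually points at serialization drift, a chat template difference, or dynamic content placed too early.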

Production checklist

  • Confirm engine support (vLLM enable_prefix_caching, SGLang radix).
  • Require PagedAttention block pools sized for cache + active sequences.
  • Layout prompts static-first, dynamic-last (system, tools, docs, then user).
  • Canonicalize serialized tool schemas and system strings across clients.
  • Set cache budget and eviction policy; pin only proven hot prefixes.
  • Export prefix hit ratio, TTFT, eviction, and pool utilization metrics.
  • Version cache keys when glossary, tools, or model revision changes.
  • Isolate tenant or route subtrees in multi-tenant deploys.
  • Load-test cache cold-start vs warm-prefix concurrency separately.
  • Pair with chunked prefill on long cache-miss suffixes.
  • Document expected hit ratio per route for capacity planning.

Key takeaways

  • Serving prefix cache reuses GPU KV blocks across HTTP requests; it is distinct from per-request KV cache and cloud API prompt caching.
  • RadixAttention walks a token-prefix tree and vLLM automatic prefix caching matches hashed block chains; both skip redundant prefill on shared headers.
  • Hit ratio only matters where prefixes actually repeat — route-level metrics expose real savings.
  • Cache consumes the same block pool as live traffic; pin hot prefixes with a reserved budget and eviction policy.
  • Harbor Analytics cut median TTFT 68% by enabling prefix caching, canonicalizing tool schemas, and static-first prompt layout.
