LLM KV cache compression and eviction explained

Harbor Analytics' month-end RAG agent ingests 18,000-token policy PDFs plus 6,000 tokens of tool schemas on every session. After PagedAttention eliminated fragmentation OOMs, the bottleneck shifted: a single 32K-context session on Llama-3-class 70B still needed 38 GB of KV cache at FP16 — more than half an A100 80GB after weights loaded. Admission rejected 41% of concurrent agent jobs at month-end peak even though compute utilization sat at 55%.

KV cache compression and eviction attack that memory wall without necessarily shortening the user-visible context window. Techniques range from uniform precision cuts (FP8 KV, which is lossy but applied evenly to every token) to heuristic token dropping (H2O, StreamingLLM) to structured prefix clustering (SnapKV). The right stack depends on whether you need exact replay of full attention or can tolerate bounded approximation on middle tokens. This guide explains how KV memory scales, the major compression families, serving-engine integration, the Harbor Analytics refactor, a technique decision table versus full-context and RAG-only approaches, pitfalls, and a production checklist. It sits alongside our guides on KV cache fundamentals, attention sinks, and sliding-window attention.

Why KV cache dominates long-context serving

During decode, each new token attends to all prior tokens. The engine stores per-layer key and value tensors for every token already processed. Rough memory per sequence scales as:

2 × num_layers × seq_len × num_kv_heads × head_dim × bytes_per_element

GQA and MQA shrink num_kv_heads, and FP8 KV halves bytes_per_element relative to FP16, but linear growth in seq_len remains. For a 70B model, a batch of concurrent 32K-token sequences can hold more bytes in KV than the weights occupy. PagedAttention improves utilization of allocated blocks; compression reduces bytes per stored token or the count of stored tokens.
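
To make the formula concrete, here is a minimal sketch that evaluates it for a few configurations. The 80-layer, head_dim 128 shape and the head counts are illustrative assumptions, not Harbor's measured deployment, and the function name is hypothetical.

```python
def kv_bytes_per_sequence(num_layers, seq_len, num_kv_heads, head_dim, bytes_per_element):
    """KV bytes for one sequence: 2 (K and V) x layers x tokens x KV heads x head dim x element size."""
    return 2 * num_layers * seq_len * num_kv_heads * head_dim * bytes_per_element


GIB = 1024 ** 3

# Assumed shape for illustration: 80 layers, head_dim 128, 32,768-token context.
SHAPE = dict(num_layers=80, seq_len=32_768, head_dim=128)

mha_fp16 = kv_bytes_per_sequence(num_kv_heads=64, bytes_per_element=2, **SHAPE)  # one KV head per query head
gqa_fp16 = kv_bytes_per_sequence(num_kv_heads=8, bytes_per_element=2, **SHAPE)   # grouped-query attention
gqa_fp8 = kv_bytes_per_sequence(num_kv_heads=8, bytes_per_element=1, **SHAPE)    # GQA plus 1-byte KV

for label, n in [("MHA FP16", mha_fp16), ("GQA FP16", gqa_fp16), ("GQA FP8", gqa_fp8)]:
    print(f"{label}: {n / GIB:.1f} GiB per sequence")
```

Under these assumptions the three cases come to 80, 10, and 5 GiB per sequence. Every term except seq_len is fixed by the model or the storage format, which is why the remaining levers are bytes per element and the number of tokens kept.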

Phase | KV growth | Typical bottleneck
Prefill | Writes KV for entire prompt in one or few passes | Compute (FLOPs), peak activation memory
Decode | Appends one token's KV per step | Memory bandwidth reading full KV history
Multi-turn chat | KV accumulates across turns if not truncated | Pool exhaustion despite paging
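
The decode row can be turned into a rough lower bound: each step has to read the whole stored KV history at least once, so step time cannot drop below KV bytes divided by memory bandwidth. The sketch below assumes a round 2 TB/s bandwidth figure and a 10 GiB KV history purely for illustration, and it ignores weight reads, which add to the real cost.

```python
def min_decode_step_ms(kv_bytes_read, hbm_bytes_per_s):
    """Lower bound on one decode step from streaming the KV history once, in milliseconds."""
    return kv_bytes_read / hbm_bytes_per_s * 1000.0


ASSUMED_BANDWIDTH = 2.0e12          # bytes/s, a round illustrative figure, not a measured value
kv_history_bytes = 10 * 1024 ** 3   # assumed 10 GiB of stored KV for one sequence

full = min_decode_step_ms(kv_history_bytes, ASSUMED_BANDWIDTH)
halved = min_decode_step_ms(kv_history_bytes // 2, ASSUMED_BANDWIDTH)
print(f"full KV: {full:.2f} ms/step, half the bytes: {halved:.2f} ms/step")
```

Halving stored bytes, whether through precision or eviction, halves this floor, which is why compression can help decode latency as well as pool capacity.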

Compression families

1. Precision reduction (lossy but uniform)

Store K and V in FP8, INT8, or INT4 instead of FP16/BF16. Every token remains; only representation shrinks. Quality impact is usually modest on short and medium contexts but can surface on needle-in-haystack tasks past 64K. Validate on golden long-context sets before enabling fleet-wide.
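
As a rough illustration of why precision reduction is lossy but bounded, the sketch below applies symmetric absmax INT8 quantization to a single token's vector. It is a pure-Python stand-in chosen for readability: production engines use hardware FP8 formats and fused kernels, and the function names here are hypothetical.

```python
def quantize_int8(vec):
    """Symmetric absmax quantization: int8 codes plus one float scale per vector."""
    absmax = max(abs(x) for x in vec)
    scale = absmax / 127.0 if absmax > 0 else 1.0
    codes = [max(-127, min(127, round(x / scale))) for x in vec]
    return codes, scale


def dequantize_int8(codes, scale):
    """Recover approximate values from the stored codes."""
    return [c * scale for c in codes]


vec = [0.02, -1.5, 0.75, 3.0, -0.001]   # made-up values for one token's K vector
codes, scale = quantize_int8(vec)
restored = dequantize_int8(codes, scale)
max_err = max(abs(a - b) for a, b in zip(vec, restored))
print(codes, f"scale={scale:.5f}", f"max_err={max_err:.5f}")
```

The per-element error is at most half a quantization step (scale / 2), and the step size is set by the largest value in the vector. Small components that sit next to an outlier lose the most relative precision, which is one reason to check rare-fact retrieval rather than only average scores.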

2. Attention sinks and sliding windows

The attention-sink observation is that initial tokens receive disproportionate attention mass even when semantically empty (BOS, system preamble). StreamingLLM keeps those sink tokens plus a trailing window of the most recent W tokens and drops the KV in between. Memory becomes O(sinks + W) instead of O(seq_len). This works well for streaming dictation and chat where recent context matters most; it is risky for document QA where a fact sits in the discarded middle.
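
The retention rule itself is small enough to sketch. The helper below is hypothetical and only computes which positions survive; the published method also re-indexes positions within the retained cache, which this sketch omits.

```python
def streaming_keep_indices(seq_len, num_sinks, window):
    """Positions retained by a sink-plus-recent-window policy."""
    if seq_len <= num_sinks + window:
        return list(range(seq_len))                     # nothing to drop yet
    sinks = list(range(num_sinks))                      # first tokens, always kept
    recent = list(range(seq_len - window, seq_len))     # trailing window
    return sinks + recent


kept = streaming_keep_indices(seq_len=10, num_sinks=2, window=3)
print(kept)
```

With 10 tokens, 2 sinks, and a window of 3, positions 2 through 6 are dropped. Anything a later question depends on in that range is no longer visible to the model.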

3. Heavy-hitter eviction (H2O)

H2O tracks cumulative attention scores per token during generation and retains a budget of “heavy hitter” tokens plus recent tokens. Unlike fixed windows, retained tokens adapt to what the model actually attends to. Implementation cost is higher: you need running attention statistics and eviction passes that rewrite block tables in paged engines.
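
A simplified sketch of the selection idea follows. It scores tokens in one pass over a recorded attention history, whereas the published algorithm evicts greedily at each decode step; the function name and the toy attention rows are assumptions for illustration.

```python
def h2o_keep_indices(attn_rows, heavy_budget, recent_budget):
    """attn_rows[t][j] is the attention weight from decode step t to cached token j."""
    seq_len = len(attn_rows[-1])            # the last row covers every cached token
    scores = [0.0] * seq_len
    for row in attn_rows:                   # accumulate attention mass per token
        for j, w in enumerate(row):
            scores[j] += w
    recent_start = max(0, seq_len - recent_budget)
    recent = set(range(recent_start, seq_len))
    older = sorted(range(recent_start), key=lambda j: scores[j], reverse=True)
    heavy = set(older[:heavy_budget])       # highest cumulative scores outside the recent tail
    return sorted(heavy | recent)


attn_rows = [
    [0.7, 0.1, 0.1, 0.1],
    [0.5, 0.05, 0.3, 0.05, 0.1],
    [0.4, 0.05, 0.35, 0.05, 0.05, 0.1],
]
kept = h2o_keep_indices(attn_rows, heavy_budget=2, recent_budget=2)
print(kept)
```

Here tokens 0 and 2 accumulate the most attention and survive alongside the two most recent tokens, while tokens 1 and 3 are evicted. In a paged engine the surviving set would then have to be reflected in the block tables, which is where most of the implementation cost sits.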

4. Prefix clustering (SnapKV)

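Long shared prefixes (system prompts, tool definitions, retrieved chunks) often contain redundant KV structure. SnapKV uses attention from an observation window at the end of the prompt to score prefix positions per head, pools neighboring scores so the selected positions stay clustered, and keeps only those KV entries; dropped entries are not reconstructed at attention time. It pairs naturally with prompt caching and prefix block sharing in vLLM: compress once, reuse across sessions.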
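
One way to picture the selection step, loosely following the published SnapKV description, is vote, pool, then take the top positions. The sketch below handles a single head in pure Python; the function name, pooling width, and toy attention values are assumptions, and a real implementation runs per head on tensors and also keeps the observation window itself.

```python
def snapkv_select(obs_attn, prefix_len, keep, pool=3):
    """obs_attn[q][j] is the attention from observation-window query q to prefix position j."""
    # Vote: total attention each prefix position receives from the observation window.
    votes = [sum(row[j] for row in obs_attn) for j in range(prefix_len)]
    # Pool: max over a small neighborhood so positions next to a peak are kept with it.
    half = pool // 2
    pooled = [max(votes[max(0, j - half): j + half + 1]) for j in range(prefix_len)]
    # Select: highest pooled scores, returned in original order.
    top = sorted(range(prefix_len), key=lambda j: pooled[j], reverse=True)[:keep]
    return sorted(top)


obs_attn = [
    [0.2, 0.0, 0.0, 0.0, 0.5, 0.0, 0.0, 0.05],
    [0.1, 0.0, 0.0, 0.0, 0.4, 0.0, 0.0, 0.05],
]
selected = snapkv_select(obs_attn, prefix_len=8, keep=3)
print(selected)
```

Position 4 draws the most attention, and pooling pulls in its neighbors, so the kept set is the contiguous cluster 3 to 5 rather than three scattered positions.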

5. External memory / RAG offload

Not KV compression per se, but the architectural alternative: keep a short working KV window and re-retrieve passages each turn. See our guide on RAG chunking when the product can tolerate retrieval latency in exchange for not holding every passage in context.

Eviction policy design

Production eviction is a policy problem, not just an algorithm pick:

  • Budget tokens — hard cap on stored KV (e.g. 8K) with guaranteed slots for system prefix and last N user turns.
  • Protected ranges — never evict tool-schema blocks or citation spans the UI depends on.
  • Trigger point — evict when allocated blocks exceed 85% of pool, not only at OOM.
  • Quality gate — run needle tests after policy changes; regression in long-doc F1 beats theoretical memory savings.
  • Rollback — per-request flag to disable eviction for compliance audits requiring full verbatim context replay.
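
The policy elements above can be captured in a small configuration object. This is a minimal sketch with hypothetical names and defaults taken from the examples in the list; it is not the API of any particular serving engine.

```python
from dataclasses import dataclass, field


@dataclass
class EvictionPolicy:
    budget_tokens: int = 8192                  # hard cap on stored KV tokens
    trigger_utilization: float = 0.85          # start evicting above this pool fraction
    protected_ranges: list = field(default_factory=list)  # [start, end) token spans never evicted
    bypass: bool = False                       # per-request switch for full-context sessions

    def should_evict(self, allocated_blocks, total_blocks):
        return not self.bypass and allocated_blocks / total_blocks > self.trigger_utilization

    def is_protected(self, pos):
        return any(start <= pos < end for start, end in self.protected_ranges)

    def victims(self, scores):
        """Lowest-scoring unprotected positions to drop so the sequence fits the budget."""
        excess = len(scores) - self.budget_tokens
        if self.bypass or excess <= 0:
            return []
        candidates = [p for p in range(len(scores)) if not self.is_protected(p)]
        candidates.sort(key=lambda p: scores[p])
        return sorted(candidates[:excess])


policy = EvictionPolicy(budget_tokens=4, protected_ranges=[(0, 2)])
scores = [0.0, 0.0, 0.9, 0.1, 0.5, 0.2]        # made-up per-token importance scores
print(policy.should_evict(86, 100), policy.victims(scores))
```

Positions 0 and 1 have the lowest scores but sit in a protected range, so the policy drops positions 3 and 5 instead. Keeping protection separate from scoring means the scoring method can change without putting tool schemas or citation spans at risk.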

In paged engines, eviction frees physical blocks but must preserve logical token order for positions the model still attends to. Sliding-window models encode the window size in the architecture, and engines support them natively; eviction on full-attention models is a serving-layer approximation.

Harbor Analytics long-RAG refactor

Harbor's agent jobs mixed 12–32K token contexts. The refactor stacked techniques instead of betting on one:

  1. Baseline — measured KV bytes per token, block pool exhaustion rate, and long-doc answer F1 on 120 golden questions.
  2. FP8 KV — enabled in vLLM; F1 delta < 0.4% on golden set; immediate 47% KV byte reduction.
  3. Prefix SnapKV — compressed 4,200-token shared month-end template; 31% fewer prefix blocks per session.
  4. H2O budget — 10,240-token retain cap with 256-token sink + 2,048-token recent tail; middle evicted by attention score.
  5. Admission — estimated KV blocks now use compressed budget, not raw max_model_len.
  6. Monitoring — alert when eviction rate > 30% of tokens per session or long-doc F1 drops > 2% week-over-week.
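
Step 5 is the one most often skipped, so here is a hedged sketch of what a compressed-budget admission estimate might look like. The function names are hypothetical and the block size of 16 tokens is an assumption; substitute your engine's configured value.

```python
import math


def estimated_kv_blocks(prompt_tokens, max_new_tokens, retain_cap, block_size=16):
    """Blocks a session is expected to hold once eviction caps stored tokens at retain_cap."""
    stored = min(prompt_tokens + max_new_tokens, retain_cap)
    return math.ceil(stored / block_size)


def admit(free_blocks, prompt_tokens, max_new_tokens, retain_cap, block_size=16):
    """Admit the session only if its estimated steady-state blocks fit in the free pool."""
    return estimated_kv_blocks(prompt_tokens, max_new_tokens, retain_cap, block_size) <= free_blocks


capped = estimated_kv_blocks(24_000, 1_000, retain_cap=10_240)
uncapped = estimated_kv_blocks(24_000, 1_000, retain_cap=10**9)
print(capped, uncapped, admit(1_000, 24_000, 1_000, retain_cap=10_240))
```

With a 10,240-token retain cap, a 24,000-token prompt plus 1,000 generated tokens is estimated at 640 blocks instead of 1,563. This reflects steady-state decode only: if the full prompt KV is resident during prefill before eviction runs, admission needs a separate check for that peak.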

Month-end rejection rate fell from 41% to 6%; median concurrent 32K sessions per GPU rose from 2.1 to 5.4. Needle-in-haystack recall at 24K context held at 94% versus 96% full KV — acceptable for Harbor's policy QA use case.

Technique decision table

Your situation | Prefer | Avoid
Contexts under 8K, memory not binding | FP16 KV + paging only | H2O complexity
Uniform byte savings, quality-sensitive | FP8 KV with golden regression tests | Aggressive middle eviction
Streaming chat, recent context dominates | Sinks + sliding window (StreamingLLM) | Full-length KV on 128K windows
Long doc QA, facts anywhere in file | RAG with smaller KV window, or H2O with high budget | Fixed small window without retrieval
Identical system prefix across tenants | SnapKV + prompt caching + prefix block share | Per-session full prefix KV
Compliance requires full audit trail | Store transcript externally; KV eviction OK if logs complete | Silent eviction without session record
Provider API only | Shorter context + RAG; use provider prompt caching | Custom H2O on client

Common pitfalls

  • Eviction without quality monitoring. Memory graphs look great while long-doc accuracy collapses.
  • Tiny sliding windows on RAG agents. Retrieved chunk sits outside the window by turn three.
  • FP8 KV validated on average QA only. Aggregate scores hide tail failures on rare entity lookups; include needle-in-haystack tests.
  • Evicting tool-schema tokens. Model forgets JSON shape mid-generation.
  • Ignoring prefill KV peak. Compression helps decode pool but prefill can still OOM on single-shot 100K uploads.
  • Mixing eviction with the wrong architecture. Sliding-window models need native support; do not bolt H2O onto state-space or hybrid models such as Mamba, whose recurrent layers keep no per-token KV cache, without validation.
  • No per-tenant protected prefixes. Multi-tenant sink collisions evict another user's system prompt in shared pools.
  • Admission math still uses max_model_len. Pool fills despite compression logic.

Production checklist

  • Measure KV bytes per token and block pool utilization before tuning.
  • Enable FP8 KV first if engine supports it; run long-context golden tests.
  • Define protected token ranges (system, tools, citations).
  • Pick eviction family matched to product (streaming vs document QA).
  • Integrate compressed budget into paged admission formulas.
  • Compress shared prefixes with SnapKV or prefix block sharing.
  • Monitor eviction rate, F1/recall, and OOM rate weekly.
  • Expose a full-context bypass flag for compliance sessions.
  • Document quality tradeoffs in runbooks for support and sales.
  • Re-benchmark after model swaps (layers, heads, context length change).
  • Pair compression with RAG when context needs exceed physical KV budget.

Key takeaways

  • KV cache memory grows linearly with context length and often becomes the binding constraint on long-context workloads before compute saturates.
  • Compression stacks: FP8 precision cuts, sink+window eviction, H2O adaptive retention, and SnapKV prefix clustering address different bottlenecks.
  • Eviction is an approximation — protect prefixes, monitor long-doc quality, and keep audit logs independent of KV lifetime.
  • Harbor Analytics cut month-end admission rejections from 41% to 6% by combining FP8 KV, SnapKV prefixes, and H2O budgets on paged serving.
  • When facts must remain addressable anywhere in a document, prefer RAG or high H2O budgets over small fixed windows.
