Guide

LLM KV cache explained: prefill, decode and inference memory

When a transformer generates text, it does not recompute every prior token from scratch on each step. Instead it stores intermediate key and value tensors from self-attention — the KV cache — so later tokens can attend to earlier ones in constant time per layer. That optimization is why chat feels fast after the first word and why a 100K-token prompt can exhaust GPU memory even when the model weights fit comfortably. This guide explains prefill vs decode, how cache memory scales, grouped-query attention (GQA), PagedAttention, provider-level prompt caching, and practical patterns for multi-turn apps — tying cache economics to context window limits, transformer internals, and on-device inference constraints as edge NPUs push more workloads local.

Why transformers need a KV cache

Self-attention compares every token to every other token. For position t, the model projects hidden states into query (Q), key (K), and value (V) vectors. The attention score for token t attending to token s is a function of Q_t and K_s; the output blends V_s weighted by those scores.

During autoregressive decoding, you generate one new token at a time. Token 500 must attend to tokens 1–499. Recomputing K and V for all 499 predecessors on every step would make generation O(n²) in sequence length — unusably slow. The KV cache stores K and V for all processed positions so each new step only computes Q, K, and V for the new token and reuses cached tensors for history.

Without caching, a 2,000-token completion on a 70B model would be impractical. With caching, decode becomes roughly linear in output length for compute — but memory still grows with total context, which is the bottleneck teams hit first.

Prefill vs decode: two phases of inference

Serving systems split each request into distinct phases with different performance profiles:

Prefill (prompt processing)

Prefill ingests the entire input prompt — system message, retrieved documents, prior chat turns — in one or a few batched forward passes. Compute intensity is high: every token attends to every other token in the prompt (within causal masking). Prefill dominates time-to-first-token (TTFT). A 50-page PDF dumped into the prompt feels slow before the first character appears because prefill must finish building the full KV cache for all those tokens.

Decode (token generation)

Decode generates output one token at a time. Each step is relatively cheap in FLOPs — one new token against a growing cache — but you run thousands of sequential steps for a long answer. Decode dominates tokens-per-second and total latency for verbose completions. Users perceive decode as "streaming speed."

Optimizing a chat product requires tuning both: RAG pipelines should trim prefill bulk, while decoding benefits from speculative decoding, batching, and quantization (see our quantization guide).

How much memory does the KV cache use?

Rough order-of-magnitude for FP16 KV storage per token per layer:

bytes_per_token ≈ 2 × num_layers × num_kv_heads × head_dim × 2 bytes

The leading factor of 2 is K and V tensors. Multiply by total sequence length (prompt + generated tokens) and by batch size (concurrent requests sharing a GPU). For a 32-layer model with 8 KV heads and head dimension 128 in FP16:

2 × 32 × 8 × 128 × 2 = 131,072 bytes ≈ 128 KB per token

At 128K context that is roughly 16 GB of KV cache alone — before weights, activations, or CUDA overhead. This is why "context window" marketing numbers often exceed what you can actually run on a single consumer GPU without offloading or compression.

What shrinks the cache

Grouped-query attention (GQA) and multi-query attention (MQA) — share one K/V head group across many query heads, cutting KV memory by 4–8× vs full multi-head attention.
Lower precision — FP8 or INT8 KV caches trade slight quality loss for halved memory.
Shorter effective context — summarization, sliding windows, or retrieval instead of stuffing full history.
Smaller models — fewer layers and heads linearly reduce cache footprint.

GQA, MQA, and why model architecture matters

Early transformer LLMs used standard multi-head attention: every query head had its own K and V head. Multi-query attention (MQA) uses a single K/V head pair for all query heads — maximal memory savings, sometimes at a small quality cost. Grouped-query attention (GQA) is the compromise most production models (Llama 3, Mistral, many GPT-class stacks) adopted: a few K/V head groups, each shared by a subset of query heads.

When comparing open models for self-hosting, check not only parameter count but num_key_value_heads in the config. A 13B GQA model can serve longer contexts on the same GPU than a 13B MHA legacy architecture. Architecture choices made at training time directly determine your serving bill at inference time.

PagedAttention and efficient serving

Naive serving allocates one contiguous GPU buffer sized for each request's maximum context. That fragments memory: a 2K-token chat and a 90K-token agent job leave huge holes when shorter requests finish. PagedAttention (popularized by the vLLM serving framework) stores KV cache in fixed-size blocks analogous to OS virtual memory pages. Blocks map non-contiguously, so batching heterogeneous request lengths improves GPU utilization.

Production inference stacks — vLLM, TensorRT-LLM, TGI, SGLang — all implement variants of block-based KV management plus continuous batching, which inserts new requests into a running batch as others complete rather than waiting for the whole batch to finish. Together these techniques are why API providers can offer high throughput at seemingly low per-token prices: the hardware is amortized across many concurrent caches.

API prompt caching (OpenAI, Anthropic, others)

Cloud providers now expose prompt caching explicitly. You structure requests so a large static prefix — system prompt, tool schemas, documentation corpus — is marked cacheable. On subsequent calls with the same prefix, the provider reuses stored KV states instead of recomputing prefill, cutting cost and TTFT dramatically.

Typical patterns:

Put stable content first (system instructions, JSON tool definitions, long reference docs).
Put variable content last (user message, session-specific state).
Keep cache breakpoints aligned — changing one byte in the prefix invalidates the cache.
Monitor cache hit metrics; misses often mean accidental nondeterminism (timestamps in system prompts, unordered JSON keys).

Cached tokens are billed at a reduced rate on many APIs, but minimum cacheable lengths apply (often 1,024+ tokens). Small prompts see no benefit. Agent frameworks that resend the same 20K-token tool manifest every turn without caching can spend 10× what a cache-aware client would.

Multi-turn chat: what actually grows

Each assistant reply appends tokens to the context. Turn 10 of a support chat may carry the full transcript unless you prune. Strategies:

Rolling window — keep the last N tokens of history; simplest but loses early facts.
Summary buffer — periodically compress older turns into a running summary message (cheaper cache than verbatim replay).
Retrieval over history — embed past turns, fetch only relevant snippets (pairs with RAG).
Server-side session cache — some providers let you pass a cache key or conversation ID so KV state persists server-side between HTTP calls without resending text.

The worst pattern is naive "append everything to messages[]" in long agent loops. Memory, cost, and latency compound until you hit the context ceiling mid-task.

KV cache on edge and mobile devices

On-device LLMs — Apple Intelligence, Gemini Nano, local llama.cpp builds — face tighter RAM ceilings than datacenter GPUs. A 3B–8B quantized model may fit in 4–8 GB total, leaving little room for a 32K KV cache. Edge runtimes use aggressive quantization, shorter default contexts, and sometimes KV cache quantization or eviction policies that drop oldest tokens.

Hybrid architectures route easy queries to a local model with a small cache and escalate to cloud APIs for long-context tasks. Understanding KV economics explains why "run the same 70B cloud model on your phone" is not a near-term promise — the cache scales with context, and phone RAM does not. See our edge AI guide for NPU and hybrid routing detail.

Common mistakes

Resending megabyte prompts every turn without provider caching or prefix deduplication.
Assuming context window equals usable context — KV memory may OOM well below the advertised token cap.
Ignoring batch size — eight concurrent 64K sessions are not eight times one session; they share GPU RAM.
Mutable system prompts — dynamic timestamps or random few-shot examples prevent cache hits.
No prefill budget in SLAs — streaming UX hides decode latency but not a 30-second first token on huge RAG dumps.

Production checklist

Measure TTFT (prefill) and tokens/sec (decode) separately in benchmarks.
Structure prompts with stable prefixes first; enable provider prompt caching where available.
Track num_key_value_heads when choosing self-hosted models.
Cap effective history length; summarize or retrieve instead of unbounded append.
Size GPU memory as weights + peak KV cache + batch overhead — not weights alone.
Use serving frameworks with PagedAttention and continuous batching for multi-tenant loads.
Re-evaluate after quantization — INT4 weights help, but KV cache may become the new bottleneck.
Test edge deployments at realistic context lengths, not just single-turn queries.

Key takeaways

KV cache stores attention keys and values so decode does not recompute history.
Prefill processes the prompt and dominates time-to-first-token; decode streams tokens and dominates long answers.
Cache memory scales with layers × KV heads × head dim × sequence length × batch.
GQA/MQA and FP8 KV caches reduce memory; PagedAttention improves utilization.
Prompt caching on APIs rewards stable prefixes and punishes chaotic system prompts.