Guide

LLM prompt caching explained: prefix reuse, cost savings and API design

Every LLM API call starts with a prefill pass: the model reads your entire prompt before generating the first output token. If thousands of users send the same 8,000-token system prompt, tool schema, and knowledge-base preamble on every request, the provider recomputes identical attention work again and again. Prompt caching (also called prefix caching) stores the computed internal state for a reusable prefix so subsequent requests skip most of that work — cutting latency and, on many platforms, charging a discounted rate for cached input tokens. This is distinct from the per-request KV cache inside a single generation; prompt caching persists across separate HTTP calls. This guide explains how major APIs implement it, how pricing differs from uncached tokens, where to place cache breakpoints in RAG pipelines, and architectural patterns that maximize hit rate without breaking context budgets.

Prompt caching vs KV cache

Inside one inference run, transformers cache key and value tensors so each new token does not re-attend to all prior tokens from scratch. That in-flight KV cache lives for the duration of a single request and is freed when generation ends.

Provider prompt caching operates one level higher: the API remembers a hash of a long, identical prefix (system instructions, static documents, JSON tool definitions) and reuses the precomputed KV state when another request shares that prefix. You still pay for tokens, but often at a lower cached-input rate, and time-to- first-token (TTFT) drops because the provider skips re-running attention on tens of thousands of already-seen tokens.

Think of KV cache as scratch paper for one answer; prompt caching is a shared whiteboard in the provider’s data center that many customers can read from if they wrote the same opening paragraphs.

How providers implement prefix caching

Implementations differ, but the mental model is consistent:

  1. You send a prompt whose leading tokens are byte-identical to a prior request (same model, same tokenizer).
  2. The provider matches a cache key derived from that prefix (often a rolling hash of token IDs).
  3. On a hit, prefill work for the matched prefix is skipped or shortened; only new suffix tokens run full attention.
  4. Response metadata reports how many input tokens were served from cache vs computed fresh.

Minimum prefix length. Caching usually kicks in only above a threshold — commonly 1,024 to 2,048 tokens — because hashing and storage overhead is not worth it for short prompts. A 200-token system message alone may never cache; a 6,000-token policy manual plus tool schema will.

Cache breakpoints. Some APIs let you mark explicit boundaries with special message fields or cache-control headers so static content is grouped separately from per-user variables. Place breakpoints after everything that stays constant across users and before session-specific text (user ID, live query, retrieved chunks that change every call).

TTL and eviction. Caches are not permanent. Typical retention ranges from minutes to about an hour of inactivity; popular prefixes (widely shared system prompts) may stay warm longer. Do not assume a cache hit on the first request — the initial call usually pays full prefill cost to populate the entry.

Token pricing: cached vs uncached input

Billing models increasingly split input tokens:

  • Uncached input — full list price for tokens the model must prefill from scratch.
  • Cached input — discounted rate (often 50–90% off) for prefix tokens served from cache.
  • Output tokens — typically unchanged; caching does not discount generation.

Example economics: a support bot sends a 10,000-token policy pack plus 500 tokens of user-specific context per ticket. Without caching, you pay list price on 10,500 input tokens every time. With a warm cache on the 10,000-token static block, you pay full price on 500 fresh tokens plus the reduced cached rate on 10,000 — a dramatic drop at high volume. Always read the provider’s current price sheet; cached-token discounts and minimum lengths change between model releases.

Instrument your app: log cache_creation_input_tokens, cache_read_input_tokens, or equivalent fields from API responses. A RAG app with 5% cache hit rate is leaving money on the table; 70%+ on the static prefix is a healthy target for multi-tenant SaaS with shared instructions.

Prompt layout patterns that maximize cache hits

Static-first ordering

Put immutable content at the start of the prompt: system persona, safety policies, output JSON schema, few-shot examples that never change, and stable tool definitions. Append volatile content last: current timestamp, user message, retrieved document chunks, conversation tail. Any byte change in the prefix invalidates the cache for everything after the edit point.

RAG document placement

In retrieval-augmented generation, retrieved chunks usually change per query — they belong after a cache breakpoint, not inside the static system block. If multiple users query the same corpus slice (e.g., a shared product manual section), identical retrieval results can extend the cacheable prefix, but that is rare; design for per-query suffix variance.

Agents and tool schemas

Agent frameworks often inject large OpenAPI-style tool lists. Keep tool JSON stable (sorted keys, consistent whitespace) and version it explicitly: bump a tools_version string in the static block when definitions change so you are not surprised by silent cache invalidation. Dynamic tool results belong in user or tool-role messages after the breakpoint.

Multi-turn conversations

Growing chat history appends to the suffix, which is correct — only the shared prefix should cache. Some teams restructure long threads: summarize older turns into a static “memory” block (cacheable if identical across sessions) plus a short recent window. Pair with agent memory tiers so you do not blow the context window re-sending full logs.

Latency and throughput gains

Cost savings are only half the story. Prefill is often the dominant latency for long prompts — especially on served models under load. Skipping 20,000 tokens of prefill can shave hundreds of milliseconds to multiple seconds off TTFT, which matters for interactive chat and agent loops that issue dozens of model calls per user action.

Throughput improves because GPU cycles previously spent on redundant prefix attention can serve other customers. During traffic spikes, apps with high cache hit rates see fewer queue delays — another reason to treat prompt layout as infrastructure, not prompt-engineering trivia.

Pitfalls and security considerations

  • Invisible invalidation. Trailing whitespace, model version bumps, or temperature in the wrong field can change tokenization and miss cache silently. Pin model IDs and normalize prompt templates in CI.
  • Over-caching secrets. Do not put API keys or per-user PII in the static prefix just to chase hit rate — cached blocks may persist on shared infrastructure. Keep secrets out of prompts entirely.
  • Cross-tenant leakage (provider responsibility). Reputable providers isolate cache entries cryptographically; still avoid putting one customer’s private data in a prefix shared with another tenant’s requests.
  • False economy on tiny prompts. Below the minimum token threshold, caching adds complexity with no benefit. Measure before refactoring.
  • Stale instructions. A warm cache can serve an old system prompt for minutes after you deploy new policy text until the entry expires. For safety-critical updates, change a version string in the prefix to force invalidation.

Self-hosted and open-weight inference

If you run vLLM, TGI, or similar on your own GPUs, provider-style cross-request prompt caching may be off by default but analogous features exist: automatic prefix caching in recent vLLM releases, RadixAttention, and session-based KV reuse for multi-turn chats on the same server instance. Economics differ — you save compute, not API list price — but the same static-first layout principles apply.

Production checklist

  • Audit prompts: identify the longest byte-stable prefix shared across requests.
  • Reorder messages so static content precedes per-user and per-query suffixes.
  • Set explicit cache breakpoints if your API supports them.
  • Log cached vs uncached token counts on every call; dashboard hit rate weekly.
  • Version system prompts (policy_v3) and bump on deploy to control staleness.
  • Normalize JSON tool schemas (sorted keys, no pretty-print drift).
  • Keep retrieved RAG chunks and chat tails after breakpoints, not inside static blocks.
  • Load-test TTFT with cold vs warm cache; set SLOs on both paths.
  • Compare total cost against smaller models or quantization if cache hit rate stays low.
  • Document which model IDs and regions support caching — features vary by endpoint.

Key takeaways

  • Prompt caching reuses computed prefix state across API calls; KV cache only helps within one generation.
  • Static-first prompt layout and explicit breakpoints maximize hit rate for agents, RAG, and support bots.
  • Cached input tokens are billed at a discount; output tokens generally are not.
  • First request populates cache; benefits appear on subsequent identical-prefix calls within TTL.
  • Instrument cache read/create metrics — layout optimization is a cost and latency lever, not optional polish.
  • Version static blocks deliberately when policies change to avoid serving stale instructions.

Related reading