Guide
LLM agent prompt caching systems explained
Harbor Analytics’ compliance agent sent the same 18 KB policy corpus, tool JSON schemas, and few-shot examples on every turn of a 12-step investigation. Prefill tokens accounted for 71% of total token spend even though the prefix barely changed between steps. Median run cost hit $3.80 while investigators waited on redundant GPU work. After restructuring prompts into cacheable static prefixes and dynamic suffixes, enabling provider-level prompt caching, and tying cache keys to release fingerprints, prefill share fell to 34%, median run cost dropped to $1.82, and task success held at 94%. Prompt caching is not a model feature you toggle once — it is a systems design choice about what stays identical across turns, how you key and invalidate that identity, and how you prevent one tenant’s prefix from warming another tenant’s cache incorrectly.
Prompt caching (also called prefix caching or KV-cache reuse) lets providers skip recomputing attention over repeated leading tokens. For agents that loop tools across dozens of turns, the savings compound on the static head of each request — system instructions, tool definitions, retrieved documents that do not change mid-run, and long few-shot packs. This guide covers prompt segmentation, provider cache semantics, cache key design, invalidation and TTL, multi-tenant isolation, integration with context budgets and cost attribution, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.
What prompt caching is and why agents need it
Large language models process prompts in two cost phases: prefill (ingest the full context, build internal key-value caches) and decode (generate new tokens one at a time). Agent loops are prefill-heavy: each tool round-trip resends the entire conversation, tool outputs, and schemas. Without caching, you pay full prefill price on bytes that are bitwise identical to the previous turn.
Provider prompt caches store the computed KV state for a prefix — a contiguous leading block of tokens. When a new request shares that exact prefix under the same cache key, the provider reuses stored state and bills cached input tokens at a reduced rate (often 50–90% cheaper than uncached prefill, depending on vendor and tier).
Caching complements but does not replace tool result summarization or context budgeting. Compression shrinks what you send; caching makes repeated sends cheap. The highest-leverage agents do both.
Segmenting prompts: static, semi-static, and dynamic
Production agent prompts should be assembled in explicit layers so engineers know what can sit in a cacheable prefix:
- Static layer — system instructions, safety policy, brand voice, global tool-use rules. Changes only on deploy. Ideal cache candidate.
- Semi-static layer — tool JSON schemas, few-shot trajectories, RAG document packs pinned for the duration of a run. Cacheable within a session or until document version bumps.
- Dynamic layer — user message, rolling conversation, fresh tool results, timestamps, session IDs. Must live in the suffix after the cache breakpoint.
Providers that support explicit cache breakpoints (for example
Anthropic’s cache_control blocks) let you mark
up to several boundaries per request. Place breakpoints
after the last token that will repeat unchanged on
the next turn. A common mistake is caching through the user message
— the breakpoint moves every turn and the cache never hits.
For multi-tool agents, keep tool schemas in the static or semi-static block. If you add or rename a tool mid-run, the prefix hash changes and the cache invalidates — design tool registries so schema changes coincide with run boundaries.
Cache keys, TTL, and invalidation
A cache hit requires more than identical bytes. Providers implicitly or explicitly key caches on:
- Model ID and revision — switching models misses even with the same text.
- Sampling parameters — some vendors include temperature and top-p in the key; treat decoding config as part of the release fingerprint.
- Tool and schema hash — hash the canonical JSON of your tool manifest; bump the hash on any field change.
- Prompt version — tie system prompts to a semver or git SHA surfaced in canary deployments so rollbacks do not serve stale policy from cache.
TTL varies by provider: ephemeral caches may expire in minutes of disuse; extended tiers hold hours for large prefixes. Architect for graceful miss: a cache miss should cost more, not fail. Log hit rate per layer and alert when hit rate drops after a deploy — often the first signal that someone edited the system prompt without bumping the version tag.
Invalidation triggers you should automate:
- Prompt or tool schema deploy (version bump).
- Model routing change in fallback ladders.
- RAG corpus refresh when retrieved chunks are embedded in the semi-static prefix.
- Regulatory policy updates that alter compliance instructions.
Multi-tenant isolation and security
Shared infrastructure must not let Tenant A warm a cache entry that includes Tenant B data. Rules:
- Never place tenant-specific secrets, API keys, or customer PII in a cross-tenant static prefix. Run PII detection before any block marked cacheable.
- Include tenant_id in semi-static layers when per-tenant tool allowlists or policy overlays differ. Accept lower hit rate across tenants in exchange for isolation.
- Align with tenant isolation boundaries for vector stores and secrets — a cached prefix is still persisted state on the provider side.
- Document data-processing agreements: cached KV state may reside in provider memory longer than a single request.
Observability and FinOps integration
Instrument every agent request with:
cached_input_tokensvsuncached_input_tokensfrom provider usage metadata.- Cache hit rate per static layer version and per agent workflow.
- Prefill share of total cost (prefill + decode) before and after caching.
- Correlation with trace spans so FinOps can attribute savings to a specific prompt refactor.
Feed counters into per-run cost ledgers. A 50% drop in invoice total that does not show up in cached-token metrics is probably routing or model change, not caching.
Pair caching with rate limiting: cheaper prefill can increase effective throughput and mask upstream TPM pressure until you hit decode bottlenecks instead.
Harbor Analytics refactor walkthrough
Harbor’s platform team restructured the compliance investigation agent in five steps:
- Prompt autopsy — logged token counts per layer; discovered 18 KB policy + 6 KB tool schemas repeated on all 12 turns.
- Layer split — moved policy, schemas, and six few-shot examples into a static prefix; user facts and tool JSON results into a dynamic suffix.
- Cache breakpoints — two breakpoints: after policy and after few-shots; RAG snippets pinned per case ID in semi-static block with case-scoped cache key.
- Release fingerprint — middleware hook
prepended
prompt_version=2026.06.04to traces and canary gates; cache misses spiked 4% on deploy then stabilized at 88% hit rate on static layers. - FinOps dashboard — cached-token cost column in run ledger; weekly review compared hit rate vs task success.
Outcomes: prefill share 71% → 34%; median run cost $3.80 → $1.82; p95 latency improved 22% on cache hits; task success 94% unchanged; annualized inference savings $1.1M at Harbor’s volume.
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Provider prompt caching | Multi-turn agents with large static prefixes | Single-shot Q&A; prefix changes every turn |
| Context compression / summarization | Shrinking dynamic history | Static bytes still re-sent at full prefill |
| Smaller or distilled models | Uniform quality downgrade acceptable | Complex tool reasoning required |
| Client-side prompt memoization | Self-hosted models with explicit KV APIs | Managed APIs without cache hooks |
| RAG chunk pinning per run | Investigations with stable evidence set | Streaming retrieval every turn |
| Caching + compression together | Long-horizon production agents | Added engineering complexity |
Common pitfalls
- Breakpoint after volatile content — placing the cache mark after the user message guarantees zero hits.
- Silent prompt edits — editing system text without version bump causes stale-policy incidents that are hard to debug.
- PII in cached prefixes — customer data in a shared static block leaks across sessions on some provider implementations.
- Over-caching huge RAG dumps — caching 200 KB of retrieved text per tenant may exceed TTL or storage limits; pin only stable subsets.
- Ignoring decode cost — caching fixes prefill; verbose final answers still burn decode tokens.
- Assuming OpenAI and Anthropic behave identically — read each vendor’s minimum cacheable length, TTL, and billing line items.
- Cache hit rate as vanity metric — 95% hit rate on a 200-token prefix saves little; prioritize bytes under cache.
Production checklist
- Map prompt layers: static, semi-static, dynamic with token counts.
- Place provider cache breakpoints after the last repeating token.
- Hash tool schemas and system prompts; bump version on any change.
- Exclude tenant secrets and PII from cross-tenant static prefixes.
- Log cached vs uncached input tokens on every agent span.
- Dashboard hit rate, prefill share, and cost per successful task.
- Test cache miss path in load tests (cold start latency).
- Coordinate invalidation with canary deploy and rollback runbooks.
- Re-evaluate after adding tools or expanding few-shot packs.
Key takeaways
- Prompt caching reuses KV state for identical prefixes — the biggest wins are multi-turn agents with large static heads.
- Segment prompts deliberately — static policy and tool schemas belong before the cache breakpoint; conversation belongs after.
- Cache keys must include model, schema hash, and prompt version — tie invalidation to deploy fingerprints.
- Combine caching with compression and budgets — they solve different parts of the token bill.
- Harbor Analytics cut median run cost 52% by restructuring prefixes and instrumenting cached-token FinOps, without changing task success.
Related reading
- LLM agent context budget and token management explained — allocation, compression, and per-turn limits
- LLM agent cost attribution and token accounting explained — per-run FinOps and cached-token ledgers
- LLM agent tool result summarization and truncation explained — shrinking dynamic suffixes
- LLM agent middleware hook pipeline explained — version tags and cross-cutting hooks at the inference boundary