Guide
LLM agent semantic caching and similar-query response reuse systems explained
Harbor Support runs a tier-1 customer agent: classify intent, retrieve account context, draft a reply, and escalate billing disputes. Analysts noticed 41% of resolved runs were full generations for questions that differed only in phrasing: “how do I reset my password,” “forgot login,” and “can’t sign in again” each burned the same tool chain and a 2,400-token completion. Exact-string caches hit less than 3% of traffic because users rarely repeat wording byte-for-byte. Prefix caches saved prefill on shared system prompts but did nothing for paraphrased user turns. After deploying a semantic response cache keyed on normalized embeddings with tenant-scoped namespaces and freshness gates, duplicate full generations fell from 41% to 6.2%, median cost per tier-1 resolution dropped from $0.38 to $0.14, and p95 latency improved by 1.8 seconds, without increasing wrong-answer escalations.
Semantic caching (also called similar-query response reuse) stores prior agent outputs indexed by embedding vectors so near-duplicate questions can skip model generation when similarity, policy, and freshness checks pass. It sits above tool result caches (which dedupe API fetches) and beside prompt prefix caches (which dedupe static instruction prefill). This guide covers lookup pipelines, cache key composition, similarity thresholds, safety exclusions, invalidation, the Harbor Support refactor, a decision table versus adjacent techniques, pitfalls, and a production checklist tied to cost attribution and RAG retrieval.
Why exact caches miss paraphrases
Production agents see high lexical variance on the same intent:
- Typos and autocorrect — “refund pls” vs “please issue a refund”
- Synonyms — “cancel subscription” vs “stop billing”
- Multilingual mixing — same question in English and Spanish
- Context padding — users paste order IDs, screenshots, or rage paragraphs around a one-line ask
Exact-match caches (hash of normalized UTF-8) and idempotency dedup only collapse byte-identical replays. Semantic caches embed the intent-bearing slice of the user message, search a vector index for neighbors above a similarity threshold, and return a stored response when policy allows. The tradeoff is staleness risk: a cached answer about return policy v3 must not serve after v4 ships. Freshness classes and namespace versioning solve that — not blind TTL alone.
Architecture: four layers
Mature semantic cache stacks separate concerns so false hits are rare and debuggable:
1. Query normalizer
Strip boilerplate before embedding: trim whitespace, lowercase (locale-aware where needed), remove PII patterns (emails, card numbers) into placeholders, collapse repeated punctuation, and optionally run a lightweight intent classifier so “billing” and “technical” lanes never cross-match. Log both raw and normalized text on the span for audits.
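The normalization steps above can be sketched as a single function. This is a minimal illustration, not Harbor's implementation: the regex patterns, the placeholder tokens, and the name normalize_query are assumptions, and real PII coverage needs a broader, tested pattern set.

```python
import re

# Hypothetical patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")  # 13 to 19 digits, optional separators
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")
WHITESPACE_RE = re.compile(r"\s+")


def normalize_query(raw: str) -> str:
    """Return the text to embed; the caller logs both raw and normalized forms."""
    text = raw.casefold()                     # lowercase before redaction so placeholders stay uppercase
    text = EMAIL_RE.sub("<EMAIL>", text)      # replace emails with a placeholder
    text = CARD_RE.sub("<CARD>", text)        # replace card-like digit runs
    text = REPEAT_PUNCT_RE.sub(r"\1", text)   # "!!!" becomes "!"
    return WHITESPACE_RE.sub(" ", text).strip()


print(normalize_query("  Forgot LOGIN!!!  mail me at Jo.Doe@Example.com  "))
```

Redacting into placeholders rather than deleting keeps the sentence shape intact, so the embedding still sees that an email or card number was mentioned without seeing its value.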
2. Embedding and index
Compute embedding = embed(normalized_query) with a fixed model version pinned in config. Store vectors in a per-tenant index (HNSW, IVF, or managed vector DB). Each record includes: cache_id, embedding_model@version, namespace, similarity_score_at_write, response_payload, source_run_id, freshness_class, expires_at, and policy_tags (e.g. no_pii, faq_tier1).
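One way to picture the record is as an immutable dataclass. The field names come from the list above; the Python types, the is_expired helper, and the sample values are assumptions for illustration, and the vector itself is assumed to live in the index rather than on the record.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Tuple


@dataclass(frozen=True)
class CacheRecord:
    cache_id: str
    embedding_model: str              # "model@version", pinned in config
    namespace: str
    similarity_score_at_write: float
    response_payload: str
    source_run_id: str
    freshness_class: str
    expires_at: datetime
    policy_tags: Tuple[str, ...] = ()

    def is_expired(self, now: datetime) -> bool:
        """Hypothetical helper: an entry at or past expires_at is never served."""
        return now >= self.expires_at


# Sample values are made up for the sketch.
written_at = datetime(2024, 1, 1, tzinfo=timezone.utc)
record = CacheRecord(
    cache_id="c-001",
    embedding_model="example-embed@1",
    namespace="acme/support/tier1@v2.3",
    similarity_score_at_write=1.0,
    response_payload="To reset your password, open Settings and choose Security.",
    source_run_id="run-123",
    freshness_class="STATIC_FAQ",
    expires_at=written_at + timedelta(days=7),
    policy_tags=("no_pii", "faq_tier1"),
)
```

Keeping source_run_id on the record makes a suspected false hit traceable back to the run that produced the cached answer.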
3. Similarity gate
On lookup, retrieve top-k neighbors (k=3–5). Accept a hit only if:
- Cosine similarity ≥ threshold (typical 0.92–0.97 for support FAQs; tune per lane)
- Same namespace (tenant + agent route + template version)
- Entry not expired and freshness class still valid
- No do_not_cache tag on the incoming query
- Optional: second-stage cross-encoder rerank when k candidates tie near threshold
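lookup = semantic_cache.get(normalized, namespace="support/tier1@v2.3")
if lookup.hit:
    # Tag the trace so hit rate and similarity stay auditable.
    trace.set("cache.semantic", "HIT", similarity=lookup.score)
    return adapt_cached_response(lookup.payload, user_context)
else:
    trace.set("cache.semantic", "MISS")
    response = run_full_agent(...)
    # Write only when freshness class and quality gates allow it.
    if cacheable(response):
        semantic_cache.put(normalized, response, namespace=...)
    return response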
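The acceptance rules behind a lookup like the one above can be sketched in plain Python. This is an assumption-laden illustration: Candidate, cosine, and pick_hit are hypothetical names, the neighbors are assumed to be already retrieved from the index, the freshness-class check is reduced to an expiry timestamp, and the optional cross-encoder rerank is omitted.

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Candidate:
    cache_id: str
    namespace: str
    embedding: Sequence[float]
    expires_at: float  # epoch seconds


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pick_hit(
    query_vec: Sequence[float],
    candidates: List[Candidate],
    namespace: str,
    threshold: float,
    now: float,
    do_not_cache: bool = False,
) -> Optional[Tuple[Candidate, float]]:
    """Return the best (candidate, score) passing every gate, or None."""
    if do_not_cache:
        return None
    best: Optional[Tuple[Candidate, float]] = None
    for cand in candidates:
        # Namespace and expiry are hard filters, applied before similarity.
        if cand.namespace != namespace or cand.expires_at <= now:
            continue
        score = cosine(query_vec, cand.embedding)
        if score >= threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best
```

Applying the namespace and expiry filters before the similarity comparison means a perfect-scoring neighbor from another tenant or an expired entry can never win.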
4. Response adapter
Cached text is rarely returned verbatim. An adapter merges stored answer skeleton with live slots: user name, order status from a quick tool read, current date. Heavy personalization stays out of the cache body; only stable FAQ prose is cached. If adaptation needs more than one cheap tool, bypass cache and regenerate.
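A minimal sketch of such an adapter, assuming the cached skeleton uses $slot placeholders from Python's string.Template. The function name mirrors the call in the lookup snippet, but the signature, the slot names, and the convention of returning None to signal "bypass and regenerate" are assumptions.

```python
from string import Template
from typing import Mapping, Optional


def adapt_cached_response(skeleton: str, live_slots: Mapping[str, str]) -> Optional[str]:
    """Fill live slots into a cached skeleton; None means regenerate instead."""
    try:
        return Template(skeleton).substitute(live_slots)
    except (KeyError, ValueError):
        # A missing slot or malformed placeholder: do not serve a half-filled answer.
        return None


skeleton = "Hi $user_name, order $order_id is currently $order_status."
slots = {"user_name": "Ana", "order_id": "A-1042", "order_status": "in transit"}
print(adapt_cached_response(skeleton, slots))
```

Using substitute rather than safe_substitute is deliberate in this sketch: a missing slot fails loudly and falls back to full generation instead of leaking a raw placeholder to the user.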
Cache key namespaces and versioning
Semantic similarity without namespace isolation causes cross-contamination:
| Namespace component | Why it must be in the key | Invalidation trigger |
|---|---|---|
| Tenant ID | Same question, different product policies | Tenant offboarding |
| Agent route | Billing vs technical intents overlap lexically | Route split or rename |
| Template / policy version | Answer text must match current rules | Template promotion |
| Embedding model version | Vectors incomparable across models | Model upgrade → rebuild index |
| Locale / language | Cross-language match may be undesired | Locale pack update |
Bump namespace suffix on promotion (e.g. support/tier1@v2.3 → v2.4) instead of deleting vectors in place. Old namespaces age out via TTL while in-flight runs finish.
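The components in the table can be composed into a single key, with promotion producing a new key rather than mutating entries. The tenant/route@version prefix follows the examples in this guide; the Namespace class and the way locale and embedding model are appended are assumptions for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Namespace:
    tenant: str
    route: str
    template_version: str
    locale: str
    embedding_model: str  # "model@version"

    def key(self) -> str:
        # Suffix layout after the version is a hypothetical choice.
        return (
            f"{self.tenant}/{self.route}@{self.template_version}"
            f"|{self.locale}|{self.embedding_model}"
        )

    def promoted(self, new_version: str) -> "Namespace":
        """Return a new namespace; entries under the old key age out via TTL."""
        return replace(self, template_version=new_version)


current = Namespace("acme", "support/tier1", "v2.3", "en-US", "example-embed@1")
nxt = current.promoted("v2.4")
print(current.key())
print(nxt.key())
```

Because any changed component yields a different key, lookups after a promotion simply miss against the new namespace, and no in-place deletion is needed.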
Freshness classes and when not to cache
Not every successful run should write to the semantic index:
- STATIC_FAQ — policy answers stable for days/weeks; long TTL; safe for semantic reuse
- SEMI_DYNAMIC — prices, inventory, account-specific facts; short TTL (minutes) or adapter-only slots
- EPHEMERAL — time-bound promos, incident banners; never cache or TTL < 60s
- PERSONALIZED — contains user PII in the answer body; do not store; lookup-only from anonymized queries
- TOOL_WRITE — run invoked a mutating tool; block cache write (idempotency dedup handles retries, not semantic reuse)
Classify at write time using route metadata, tool audit log, and output guardrails. A classifier misfire that caches a billing adjustment instruction is worse than skipping cache entirely.
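A write-time classifier along these lines might look like the sketch below. The five class names come from the list above; the boolean inputs, the precedence order (most restrictive first), and the TTL values are assumptions, not recommendations.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional


class Freshness(Enum):
    STATIC_FAQ = "STATIC_FAQ"
    SEMI_DYNAMIC = "SEMI_DYNAMIC"
    EPHEMERAL = "EPHEMERAL"
    PERSONALIZED = "PERSONALIZED"
    TOOL_WRITE = "TOOL_WRITE"


# Illustrative TTLs; None means "do not write to the semantic index".
WRITE_TTL = {
    Freshness.STATIC_FAQ: timedelta(days=7),
    Freshness.SEMI_DYNAMIC: timedelta(minutes=5),
    Freshness.EPHEMERAL: None,
    Freshness.PERSONALIZED: None,
    Freshness.TOOL_WRITE: None,
}


def classify(
    used_mutating_tool: bool,
    answer_has_pii: bool,
    time_bound: bool,
    account_specific: bool,
) -> Freshness:
    """Check the most restrictive conditions first so a misfire errs toward no write."""
    if used_mutating_tool:
        return Freshness.TOOL_WRITE
    if answer_has_pii:
        return Freshness.PERSONALIZED
    if time_bound:
        return Freshness.EPHEMERAL
    if account_specific:
        return Freshness.SEMI_DYNAMIC
    return Freshness.STATIC_FAQ


def write_ttl(freshness: Freshness) -> Optional[timedelta]:
    return WRITE_TTL[freshness]
```

In practice the boolean inputs would come from the route metadata, tool audit log, and guardrail output mentioned above; here they are passed in directly to keep the sketch self-contained.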
Write path, quality gates, and negative caching
Cache writes should pass quality gates so bad answers do not amplify:
- Run completed without escalation or human override
- User did not thumbs-down within a feedback window (if available)
- Guardrails passed; no policy violation flags
- Response length and structure within expected bounds for the route
- Similarity to existing entry: if a neighbor already exists above 0.98, update metadata instead of duplicating vectors
Negative caching (storing “no good answer” markers) is usually avoided in agent systems — a later model or tool fix should retry. Exception: confirmed out-of-scope intents with stable refusal templates.
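The write gates listed above reduce to a small decision function. This is a sketch under assumptions: RunOutcome, write_decision, and the character bounds are hypothetical, and only the 0.98 near-duplicate cutoff comes from the text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RunOutcome:
    escalated: bool
    human_override: bool
    thumbs_down: bool                     # False when no feedback signal exists
    guardrails_passed: bool
    response_chars: int
    nearest_similarity: Optional[float]   # best existing neighbor, if any


def write_decision(
    run: RunOutcome,
    min_chars: int,
    max_chars: int,
    duplicate_threshold: float = 0.98,
) -> str:
    """Return 'skip', 'update_metadata', or 'insert'."""
    if run.escalated or run.human_override or run.thumbs_down:
        return "skip"
    if not run.guardrails_passed:
        return "skip"
    if not (min_chars <= run.response_chars <= max_chars):
        return "skip"
    if run.nearest_similarity is not None and run.nearest_similarity > duplicate_threshold:
        # A near-identical entry exists: refresh it instead of adding a vector.
        return "update_metadata"
    return "insert"
```

Returning a three-way decision instead of a boolean keeps the near-duplicate case explicit, which helps keep the index from filling with redundant vectors for the same FAQ.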
Harbor Support refactor
Harbor’s before state: only exact Redis keys on normalized user text; prefix cache on system prompt; no embedding index. Paraphrased tier-1 FAQs re-ran retrieval + generation. After state:
- Query normalizer with PII redaction and intent lane tag
- Per-tenant HNSW index on text-embedding-3-small@2024-01 (pinned)
- Namespace = {tenant}/support/tier1@{template_version}
- Similarity threshold 0.94 for FAQ lane; 0.97 for billing (stricter)
- Response adapter filling order status via one read-only tool
- Write gates tied to no-escalation + guardrail pass
- Metrics on hit rate, false-hit proxy (re-ask within 5 min), cost per resolution
Results after eight weeks: duplicate full generations 41% → 6.2%, semantic hit rate stabilized at 34% of tier-1 volume, median cost $0.38 → $0.14, p95 latency 4.1s → 2.3s, and wrong-answer escalation rate unchanged (within measurement noise). Template promotions now bump namespace automatically via the template rollout controller.
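The false-hit proxy mentioned in the metrics bullet can be approximated from a turn log. The sketch below assumes a hypothetical event shape of (user_id, timestamp_seconds, was_cache_hit) and treats any next message from the same user inside the window as a re-ask, which is a simplification of whatever intent matching a real pipeline would apply.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple

Event = Tuple[str, float, bool]  # (user_id, timestamp_seconds, was_cache_hit)


def false_hit_proxy_rate(events: Iterable[Event], window_s: float = 300.0) -> float:
    """Share of cache hits followed by another turn from the same user within window_s."""
    by_user: Dict[str, List[Tuple[float, bool]]] = defaultdict(list)
    for user_id, ts, was_hit in events:
        by_user[user_id].append((ts, was_hit))

    hits = 0
    reasks = 0
    for turns in by_user.values():
        turns.sort()  # chronological order per user
        for i, (ts, was_hit) in enumerate(turns):
            if not was_hit:
                continue
            hits += 1
            if i + 1 < len(turns) and turns[i + 1][0] - ts <= window_s:
                reasks += 1
    return reasks / hits if hits else 0.0


sample = [
    ("u1", 0.0, True),     # hit, followed by a turn 120 s later: counted as a re-ask
    ("u1", 120.0, False),
    ("u2", 0.0, True),     # hit, next turn is 1000 s later: not a re-ask
    ("u2", 1000.0, True),  # hit with no later turn
]
rate = false_hit_proxy_rate(sample)
```

The 300-second default mirrors the 5-minute window named above; a rising rate after a threshold change is a signal to inspect recent hits, not proof that answers were wrong.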
Decision table: semantic cache vs adjacent techniques
| Approach | Primary win | When semantic cache is better | When the alternative wins |
|---|---|---|---|
| Exact string cache | Zero false hits; trivial implementation | High paraphrase rate; FAQ/support tiers | Low lexical variance; strict byte identity needed |
| Prompt prefix cache | Cheaper prefill on static instructions | Skip full completion on similar user questions | Cost is mostly system prompt, not user turn |
| Tool result cache | Avoid duplicate API/DB fetches | Same question, same answer text after tools | Bottleneck is tool latency, not generation |
| RAG retrieval | Ground answers in source documents | Reuse prior composed answers with adapter slots | Source docs change often; need citations every time |
| Smaller / faster model routing | Lower cost per token | Near-duplicate hits avoid any model call | Novel complex reasoning every turn |
Production stacks layer all rows: RAG for grounding, tool cache for fetches, prefix cache for instructions, semantic cache for paraphrased tier-1 resolutions, and full generation for novel or high-risk lanes.
Common pitfalls
- Global index without tenant namespaces — Tenant A’s refund policy serves Tenant B.
- Threshold too low — “cancel order” matches “cancel subscription”; tune per lane.
- Embedding model drift — Reindex on model change or hits silently degrade.
- Caching personalized bodies — PII leakage across users; cache anonymized templates only.
- Ignoring template version in namespace — Stale policy answers after prompt promotion.
- No adapter for dynamic slots — Cached “your order ships tomorrow” ages badly.
- Writing cache on failed guardrails — Amplifies bad outputs across paraphrases.
- Missing observability — Cannot distinguish hit rate from false-hit re-asks without trace tags.
Production checklist
- Normalize queries (PII redaction, intent lane) before embedding.
- Pin embedding_model@version; plan reindex on upgrade.
- Scope namespaces: tenant + route + template version + locale.
- Set per-lane similarity thresholds; use cross-encoder tie-break near cutoff.
- Classify freshness (STATIC_FAQ vs SEMI_DYNAMIC vs EPHEMERAL); block writes on TOOL_WRITE runs.
- Adapt cached responses with live read-only slots; cap adapter tool calls.
- Gate writes on guardrail pass, no escalation, and optional user feedback.
- Bump namespace on template promotion; TTL-retire old namespaces.
- Emit cache.semantic=HIT|MISS, similarity score, and cache_id on traces.
- Track false-hit proxy: same user re-asks within N minutes after a hit.
- Integrate hit savings into per-run cost ledgers.
- Load-test index latency at peak QPS; set p99 lookup SLO.
Key takeaways
- Paraphrases dominate support traffic; exact caches leave most savings on the table.
- Namespaces must include tenant, route, and template version to prevent cross-policy hits.
- Freshness classes and write gates matter more than similarity tuning alone.
- Adapt cached skeletons with live slots instead of serving frozen personalized text.
- Harbor Support cut duplicate generations from 41% to 6.2% with embedding-indexed semantic reuse.
Related reading
- LLM agent prompt caching systems explained — prefix reuse, cache keys and prefill cost control
- LLM agent tool result caching systems explained — exact keys, TTL classes and safe staleness
- LLM agent cost attribution and token accounting explained — per-run budgets, tool metering and FinOps
- LLM agent RAG retrieval pipeline systems explained — chunking, reranking and grounded generation