LLM agent semantic caching and similar-query response reuse systems explained

Harbor Support runs a tier-1 customer agent: classify intent, retrieve account context, draft a reply, and escalate billing disputes. Analysts noticed 41% of resolved runs were full generations for questions that differed only in phrasing: “how do I reset my password,” “forgot login,” and “can’t sign in again” each burned the same tool chain and a 2,400-token completion. Exact-string caches hit less than 3% of traffic because users rarely repeat wording byte-for-byte. Prefix caches saved prefill on shared system prompts but did nothing for paraphrased user turns. After deploying a semantic response cache keyed on normalized embeddings with tenant-scoped namespaces and freshness gates, duplicate full generations fell from 41% to 6.2%, median cost per tier-1 resolution dropped from $0.38 to $0.14, and p95 latency improved by 1.8 seconds, all without increasing wrong-answer escalations.

Semantic caching (also called similar-query response reuse) stores prior agent outputs indexed by embedding vectors so near-duplicate questions can skip model generation when similarity, policy, and freshness checks pass. It sits above tool result caches (which dedupe API fetches) and beside prompt prefix caches (which dedupe static instruction prefill). This guide covers lookup pipelines, cache key composition, similarity thresholds, safety exclusions, invalidation, the Harbor Support refactor, a decision table versus adjacent techniques, pitfalls, and a production checklist tied to cost attribution and RAG retrieval.

Why exact caches miss paraphrases

Production agents see high lexical variance on the same intent:

  • Typos and autocorrect — “refund pls” vs “please issue a refund”
  • Synonyms — “cancel subscription” vs “stop billing”
  • Multilingual mixing — same question in English and Spanish
  • Context padding — users paste order IDs, screenshots, or rage paragraphs around a one-line ask

Exact-match caches (hash of normalized UTF-8) and idempotency dedup only collapse byte-identical replays. Semantic caches embed the intent-bearing slice of the user message, search a vector index for neighbors above a similarity threshold, and return a stored response when policy allows. The tradeoff is staleness risk: a cached answer about return policy v3 must not serve after v4 ships. Freshness classes and namespace versioning solve that — not blind TTL alone.
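
To make the miss concrete, here is a minimal sketch of an exact-match key, assuming the key is a SHA-256 hash of lowercased, whitespace-collapsed UTF-8 text. The function name exact_key is hypothetical, and the three phrasings are the ones from the Harbor example.

```python
import hashlib

def exact_key(text: str) -> str:
    """Exact-match cache key: SHA-256 of lowercased, whitespace-collapsed UTF-8."""
    normalized = " ".join(text.lower().split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

paraphrases = [
    "how do I reset my password",
    "forgot login",
    "can't sign in again",
]
keys = {exact_key(p) for p in paraphrases}

# Three phrasings of one intent yield three distinct keys, so an exact cache misses.
print(len(keys))  # 3

# Only a replay that differs in case or spacing collapses onto the same key.
print(exact_key("How do I  reset my password ") == exact_key(paraphrases[0]))  # True
```

Light normalization widens the net only slightly: it absorbs case and spacing differences, not synonyms or reworded asks.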

Architecture: four layers

Mature semantic cache stacks separate concerns so false hits are rare and debuggable:

1. Query normalizer

Strip boilerplate before embedding: trim whitespace, lowercase (locale-aware where needed), remove PII patterns (emails, card numbers) into placeholders, collapse repeated punctuation, and optionally run a lightweight intent classifier so “billing” and “technical” lanes never cross-match. Log both raw and normalized text on the span for audits.
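
A normalizer along these lines might look like the sketch below. The regular expressions are deliberately simple assumptions (real PII detection needs a vetted library), the placeholder tokens <EMAIL> and <CARD> are hypothetical, and the intent classifier step is left out.

```python
import re

# Illustrative patterns only; production PII detection should be stricter.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
CARD_RE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")   # 13 to 19 digits, optional separators
REPEAT_PUNCT_RE = re.compile(r"([!?.,])\1+")

def normalize_query(raw: str) -> str:
    """Lowercase, redact PII into placeholders, collapse punctuation and spaces."""
    text = raw.lower()                      # placeholders are inserted after this step
    text = EMAIL_RE.sub("<EMAIL>", text)
    text = CARD_RE.sub("<CARD>", text)
    text = REPEAT_PUNCT_RE.sub(r"\1", text)
    return " ".join(text.split())

raw = "  My card 4111 1111 1111 1111 was charged!!!  Email: Ana@Example.com "
print(normalize_query(raw))
# my card <CARD> was charged! email: <EMAIL>
```

Redacting before embedding keeps account-specific tokens out of the vector, so two users asking the same question land closer together in the index.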

2. Embedding and index

Compute embedding = embed(normalized_query) with a fixed model version pinned in config. Store vectors in a per-tenant index (HNSW, IVF, or managed vector DB). Each record includes: cache_id, embedding_model@version, namespace, similarity_score_at_write, response_payload, source_run_id, freshness_class, expires_at, and policy_tags (e.g. no_pii, faq_tier1).
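
The record layout can be sketched as a frozen dataclass. Field names follow the list above; the types, the embedding field, the is_live helper, and all example values are assumptions added for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Tuple

@dataclass(frozen=True)
class CacheRecord:
    cache_id: str
    embedding_model: str               # "model@version", pinned in config
    namespace: str
    similarity_score_at_write: float
    response_payload: str
    source_run_id: str
    freshness_class: str
    expires_at: datetime
    policy_tags: Tuple[str, ...] = ()
    embedding: Tuple[float, ...] = ()  # assumed: vector stored beside the metadata

    def is_live(self, now: datetime) -> bool:
        """True while the entry has not reached its expiry time."""
        return now < self.expires_at

written_at = datetime(2024, 1, 15, 12, 0, tzinfo=timezone.utc)
record = CacheRecord(
    cache_id="c-001",                  # hypothetical identifiers and values
    embedding_model="text-embedding-3-small@2024-01",
    namespace="acme/support/tier1@v2.3",
    similarity_score_at_write=0.0,
    response_payload="To reset your password, open Settings and choose Security.",
    source_run_id="run-123",
    freshness_class="STATIC_FAQ",
    expires_at=written_at + timedelta(days=7),
    policy_tags=("no_pii", "faq_tier1"),
)
print(record.is_live(written_at + timedelta(days=1)))  # True
print(record.is_live(written_at + timedelta(days=8)))  # False
```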

3. Similarity gate

On lookup, retrieve top-k neighbors (k=3–5). Accept a hit only if:

  • Cosine similarity ≥ threshold (typical 0.92–0.97 for support FAQs; tune per lane)
  • Same namespace (tenant + agent route + template version)
  • Entry not expired and freshness class still valid
  • No do_not_cache tag on the incoming query
  • Optional: second-stage cross-encoder rerank when k candidates tie near threshold
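
# Fragment from inside the request handler; helper names are as used in this guide.
namespace = "support/tier1@v2.3"
lookup = semantic_cache.get(normalized, namespace=namespace)
if lookup.hit:
    # Tag the trace so hits and their similarity scores are auditable.
    trace.set("cache.semantic", "HIT", similarity=lookup.score)
    return adapt_cached_response(lookup.payload, user_context)

# Miss: run the full agent, then write back only if the run is cacheable.
trace.set("cache.semantic", "MISS")
response = run_full_agent(...)
if cacheable(response):
    semantic_cache.put(normalized, response, namespace=namespace)
return response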
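
The acceptance conditions listed above can be sketched as a pure function over the top-k candidates. This is an illustration under assumptions: Candidate and gate are hypothetical names, cosine similarity is computed in plain Python rather than by the vector index, and the cross-encoder rerank is omitted.

```python
import math
from typing import List, NamedTuple, Optional, Sequence, Tuple

class Candidate(NamedTuple):
    cache_id: str
    namespace: str
    embedding: Sequence[float]
    expired: bool

def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def gate(query_vec: Sequence[float], namespace: str, candidates: List[Candidate],
         threshold: float = 0.94, do_not_cache: bool = False
         ) -> Optional[Tuple[Candidate, float]]:
    """Return the best candidate that passes every check, or None on a miss."""
    if do_not_cache:
        return None
    best: Optional[Tuple[Candidate, float]] = None
    for cand in candidates:
        if cand.namespace != namespace or cand.expired:
            continue  # wrong tenant/route/version, or stale entry
        score = cosine(query_vec, cand.embedding)
        if score >= threshold and (best is None or score > best[1]):
            best = (cand, score)
    return best

candidates = [
    Candidate("a", "acme/support/tier1@v2.3", [0.99, 0.10], False),
    Candidate("b", "other/support/tier1@v2.3", [1.0, 0.0], False),  # other tenant
    Candidate("c", "acme/support/tier1@v2.3", [0.6, 0.8], False),   # too dissimilar
]
hit = gate([1.0, 0.0], "acme/support/tier1@v2.3", candidates)
print(hit[0].cache_id if hit else None)  # a
```

Note that candidate "b" is an exact vector match yet is rejected, because the namespace check runs before similarity is even considered.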

4. Response adapter

Cached text is rarely returned verbatim. An adapter merges stored answer skeleton with live slots: user name, order status from a quick tool read, current date. Heavy personalization stays out of the cache body; only stable FAQ prose is cached. If adaptation needs more than one cheap tool, bypass cache and regenerate.
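
One way to picture the adapter is a stored skeleton with named slots filled from live data. The sketch below uses string.Template from the standard library; fill_skeleton, the slot names, and the one-tool cap constant are assumptions for illustration, not Harbor's actual adapter.

```python
from string import Template
from typing import Mapping, Optional

MAX_ADAPTER_TOOL_CALLS = 1  # more than one cheap tool means: regenerate instead

def fill_skeleton(skeleton: str, live_slots: Mapping[str, str],
                  tool_calls_needed: int) -> Optional[str]:
    """Merge a cached skeleton with live slot values; None means bypass the cache."""
    if tool_calls_needed > MAX_ADAPTER_TOOL_CALLS:
        return None
    try:
        return Template(skeleton).substitute(live_slots)
    except (KeyError, ValueError):
        # A missing or malformed slot means the skeleton cannot be served safely.
        return None

skeleton = "Hi $name, your order is currently $order_status."
print(fill_skeleton(skeleton, {"name": "Ana", "order_status": "shipped"}, 1))
# Hi Ana, your order is currently shipped.
print(fill_skeleton(skeleton, {"name": "Ana"}, 1))  # None
```

Returning None rather than a partially filled string keeps the failure mode simple: the caller falls through to full generation.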

Cache key namespaces and versioning

Semantic similarity without namespace isolation causes cross-contamination:

Namespace component | Why it must be in the key | Invalidation trigger
--- | --- | ---
Tenant ID | Same question, different product policies | Tenant offboarding
Agent route | Billing vs technical intents overlap lexically | Route split or rename
Template / policy version | Answer text must match current rules | Template promotion
Embedding model version | Vectors incomparable across models | Model upgrade → rebuild index
Locale / language | Cross-language match may be undesired | Locale pack update

Bump the namespace suffix on promotion (e.g. support/tier1@v2.3 → support/tier1@v2.4) instead of deleting vectors in place. Old namespaces age out via TTL while in-flight runs finish.
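
A namespace built from these components might be composed as follows. The class, the key layout (including the "|" separators for model and locale), and the tenant name are assumptions; only the tenant/route@version prefix mirrors the format used in this guide.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class Namespace:
    tenant: str
    route: str
    template_version: str
    embedding_model: str
    locale: str

    def key(self) -> str:
        """Compose the cache namespace string from all isolating components."""
        return (f"{self.tenant}/{self.route}@{self.template_version}"
                f"|{self.embedding_model}|{self.locale}")

    def promoted(self, new_version: str) -> "Namespace":
        """Return a new namespace for a promoted template; the old one is untouched."""
        return replace(self, template_version=new_version)

old = Namespace("acme", "support/tier1", "v2.3",
                "text-embedding-3-small@2024-01", "en")
new = old.promoted("v2.4")
print(old.key())  # acme/support/tier1@v2.3|text-embedding-3-small@2024-01|en
print(new.key())  # acme/support/tier1@v2.4|text-embedding-3-small@2024-01|en
```

Because promotion produces a different key rather than mutating entries, lookups under the new version start cold while the old entries simply expire.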

Freshness classes and when not to cache

Not every successful run should write to the semantic index:

  • STATIC_FAQ — policy answers stable for days/weeks; long TTL; safe for semantic reuse
  • SEMI_DYNAMIC — prices, inventory, account-specific facts; short TTL (minutes) or adapter-only slots
  • EPHEMERAL — time-bound promos, incident banners; never cache or TTL < 60s
  • PERSONALIZED — contains user PII in the answer body; do not store; lookup-only from anonymized queries
  • TOOL_WRITE — run invoked a mutating tool; block cache write (idempotency dedup handles retries, not semantic reuse)

Classify at write time using route metadata, tool audit log, and output guardrails. A classifier misfire that caches a billing adjustment instruction is worse than skipping cache entirely.
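
The classes above can be sketched as an enum plus a conservative classifier. The input flags, the precedence order, and the concrete TTL values (7 days, 5 minutes) are assumptions chosen to match the ranges described; EPHEMERAL is treated here as never cached.

```python
from datetime import timedelta
from enum import Enum
from typing import Optional

class Freshness(Enum):
    STATIC_FAQ = "STATIC_FAQ"
    SEMI_DYNAMIC = "SEMI_DYNAMIC"
    EPHEMERAL = "EPHEMERAL"
    PERSONALIZED = "PERSONALIZED"
    TOOL_WRITE = "TOOL_WRITE"

# Only these classes may be written; every other class blocks the write.
WRITE_TTL = {
    Freshness.STATIC_FAQ: timedelta(days=7),       # assumed value
    Freshness.SEMI_DYNAMIC: timedelta(minutes=5),  # assumed value
}

def classify(used_mutating_tool: bool, answer_has_pii: bool,
             time_bound: bool, account_specific: bool) -> Freshness:
    """Pick the most restrictive class that applies, checked in risk order."""
    if used_mutating_tool:
        return Freshness.TOOL_WRITE
    if answer_has_pii:
        return Freshness.PERSONALIZED
    if time_bound:
        return Freshness.EPHEMERAL
    if account_specific:
        return Freshness.SEMI_DYNAMIC
    return Freshness.STATIC_FAQ

def write_ttl(freshness: Freshness) -> Optional[timedelta]:
    """TTL for a cache write, or None when the class must not be stored."""
    return WRITE_TTL.get(freshness)

print(write_ttl(classify(False, False, False, False)))  # 7 days, 0:00:00
print(write_ttl(classify(True, False, False, False)))   # None
```

Checking the riskiest conditions first means a misfire tends toward skipping the cache, which matches the guidance that a skipped write is cheaper than a wrong one.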

Write path, quality gates, and negative caching

Cache writes should pass quality gates so bad answers do not amplify:

  1. Run completed without escalation or human override
  2. User did not thumbs-down within a feedback window (if available)
  3. Guardrails passed; no policy violation flags
  4. Response length and structure within expected bounds for the route
  5. Similarity to existing entry: if a neighbor already exists above 0.98, update metadata instead of duplicating vectors

Negative caching (storing “no good answer” markers) is usually avoided in agent systems — a later model or tool fix should retry. Exception: confirmed out-of-scope intents with stable refusal templates.
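
The five write gates can be expressed as a single decision function. The RunOutcome fields and the token bounds are hypothetical; the 0.98 dedup cutoff comes from gate 5 above.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class RunOutcome:
    escalated: bool
    human_override: bool
    thumbs_down: bool
    guardrails_passed: bool
    response_tokens: int
    nearest_similarity: Optional[float]  # best existing neighbor, if any

def write_decision(run: RunOutcome, min_tokens: int = 20, max_tokens: int = 600,
                   dedup_threshold: float = 0.98) -> str:
    """Return 'skip', 'update_metadata', or 'insert' for a finished run."""
    if run.escalated or run.human_override:
        return "skip"                      # gate 1
    if run.thumbs_down:
        return "skip"                      # gate 2
    if not run.guardrails_passed:
        return "skip"                      # gate 3
    if not (min_tokens <= run.response_tokens <= max_tokens):
        return "skip"                      # gate 4 (bounds are assumed values)
    if run.nearest_similarity is not None and run.nearest_similarity > dedup_threshold:
        return "update_metadata"           # gate 5: avoid duplicate vectors
    return "insert"

clean = RunOutcome(False, False, False, True, 180, 0.91)
print(write_decision(clean))  # insert
```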

Harbor Support refactor

Harbor’s before state: only exact Redis keys on normalized user text; prefix cache on system prompt; no embedding index. Paraphrased tier-1 FAQs re-ran retrieval + generation. After state:

  1. Query normalizer with PII redaction and intent lane tag
  2. Per-tenant HNSW index on text-embedding-3-small@2024-01 (pinned)
  3. Namespace = {tenant}/support/tier1@{template_version}
  4. Similarity threshold 0.94 for FAQ lane; 0.97 for billing (stricter)
  5. Response adapter filling order status via one read-only tool
  6. Write gates tied to no-escalation + guardrail pass
  7. Metrics on hit rate, false-hit proxy (re-ask within 5 min), cost per resolution

Results after eight weeks: duplicate full generations 41% → 6.2%, semantic hit rate stabilized at 34% of tier-1 volume, median cost $0.38 → $0.14, p95 latency 4.1s → 2.3s, and wrong-answer escalation rate unchanged (within measurement noise). Template promotions now bump namespace automatically via the template rollout controller.
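
The headline figures can be cross-checked with a few lines of arithmetic. The inputs are the numbers reported above; the derived percentages are computed here and are not figures Harbor reported.

```python
before = {"dup_rate": 0.41, "cost_usd": 0.38, "p95_s": 4.1}
after = {"dup_rate": 0.062, "cost_usd": 0.14, "p95_s": 2.3}

cost_saved = before["cost_usd"] - after["cost_usd"]
cost_saved_pct = 100 * cost_saved / before["cost_usd"]
dup_drop_points = 100 * (before["dup_rate"] - after["dup_rate"])
dup_drop_relative = 100 * (before["dup_rate"] - after["dup_rate"]) / before["dup_rate"]
p95_gain_s = before["p95_s"] - after["p95_s"]

print(f"cost saved per resolution: ${cost_saved:.2f} ({cost_saved_pct:.1f}%)")
# cost saved per resolution: $0.24 (63.2%)
print(f"duplicate generations: -{dup_drop_points:.1f} points ({dup_drop_relative:.1f}% relative)")
# duplicate generations: -34.8 points (84.9% relative)
print(f"p95 latency gain: {p95_gain_s:.1f}s")
# p95 latency gain: 1.8s
```

The 1.8-second p95 gain matches the improvement quoted in the introduction.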

Decision table: semantic cache vs adjacent techniques

Approach | Primary win | When semantic cache is better | When the alternative wins
--- | --- | --- | ---
Exact string cache | Zero false hits; trivial implementation | High paraphrase rate; FAQ/support tiers | Low lexical variance; strict byte identity needed
Prompt prefix cache | Cheaper prefill on static instructions | Skip full completion on similar user questions | Cost is mostly system prompt, not user turn
Tool result cache | Avoid duplicate API/DB fetches | Same question, same answer text after tools | Bottleneck is tool latency, not generation
RAG retrieval | Ground answers in source documents | Reuse prior composed answers with adapter slots | Source docs change often; need citations every time
Smaller / faster model routing | Lower cost per token | Near-duplicate hits avoid any model call | Novel complex reasoning every turn

Production stacks layer all rows: RAG for grounding, tool cache for fetches, prefix cache for instructions, semantic cache for paraphrased tier-1 resolutions, and full generation for novel or high-risk lanes.
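
The response-level part of that layering can be sketched as an ordered fallthrough: exact cache, then semantic cache, then full generation, with high-risk routes skipping both caches. The function names, the route label, and the stub lookups are hypothetical; RAG, tool, and prefix caches sit inside the generation step and are not shown.

```python
from typing import Callable, Optional, Tuple

HIGH_RISK_ROUTES = {"billing_dispute"}  # hypothetical route label

def resolve(query: str, route: str,
            exact_get: Callable[[str], Optional[str]],
            semantic_get: Callable[[str], Optional[str]],
            generate: Callable[[str], str]) -> Tuple[str, str]:
    """Try the cheapest layer first; return (source, response)."""
    if route in HIGH_RISK_ROUTES:
        return "GENERATED", generate(query)   # never reuse on high-risk lanes
    cached = exact_get(query)
    if cached is not None:
        return "EXACT_HIT", cached
    cached = semantic_get(query)
    if cached is not None:
        return "SEMANTIC_HIT", cached
    return "GENERATED", generate(query)

# Stub layers standing in for real caches and the agent.
exact = {"reset password": "Open Settings, then Security."}
semantic = {"forgot login": "Open Settings, then Security."}
generate = lambda q: f"(fresh answer for: {q})"

print(resolve("reset password", "faq", exact.get, semantic.get, generate)[0])  # EXACT_HIT
print(resolve("forgot login", "faq", exact.get, semantic.get, generate)[0])    # SEMANTIC_HIT
print(resolve("forgot login", "billing_dispute", exact.get, semantic.get, generate)[0])  # GENERATED
```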

Common pitfalls

  • Global index without tenant namespaces — Tenant A’s refund policy serves Tenant B.
  • Threshold too low — “cancel order” matches “cancel subscription”; tune per lane.
  • Embedding model drift — Reindex on model change or hits silently degrade.
  • Caching personalized bodies — PII leakage across users; cache anonymized templates only.
  • Ignoring template version in namespace — Stale policy answers after prompt promotion.
  • No adapter for dynamic slots — Cached “your order ships tomorrow” ages badly.
  • Writing cache on failed guardrails — Amplifies bad outputs across paraphrases.
  • Missing observability — Cannot distinguish hit rate from false-hit re-asks without trace tags.

Production checklist

  • Normalize queries (PII redaction, intent lane) before embedding.
  • Pin embedding_model@version; plan reindex on upgrade.
  • Scope namespaces: tenant + route + template version + locale.
  • Set per-lane similarity thresholds; use cross-encoder tie-break near cutoff.
  • Classify freshness (STATIC_FAQ vs SEMI_DYNAMIC vs EPHEMERAL); block writes on TOOL_WRITE runs.
  • Adapt cached responses with live read-only slots; cap adapter tool calls.
  • Gate writes on guardrail pass, no escalation, and optional user feedback.
  • Bump namespace on template promotion; TTL-retire old namespaces.
  • Emit cache.semantic=HIT|MISS, similarity score, and cache_id on traces.
  • Track false-hit proxy: same user re-asks within N minutes after a hit.
  • Integrate hit savings into per-run cost ledgers.
  • Load-test index latency at peak QPS; set p99 lookup SLO.
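
The false-hit proxy from the checklist can be computed from trace events. This sketch assumes each event is a (user_id, timestamp, cache_status) tuple and counts a hit as suspect when the same user's next message arrives within the window; the function name and event shape are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable, Tuple

Event = Tuple[str, datetime, str]  # (user_id, timestamp, "HIT" or "MISS")

def false_hit_proxy_rate(events: Iterable[Event],
                         window: timedelta = timedelta(minutes=5)) -> float:
    """Share of cache hits followed by the same user's next message within window."""
    hits = 0
    reasks = 0
    pending_hit: Dict[str, datetime] = {}  # user -> time of their latest unresolved hit
    for user, ts, status in sorted(events, key=lambda e: e[1]):
        previous_hit = pending_hit.pop(user, None)
        if previous_hit is not None and ts - previous_hit <= window:
            reasks += 1                    # user came back quickly after a hit
        if status == "HIT":
            hits += 1
            pending_hit[user] = ts
    return reasks / hits if hits else 0.0

t0 = datetime(2024, 1, 15, 12, 0)
events = [
    ("u1", t0, "HIT"),
    ("u1", t0 + timedelta(minutes=2), "MISS"),   # re-ask inside the window
    ("u2", t0, "HIT"),                           # no follow-up
    ("u3", t0, "HIT"),
    ("u3", t0 + timedelta(minutes=10), "MISS"),  # follow-up outside the window
]
print(round(false_hit_proxy_rate(events), 3))  # 0.333
```

A re-ask is only a proxy: some users follow up with an unrelated question, so the rate is best read as a trend per lane rather than as an absolute error rate.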

Key takeaways

  • Paraphrases dominate support traffic; exact caches leave most savings on the table.
  • Namespaces must include tenant, route, and template version to prevent cross-policy hits.
  • Freshness classes and write gates matter more than similarity tuning alone.
  • Adapt cached skeletons with live slots instead of serving frozen personalized text.
  • Harbor Support cut duplicate generations from 41% to 6.2% with embedding-indexed semantic reuse.
