Guide

LLM semantic caching explained

“How do I reset my password?” and “I forgot my login, what now?” are the same support ticket phrased differently, yet most chatbots call a frontier model twice and burn tokens both times. Semantic caching stores past question–answer pairs, embeds each new user query into a dense vector, and searches for near-duplicates by cosine similarity. On a hit above your threshold, the cached answer returns without an LLM inference call, so the only cost is one embedding and a vector lookup. On a miss, the model generates a fresh reply and the pair enters the cache for the next visitor. Unlike prompt caching, which reuses identical prefixes at the provider level, semantic caching matches paraphrases and works across sessions and users. It pairs naturally with embedding models and vector stores, and belongs in any LLM cost optimization stack where repetitive FAQ-style traffic dominates. This guide covers the request flow, similarity thresholds and false-positive risk, TTL and invalidation, quality controls, implementation patterns, a Harbor Support FAQ router worked example, an approach decision table, common pitfalls, and a production checklist.

Semantic cache vs prompt cache vs exact match

Three caching layers solve different problems. Confusing them leads to either zero savings or wrong answers served with confidence.

Exact-match cache — hash the full prompt string; return stored output if identical. Fast, zero false positives, useless for paraphrases.
Prompt caching (provider prefix) — the API reuses computed KV state when the static prefix bytes match. Saves prefill cost on repeated system prompts and RAG document blocks; does not skip generation when the user question changes.
Semantic cache — embed the user query (or the full conversational turn), find the nearest neighbor in a vector index, return the associated answer if similarity exceeds a threshold. Catches paraphrases; requires tuning to avoid false hits.

Production systems often stack all three: semantic cache first (a hit skips the LLM call entirely), then prompt caching on the LLM call for misses, with exact-match as a belt-and-suspenders layer for deterministic tool calls.
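For contrast with the semantic layer, the exact-match layer can be sketched in a few lines. This is a minimal illustration rather than a production design: ExactMatchCache and its methods are hypothetical names, an in-process dict stands in for whatever store you use, and the stored answer is placeholder text.

```python
import hashlib


class ExactMatchCache:
    """Hypothetical exact-match layer: keys are SHA-256 hashes of the full prompt."""

    def __init__(self):
        # Assumption: an in-process dict stands in for Redis or a similar store.
        self._store = {}

    @staticmethod
    def _key(prompt: str) -> str:
        return hashlib.sha256(prompt.encode("utf-8")).hexdigest()

    def get(self, prompt: str):
        """Return the stored output, or None when the prompt bytes differ."""
        return self._store.get(self._key(prompt))

    def put(self, prompt: str, response: str) -> None:
        self._store[self._key(prompt)] = response


exact_cache = ExactMatchCache()
exact_cache.put("How do I reset my password?", "placeholder answer")

identical = exact_cache.get("How do I reset my password?")    # same bytes: hit
paraphrase = exact_cache.get("I forgot my login, what now?")  # different bytes: miss
```

The paraphrase misses because the hash changes with any byte of the prompt, which is the gap the semantic layer is meant to cover.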

Request flow: hit, miss, and store

A typical semantic cache sits in front of your LLM gateway:

Normalize the incoming query: lowercase it, trim and collapse whitespace, and optionally replace PII with placeholders.
Embed with a small, fast embedding model (e.g. text-embedding-3-small, BGE, or E5).
Search the vector index for top-k neighbors (k=1–5 is common).
Compare cosine similarity to your threshold (often 0.85–0.95 depending on domain).
On hit — return cached response; log cache_hit metric; optionally re-validate with a lightweight classifier if stakes are high.
On miss — call the LLM; store query embedding + response + metadata (model version, timestamp, TTL).

Latency on a hit is dominated by one embedding call plus a vector lookup — typically tens of milliseconds versus seconds for a full generation. Cost drops from input + output tokens to a fraction of a cent for the embedding alone.
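The hit, miss, and store flow described in this section can be sketched as follows. Everything here is an assumption made for illustration: SemanticCache, toy_embed, answer, and fake_llm are hypothetical names, a linear scan stands in for a vector index, and toy_embed is a bag-of-words stand-in that only measures word overlap. A real embedding model is what makes paraphrases match.

```python
import math
import re
from collections import Counter


def normalize(query: str) -> str:
    """Lowercase and collapse whitespace."""
    return " ".join(query.lower().split())


def toy_embed(text: str) -> Counter:
    """Stand-in for a real embedding model: sparse word-count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


class SemanticCache:
    def __init__(self, embed, threshold: float):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # (vector, response); linear scan, not a real index

    def lookup(self, query: str):
        """Return (response or None, best similarity seen)."""
        vec = self.embed(normalize(query))
        best_score, best_response = 0.0, None
        for stored_vec, response in self.entries:
            score = cosine(vec, stored_vec)
            if score > best_score:
                best_score, best_response = score, response
        if best_score >= self.threshold:
            return best_response, best_score
        return None, best_score

    def store(self, query: str, response: str) -> None:
        self.entries.append((self.embed(normalize(query)), response))


def answer(cache: SemanticCache, query: str, call_llm):
    """Serve from cache on a hit; otherwise call the model and store the pair."""
    cached, _score = cache.lookup(query)
    if cached is not None:
        return cached, "hit"
    response = call_llm(query)
    cache.store(query, response)
    return response, "miss"


llm_calls = []


def fake_llm(query: str) -> str:
    """Hypothetical model call that records how often it is invoked."""
    llm_calls.append(query)
    return f"answer to: {query}"


cache = SemanticCache(toy_embed, threshold=0.8)
first = answer(cache, "How do I reset my password?", fake_llm)
second = answer(cache, "  how do I RESET my password ", fake_llm)
third = answer(cache, "What is your refund policy?", fake_llm)
```

In this run the second query is served from the cache and the model is called only for the first and third.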

Similarity thresholds and false positives

The threshold is the product decision. Too low and “cancel my subscription” hits “upgrade my plan” — a support disaster. Too high and hit rate stays near zero, saving nothing.

Calibrating on real traffic

Export a week of production queries, embed them, and inspect nearest-neighbor pairs manually or with an LLM-as-judge. Plot the similarity scores separately for pairs that humans label equivalent and pairs they label distinct. If equivalent pairs cluster above ~0.90 and distinct pairs fall below ~0.85, the gap between the two groups is your starting band. Domain matters: legal and medical apps need tighter thresholds (0.93+) than marketing FAQ bots (0.88 may suffice).
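One way to turn labeled pairs into a starting threshold is to sweep candidate values and keep the lowest one that still meets a precision target. The sketch below assumes that approach; pick_threshold, the 0.99 target, and the sample scores are all illustrative and not taken from real traffic.

```python
def pick_threshold(labeled, min_precision=0.99):
    """labeled: list of (similarity, is_equivalent) pairs.

    Returns (threshold, precision, recall) for the lowest observed score
    whose precision meets the target, or None if no score qualifies.
    Recall can only fall as the threshold rises, so the lowest qualifying
    threshold keeps the most true matches.
    """
    total_equivalent = sum(1 for _, equivalent in labeled if equivalent)
    for threshold in sorted({score for score, _ in labeled}):
        accepted = [equivalent for score, equivalent in labeled if score >= threshold]
        true_hits = sum(accepted)
        precision = true_hits / len(accepted)
        if precision >= min_precision:
            recall = true_hits / total_equivalent if total_equivalent else 0.0
            return threshold, precision, recall
    return None


# Made-up labels for illustration only.
labeled_pairs = [
    (0.97, True), (0.94, True), (0.92, True), (0.91, True),
    (0.89, False), (0.88, True), (0.86, False), (0.84, False), (0.80, False),
]

calibrated = pick_threshold(labeled_pairs)
print(calibrated)  # (0.91, 1.0, 0.8)
```

On this toy data the sweep settles on 0.91: every accepted pair is a true match, at the price of missing the one equivalent pair scored 0.88.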

Context-aware keys

Cache keys should include more than the raw user string when answers depend on context:

Locale and product tier — “pricing” means different plans for free vs enterprise.
RAG document version — bump cache namespace when your knowledge base updates.
Model and prompt version — policy_v4 answers differ from policy_v3.
Tool state — account-specific queries must never hit another user’s cached answer; partition by user ID or skip caching entirely for personalized paths.
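The context dimensions above can be folded into a namespace string that is applied as a filter before any similarity search. This is a sketch under assumptions: cache_namespace, its parameter names, and the "|" separator are hypothetical choices, not a library API.

```python
def cache_namespace(prompt_version, kb_version, locale, tier, user_id=None):
    """Build a namespace so entries from different contexts never collide.

    Pass user_id only for paths where per-user caching is acceptable;
    otherwise skip caching for personalized queries altogether.
    """
    parts = [prompt_version, kb_version, locale, tier]
    if user_id is not None:
        parts.append(f"user:{user_id}")
    return "|".join(parts)


free_ns = cache_namespace("policy_v4", "kb_2026-06-09", "en-US", "free")
enterprise_ns = cache_namespace("policy_v4", "kb_2026-06-09", "en-US", "enterprise")
old_policy_ns = cache_namespace("policy_v3", "kb_2026-06-09", "en-US", "free")
scoped_ns = cache_namespace("policy_v4", "kb_2026-06-09", "en-US", "free", user_id="u123")

print(enterprise_ns)  # policy_v4|kb_2026-06-09|en-US|enterprise
```

Because the namespace changes whenever any component changes, bumping the knowledge-base or prompt version leaves old entries unreachable even before they are deleted.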

TTL, invalidation, and freshness

Cached LLM answers go stale when policies, prices, or APIs change. Treat semantic cache entries like any other derived data with explicit lifecycle rules.

Time-to-live (TTL): FAQ answers might live 7–30 days and release notes 24 hours; never cache stock prices semantically.
Version tags — prefix cache keys with kb_2026-06-09; bulk-delete on deploy.
Feedback loops — thumbs-down on a cached answer should evict that entry immediately.
Canary re-generation — on hit, occasionally (e.g. 1%) still call the LLM and compare; drift triggers threshold review.

Unlike RAG, which retrieves source documents at query time, a semantic cache returns a pre-generated answer. When upstream facts change, invalidate before users see outdated instructions.
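The lifecycle rules in this section (TTL, version tags, feedback eviction, canary sampling) can be sketched with a few small helpers. The names Entry, is_fresh, evict_version, evict_on_feedback, and should_canary are hypothetical, a plain dict stands in for the cache store, and the second version tag is made up for the example.

```python
import random
import time
from dataclasses import dataclass


@dataclass
class Entry:
    response: str
    created_at: float  # Unix seconds
    ttl_seconds: int
    version_tag: str   # e.g. "kb_2026-06-09"


def is_fresh(entry, now=None):
    """True while the entry is younger than its TTL."""
    now = time.time() if now is None else now
    return (now - entry.created_at) < entry.ttl_seconds


def evict_version(store, version_tag):
    """Bulk-delete every entry carrying version_tag; returns the count removed."""
    doomed = [key for key, entry in store.items() if entry.version_tag == version_tag]
    for key in doomed:
        del store[key]
    return len(doomed)


def evict_on_feedback(store, key):
    """Thumbs-down handler: drop the entry; True if something was removed."""
    return store.pop(key, None) is not None


def should_canary(rate=0.01, rng=random):
    """Select roughly `rate` of hits for a fresh LLM call and comparison."""
    return rng.random() < rate


store = {
    "q1": Entry("Go to Settings > Billing...", 0.0, 604800, "kb_2026-06-09"),
    "q2": Entry("placeholder answer", 0.0, 86400, "kb_2026-05-01"),  # made-up tag
}

fresh_before_expiry = is_fresh(store["q1"], now=604799.0)
fresh_at_expiry = is_fresh(store["q1"], now=604800.0)
```

Passing `now` explicitly keeps the expiry check testable; in production the default wall-clock time would be used, or the store's own expiry feature if it has one.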

Implementation patterns

Managed and open-source libraries

GPTCache (Zilliz) provides embedding, similarity evaluation, and storage backends in one Python package. LangChain and similar frameworks expose semantic cache wrappers, such as LangChain's RedisSemanticCache. For minimal dependencies, pair Redis Stack vector search or pgvector with your own threshold logic.

Storage backends

In-memory — dev and low-traffic; lost on restart.
Redis / Valkey with vector module — sub-millisecond lookups at moderate scale.
pgvector, Milvus, Qdrant, Pinecone — millions of entries, metadata filtering by tenant or product.

What to store per entry

{
  "query_text": "how do i change my billing address",
  "query_embedding": [0.012, -0.034, ...],
  "response_text": "Go to Settings > Billing...",
  "model": "gpt-4o-mini",
  "prompt_version": "support_v12",
  "created_at": "2026-06-09T14:22:00Z",
  "ttl_seconds": 604800,
  "hit_count": 47
}

Store enough metadata to debug wrong hits and to filter searches by namespace before similarity ranking — cheaper than post-filtering thousands of global neighbors.
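To illustrate filtering by namespace before similarity ranking, here is a small in-memory sketch that uses the prompt_version field from the record above as the filter. The search function, the two-dimensional embeddings, and the extra query texts are assumptions made for the example; in a vector database the same idea would typically be a metadata filter applied alongside the nearest-neighbor query.

```python
import math


def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def search(entries, query_vec, prompt_version, k=3):
    """Filter by namespace first, then rank only the survivors."""
    candidates = [e for e in entries if e["prompt_version"] == prompt_version]
    ranked = sorted(
        candidates,
        key=lambda e: cosine(query_vec, e["query_embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Toy 2-D embeddings; real ones have hundreds or thousands of dimensions.
entries = [
    {"query_text": "how do i change my billing address",
     "query_embedding": [1.0, 0.0], "prompt_version": "support_v12"},
    {"query_text": "placeholder query from an older prompt version",
     "query_embedding": [0.9, 0.1], "prompt_version": "support_v11"},
    {"query_text": "placeholder unrelated query",
     "query_embedding": [0.0, 1.0], "prompt_version": "support_v12"},
]

results = search(entries, [0.9, 0.1], "support_v12", k=2)
```

The support_v11 entry is the closest vector overall, but it never reaches the ranking step because the filter removes it first.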

Worked example: Harbor Support FAQ router

Harbor Support handles ~4,000 tickets/week across billing, shipping, and account access. Before semantic caching, 62% of tickets triggered a full GPT-4o call with a 2,400-token RAG context — average cost $0.018 per ticket. After adding a semantic cache layer:

Namespace: harbor_support_v12 keyed by prompt version and locale (en-US).
Embed: text-embedding-3-small on the ticket subject + first customer message (max 512 tokens).
Search: pgvector top-3 within namespace; threshold 0.91 for billing/account, 0.94 for refund policy (higher stakes).
On hit: return cached agent draft; human reviewer still approves before send (on high-risk categories the cache saves the LLM call, not the human QA step).
On miss: RAG + GPT-4o generates reply; store embedding + approved final text after human edit (cache the shipped answer, not the raw model draft).
Invalidate: webhook from docs CMS on publish flushes entries tagged with affected article IDs.
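The per-category thresholds in the list above could be held in a small lookup table. This is a sketch only: the category keys and the is_hit helper are hypothetical, and the rule that a category without a calibrated threshold never hits is an assumption added here, not something the Harbor setup specifies.

```python
# Thresholds from the worked example; the key names are illustrative.
THRESHOLDS = {
    "billing": 0.91,
    "account": 0.91,
    "refund_policy": 0.94,  # higher stakes, tighter threshold
}


def is_hit(category: str, similarity: float) -> bool:
    """Apply the category's threshold to a similarity score."""
    threshold = THRESHOLDS.get(category)
    if threshold is None:
        # Assumption: no calibrated threshold means always go to the LLM.
        return False
    return similarity >= threshold


billing_hit = is_hit("billing", 0.92)       # True
refund_hit = is_hit("refund_policy", 0.92)  # False
```

The same similarity score of 0.92 is a hit for billing and a miss for refund policy, which is the point of keeping thresholds per category.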

Results after four weeks: 41% semantic hit rate, blended cost per ticket down to $0.011 (−39%), p95 latency on hits 180 ms vs 4.2 s on misses. False-positive rate tracked at 0.3% via random human audits — one bad hit per ~330 cached responses, acceptable with human-in-the-loop on refunds.
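The headline percentages can be checked from the figures quoted above. The snippet only re-derives numbers already stated in the text; the variable names are illustrative.

```python
baseline_cost = 0.018        # USD per ticket before the cache (from the text)
blended_cost = 0.011         # USD per ticket after four weeks (from the text)
false_positive_rate = 0.003  # 0.3% from random human audits (from the text)

# Relative cost reduction, as a whole-number percentage.
reduction_pct = round((baseline_cost - blended_cost) / baseline_cost * 100)

# Expected number of cached responses served per bad hit.
hits_per_bad_answer = round(1 / false_positive_rate)

print(reduction_pct)        # 39
print(hits_per_bad_answer)  # 333
```

The second figure is where the "one bad hit per ~330 cached responses" rounding comes from.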

Approach decision table

Scenario	Best approach	Why
Identical API calls with huge static system prompt?	Prompt caching	Provider reuses prefix KV; no vector infra needed.
FAQ / support with paraphrased questions?	Semantic cache	Paraphrase tolerance; high hit rate on repetitive traffic.
Answers must cite live database rows?	No semantic cache	Stale cache serves wrong account balance or inventory.
Creative writing, unique each time?	Skip caching	Near-zero hit rate; added latency on every miss.
Multi-turn chat with evolving context?	Cache last user turn + summary key	Full transcript embedding dilutes similarity; summarize intent first.
Regulated domain (health, legal)?	Semantic cache + human gate	Threshold 0.95+; mandatory review on hit for high-risk intents.
Cost still too high on misses?	Semantic cache + model cascade	Route misses to smaller model; cache both tiers separately.

Common pitfalls

Cross-user leakage — caching “my order status” without user-scoped namespaces returns another customer’s data.
Threshold copied from a blog post — 0.85 works for one dataset and fails on yours; always calibrate locally.
Caching before human approval — storing raw model hallucinations poisons the cache permanently until eviction.
Ignoring embedding model changes: vectors from different embedding models are not comparable, so switching from ada-002 to a newer model means re-embedding the entire cache, or the hit rate collapses.
Global index without metadata filters — searching 10M vectors when 500 belong to this tenant adds latency and wrong-tenant hits.
No hit-rate metrics — without cache_hit_ratio and false_positive_audit you cannot tune thresholds.
Semantic cache as RAG replacement — cache returns old answers; RAG retrieves fresh sources. Use both, not either-or.
Embedding the full RAG context — embed the user question only; document chunks change the vector unpredictably.
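The two metrics named in the pitfalls above, cache_hit_ratio and a false-positive rate from audits, need only a handful of counters. The sketch below assumes in-process counters for simplicity; CacheMetrics and its method names are hypothetical, and a real deployment would more likely emit these to its metrics system.

```python
from dataclasses import dataclass


@dataclass
class CacheMetrics:
    hits: int = 0
    misses: int = 0
    audited_hits: int = 0
    audited_bad: int = 0

    def record(self, hit: bool) -> None:
        """Count one cache lookup."""
        if hit:
            self.hits += 1
        else:
            self.misses += 1

    def record_audit(self, was_correct: bool) -> None:
        """Count one human-audited cache hit."""
        self.audited_hits += 1
        if not was_correct:
            self.audited_bad += 1

    @property
    def cache_hit_ratio(self) -> float:
        total = self.hits + self.misses
        return self.hits / total if total else 0.0

    @property
    def false_positive_rate(self) -> float:
        return self.audited_bad / self.audited_hits if self.audited_hits else 0.0


metrics = CacheMetrics()
for _ in range(41):
    metrics.record(hit=True)
for _ in range(59):
    metrics.record(hit=False)
for verdict in (True, True, True, False):
    metrics.record_audit(was_correct=verdict)
```

Tracking the audit counts separately from the hit counts matters because only a sample of hits is ever reviewed by a human.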

Practitioner checklist

Instrument baseline cost and latency per request before enabling cache.
Choose an embedding model matched to your language and domain.
Define cache namespaces: prompt version, locale, product, and user scope.
Calibrate similarity threshold on labeled query pairs from production logs.
Set TTLs per content type; wire CMS or deploy webhooks to bulk-invalidate.
Store approved final responses, not raw model output, for support workflows.
Log hits, misses, similarity scores, and evictions for weekly threshold review.
Run periodic canary comparisons between cached and fresh LLM answers.
Evict entries on thumbs-down or explicit “outdated” feedback.
Document which query classes are never cached (PII lookups, live balances, medical triage).

Key takeaways

Semantic caching matches paraphrased queries via embedding similarity and returns stored LLM answers without a full inference call.
It complements, rather than replaces, prompt caching (prefix reuse) and RAG (fresh document retrieval).
Threshold tuning and namespace design are product decisions; false positives erode trust faster than cache misses erode budget.
Best ROI on repetitive FAQ, support, and onboarding flows; skip for personalized, time-sensitive, or creative tasks.
Invalidate aggressively when policies change; cache the answers you actually ship, not draft model output.