Guide

LLM HyDE explained

Harbor Support's internal knowledge base held 4,200 runbooks written in complete sentences — “To reset a stuck SAML assertion, open the admin console under Identity > Providers…” — but agents searched with terse tickets: “SSO broken after cert rotate.” Embedding the raw question against passage vectors missed the right doc 42% of the time on held-out tickets. Keyword BM25 helped acronyms but ignored paraphrase. Engineers added HyDE (Hypothetical Document Embeddings): a small LLM drafts a short passage that looks like the answer the KB should contain; that hypothetical text is embedded and used as the dense retrieval query. Recall@10 on acronym-heavy tickets rose from 58% to 81%; wrong escalations fell 31% to 12% before any reranker ran.

HyDE is a zero-shot dense retrieval trick from Gao et al. (2022): instead of encoding the user question, encode an LLM-generated document that answers it. The hypothetical passage lives in the same embedding space as indexed chunks, shrinking the query–document vocabulary gap. This guide covers prompt templates, single vs multi-HyDE, pairing with reciprocal rank fusion and cross-encoders, latency and hallucination guardrails, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why questions and documents embed differently

Bi-encoder retrieval models (e.g. E5, BGE, text-embedding-3) are trained so that a question embedding should be near a passage embedding that answers it. In practice, indexed corpora are overwhelmingly declarative prose while live queries are telegraphic: missing articles, domain slang, half-typed error codes. Cosine similarity between “VPN drops on sleep” and a three-paragraph troubleshooting guide can lose to a tangentially related FAQ that happens to share token overlap.

Classical fixes include query expansion (paraphrase the question), hybrid BM25+dense fusion, and cross-encoder reranking. HyDE attacks the mismatch from the query side by transforming the question into answer-shaped text before embedding — without fine-tuning the retriever on your domain.

The HyDE pipeline

A minimal HyDE retrieval pass has four steps:

Hypothesis generation — prompt an LLM: “Write a short passage that would answer: {question}. Use the tone and structure of our documentation. Do not cite sources; invent plausible procedural steps.” Temperature moderate (0.5–0.8) for lexical diversity across retries.
Embedding — encode the hypothetical passage with the same bi-encoder used for corpus chunks (same model version and prefix instructions, e.g. passage: vs query: if the model expects them).
Vector search — ANN lookup against the chunk index; return top-k (often 20–50 before fusion or rerank).
Downstream use — feed retrieved chunks to a generator, or merge HyDE dense hits with BM25 via RRF before a cross-encoder rerank.

The hypothetical document is never shown to end users and must not be treated as ground truth. It is a search probe only. Factual errors in the hypothesis are acceptable if they steer retrieval toward the right real passage — though egregious hallucinations can pull wrong neighbors (see pitfalls).

Prompt design and multi-HyDE

Single-hypothesis template

System: You write internal IT runbook excerpts. 80–120 words.
User: Question: "{user_query}"
Write a runbook paragraph that would answer this question. Include likely
product names and menu paths. Do not say you are guessing.

Anchor the style to your corpus: legal memos, API reference paragraphs, medical patient handouts. Mismatched tone still embeds, but domain-shaped hypotheses retrieve better.

Multi-HyDE

Sample n hypotheses (different temperatures or explicit “variant 2” prompts), embed each, run n ANN queries, merge ranks with RRF. Diversity captures multiple plausible answer framings — useful when tickets are ambiguous (“login fails” could be SSO, MFA, or captcha). Cost scales linearly with n; Harbor Support uses n=3 on tier-2 tickets only.

When not to use HyDE alone

Exact ID lookup — error codes, CVE numbers, SKU strings: BM25 or metadata filters win; HyDE may paraphrase away the token.
Very short queries — one-word searches may produce wildly speculative hypotheses; combine with keyword retrieval.
Fresh news — the LLM cannot hypothesize facts absent from training; pair with freshness-aware indexes or web search tools.

Pairing HyDE with hybrid retrieval

Production systems rarely rely on HyDE dense search alone. A robust stack:

List A — BM25 on raw query (preserves exact tokens).
List B — dense ANN on raw query embedding (baseline).
List C — dense ANN on HyDE hypothesis embedding.
Fusion — RRF across A+B+C with equal or HyDE-downweighted constant (some teams use k=60 for BM25, same for HyDE lists).
Rerank — cross-encoder on top 30 fused hits using the original user question, not the hypothesis (reranker sees true intent).

This pattern preserves acronym recall from BM25, paraphrase recall from HyDE dense, and precision from the reranker. Latency budget: hypothesis generation (~200–400 ms on a small model) + one extra embedding forward pass + fused ANN (often cached in the same vector DB).

Harbor Support refactor

Baseline RAG: 768-dim bi-encoder, 512-token chunks, BM25+dense RRF, no HyDE. Pain on 200 held-out tickets where agents used abbreviations (SSO, MFA, Okta, SCIM) or symptom-only phrasing.

HyDE rollout:

Generator — 8B instruct model on CPU batch; 100-token cap on hypotheses; runbook tone few-shot in system prompt.
Routing — skip HyDE when query matches ^[A-Z]{2,5}-\d+ error-code pattern (BM25-only fast path).
Triple-list RRF — BM25 + raw dense + HyDE dense; cross-encoder rerank top 25.
Guardrail — if top HyDE hit score < 0.72 cosine and BM25 list empty, fall back to agentic query rewrite instead of trusting a weak hypothesis.

Metrics (same reranker and generator as baseline):

Recall@10 (human-labeled relevant doc in top 10): 58% → 81% on acronym/symptom subset; 74% → 76% on already-easy literal queries.
Wrong escalation rate (tier-2 → engineering): 31% → 12%.
p95 retrieval latency: 180 ms → 340 ms (acceptable for internal support; customer chatbot uses HyDE only after first miss).
Hypothesis generation cost: ~$0.0003 per ticket at batch pricing.

Technique decision table

Approach	Strengths	Weaknesses	Best when
HyDE	Closes query–doc style gap; zero-shot; no retriever fine-tune	Extra LLM + embed latency; hypothesis can mislead	Declarative KB; short informal queries; bi-encoder retrieval
Multi-query expansion	Paraphrase diversity; cheaper than full passages	Still question-shaped embeddings	Moderate mismatch; budget-sensitive
Raw dense only	Lowest latency	Weak on jargon vs prose gap	Queries already match doc style (e.g. developer docs)
BM25 / hybrid	Exact token match; acronyms	Synonym and paraphrase blind spots	Always combine with dense or HyDE in enterprise KB
Cross-encoder rerank only	High precision on candidates	Cannot fix recall if true doc never in pool	After HyDE or hybrid widens the candidate pool
Retriever fine-tuning	Domain-optimal embeddings	Needs labeled pairs; maintenance	Stable corpus + abundant click/logs data

Common pitfalls

Treating the hypothesis as context — injecting it into the generator invites fabricated steps; use only for retrieval.
Wrong embedding prefix — encoding a hypothesis with query: when the model expects passage: blunts gains.
Overlong hypotheses — 500-token fantasies dilute the embedding vector; cap length to your chunk size.
Confident wrong steering — a plausible but incorrect hypothesis can rank a related-but-wrong doc highly; keep BM25 and rerankers.
Skipping evaluation on easy queries — HyDE can slightly hurt literal matches; route by query shape.
Same model for gen and embed — unnecessary; use a cheap fast generator and a dedicated embedding model.
No cache — repeat tickets in chat sessions should cache hypothesis + ANN results per session ID.

Production checklist

Define hypothesis prompt with corpus tone few-shots and strict length cap.
Route error-code and SKU-shaped queries to BM25-first paths.
Embed hypotheses with the same bi-encoder and prefix as indexed chunks.
Fuse HyDE dense list with BM25 and raw dense via RRF before reranking.
Rerank with cross-encoder on the original user question, not the hypothesis.
Never pass hypothetical text to the answer generator as trusted context.
Log hypothesis, retrieved IDs, and scores for offline replay and safety review.
A/B recall@k and escalation rate against baseline hybrid search.
Cache hypotheses per conversation turn; cap multi-HyDE to high-ambiguity tickets.
Monitor latency p95 and generator failure rate; fall back to raw dense on timeout.

Key takeaways

HyDE embeds an LLM-written answer passage instead of the raw question — closing the style gap between terse queries and declarative knowledge bases.
Hypothetical text is a retrieval probe only; factual errors are tolerable for search but must not reach users as sourced content.
Harbor Support triple-list RRF (BM25 + raw dense + HyDE) lifted recall@10 from 58% to 81% on acronym-heavy tickets.
Pair HyDE with BM25 for exact tokens and cross-encoder reranking for precision — never rely on HyDE dense alone.
Route by query shape, cap hypothesis length, and cache per session to control cost and latency.