LLM query expansion for retrieval explained
Harbor Legal’s contract-review assistant kept missing indemnity clauses. Lawyers asked “Who pays if a subcontractor leaks PII?” but the indexed MSAs used dry headings like “Limitation of Liability and Indemnification.” Single-query dense retrieval hit the right section only 54% of the time on their 120-question eval set. After adding multi-query expansion (three LLM-generated rewrites per question) and reciprocal rank fusion across vector and BM25 hits, context recall rose to 81% with under 200 ms extra latency on the retrieval path. The generator did not get smarter — the retriever finally saw the paragraphs lawyers expected.
Query expansion is any technique that transforms a user question into additional search queries or synthetic text before hitting your index. It bridges the vocabulary gap between casual questions and formal documentation, improves recall on paraphrases and abbreviations, and pairs naturally with hybrid search and reranking. This guide covers multi-query generation, HyDE (Hypothetical Document Embeddings), step-back and decomposition prompts, fusion strategies, the Harbor Legal refactor, a technique decision table vs chunking and agentic RAG, pitfalls, and a production checklist — building on our RAG fundamentals explainer.
Why raw user queries under-retrieve
Embedding models map text into a vector space where semantic similarity should correlate with relevance. In practice, retrieval fails when:
- Lexical mismatch — the user says “cancel fee” but the policy says “early termination charge.”
- Implicit context — “Does this cover EU customers?” assumes GDPR without naming it.
- Multi-faceted questions — one embedding averages competing intents (“pricing and SLA for enterprise tier”).
- Short or noisy queries — chat typos, acronyms, or two-word prompts produce unstable vectors.
- Chunk boundary effects — the right answer spans chunks indexed under different headings (see chunking strategies).
Query expansion attacks the query side of the pipeline. Better chunking attacks the index side. Most production systems need both.
Core expansion techniques
Multi-query generation
An LLM rewrites the user question into N paraphrases from different angles (synonyms, formal legal phrasing, keyword-heavy variants). Each variant runs an independent retrieval; results merge via union, weighted score, or reciprocal rank fusion (RRF), which ranks documents by how often they appear high across lists without calibrating incompatible scores.
Typical prompt: “Generate 3 search queries that would find documents answering this question. Use distinct vocabulary.” Keep N between 3 and 5; beyond that, latency and duplicate chunks dominate gains.
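A minimal sketch of the multi-query step, assuming the LLM is available as a plain `llm(prompt) -> str` callable. That callable, the helper names, and the "one query per line" instruction added to the prompt are illustrative assumptions, not part of any specific library.

```python
import re
from typing import Callable, List

# Prompt wording follows the text above; "one query per line" is an added
# assumption so the output is easy to parse.
PROMPT_TEMPLATE = (
    "Generate {n} search queries that would find documents answering this "
    "question. Use distinct vocabulary. Return one query per line.\n\n"
    "Question: {question}"
)

# Matches leading list markers such as "1.", "2)", "-", "*".
LIST_MARKER = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")


def build_expansion_prompt(question: str, n: int = 3) -> str:
    return PROMPT_TEMPLATE.format(n=n, question=question.strip())


def parse_rewrites(raw: str, original: str, n: int = 3) -> List[str]:
    """Return at most n unique rewrites, dropping echoes of the original."""
    seen = {original.strip().lower()}
    rewrites: List[str] = []
    for line in raw.splitlines():
        cleaned = LIST_MARKER.sub("", line).strip()
        key = cleaned.lower()
        if cleaned and key not in seen:
            seen.add(key)
            rewrites.append(cleaned)
    return rewrites[:n]


def expand_query(question: str, llm: Callable[[str], str], n: int = 3) -> List[str]:
    """Original question first, followed by up to n LLM rewrites."""
    raw = llm(build_expansion_prompt(question, n))
    return [question] + parse_rewrites(raw, question, n)
```

Keeping the original question in the returned list means a poor set of rewrites cannot make retrieval worse than the single-query baseline on its own list.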
HyDE — Hypothetical Document Embeddings
Instead of embedding the question, ask the LLM to write a short hypothetical answer paragraph as if it were excerpted from your knowledge base. Embed that synthetic passage and search with it. HyDE shines when questions are abstract but answers are concrete (“What happens if we miss the uptime SLA?” → a fake SLA remedy paragraph matches real remedy text).
HyDE can hallucinate domain facts into the hypothetical doc, pulling irrelevant chunks. Use a small, cheap model, cap output length (80–150 tokens), and instruct “write a passage that might appear in internal docs” without inventing numbers or dates.
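The HyDE flow can be sketched as follows, under the assumption that text generation and embedding are injected as callables and that the index is a small in-memory dict of vectors. The names `generate`, `embed`, `index`, and `hyde_search` are hypothetical; a real system would query its vector store instead of scanning a dict.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

# Prompt wording mirrors the guidance above; the 120-token cap is one value
# inside the suggested 80-150 range.
HYDE_PROMPT = (
    "Write a short passage (under 120 tokens) that might appear in internal "
    "docs and would answer the question below. Do not invent numbers or "
    "dates.\n\nQuestion: {question}"
)


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def hyde_search(
    question: str,
    generate: Callable[[str], str],           # hypothetical LLM wrapper
    embed: Callable[[str], Sequence[float]],  # hypothetical embedding wrapper
    index: Dict[str, Sequence[float]],        # chunk_id -> chunk vector
    top_k: int = 3,
) -> List[Tuple[str, float]]:
    """Embed a synthetic answer passage instead of the question itself."""
    hypothetical = generate(HYDE_PROMPT.format(question=question))
    query_vec = embed(hypothetical)
    scored = [(chunk_id, cosine(query_vec, vec)) for chunk_id, vec in index.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]
```

Only the vector being searched changes relative to plain dense retrieval, so HyDE can be toggled per query type without touching the index.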
Step-back prompting
Generate a broader background question first (“What are the general rules for data processing agreements?”), retrieve on both the specific and step-back queries, then deduplicate. Helps when the user’s question is too narrow to land on the governing section.
Sub-query decomposition
For compound questions, an LLM splits the question into sub-questions, the pipeline retrieves once per sub-question, and the resulting contexts are merged. This overlaps with agentic RAG but stays a fixed pipeline (no tool loop). Good for compare/contrast prompts.
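One way to merge per-sub-question results is a round-robin interleave, so that every sub-question contributes to the final context before any single one dominates. This is a sketch of one possible merge policy, not the only one; `split` and `retrieve` are hypothetical callables standing in for the LLM decomposition call and the retriever.

```python
from typing import Callable, List, Sequence


def merge_round_robin(per_subquery: Sequence[Sequence[str]], limit: int = 8) -> List[str]:
    """Interleave ranked chunk-ID lists, skipping IDs already taken."""
    if limit <= 0:
        return []
    merged: List[str] = []
    seen = set()
    longest = max((len(results) for results in per_subquery), default=0)
    for rank in range(longest):
        for results in per_subquery:
            if rank < len(results) and results[rank] not in seen:
                seen.add(results[rank])
                merged.append(results[rank])
                if len(merged) == limit:
                    return merged
    return merged


def decompose_and_retrieve(
    question: str,
    split: Callable[[str], List[str]],     # hypothetical LLM decomposition call
    retrieve: Callable[[str], List[str]],  # hypothetical retriever: query -> chunk IDs
    limit: int = 8,
) -> List[str]:
    sub_questions = split(question) or [question]  # fall back to the original
    return merge_round_robin([retrieve(sq) for sq in sub_questions], limit)
```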
Query normalization (lightweight)
Spell-check, expand acronyms from a glossary table, inject locale or product SKU from session metadata. Cheap, deterministic, and should run before any LLM expansion.
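A small sketch of the deterministic normalization pass, assuming a hand-maintained glossary and session metadata passed in as a dict. The glossary entries, the `session` keys, and the function name are illustrative; spell-checking is left out because it depends on the dictionary or library chosen.

```python
import re
from typing import Dict, Optional, Tuple

# Hypothetical glossary; in practice this would be loaded from a maintained table.
GLOSSARY: Dict[str, str] = {
    "dpa": "data processing agreement",
    "msa": "master services agreement",
    "sow": "statement of work",
}


def normalize_query(
    query: str,
    glossary: Dict[str, str] = GLOSSARY,
    session: Optional[Dict[str, Optional[str]]] = None,
) -> Tuple[str, Dict[str, str]]:
    """Collapse whitespace, expand known acronyms, and build metadata filters."""
    text = re.sub(r"\s+", " ", query).strip()

    def expand(match):
        token = match.group(0)
        full = glossary.get(token.lower())
        # Keep the acronym and append the long form so both keyword and
        # dense retrieval can match either spelling.
        return f"{token} ({full})" if full else token

    text = re.sub(r"[A-Za-z][A-Za-z0-9]*", expand, text)
    # Only pass through session fields that are actually set.
    filters = {key: value for key, value in (session or {}).items() if value}
    return text, filters
```

Returning the filters separately from the text keeps locale or SKU values out of the embedded query and lets the retriever apply them as hard constraints.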
Pipeline placement and fusion
A practical retrieval stack after expansion:
- Normalize query (glossary, metadata filters).
- Expand via multi-query and/or HyDE (parallel LLM calls).
- Retrieve each variant with dense + sparse (hybrid search).
- Fuse with RRF across all query variants and search modes.
- Rerank top 30–50 fused hits with a cross-encoder (reranking).
- Pass top-k chunks to the generator with citations.
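RRF formula (per document d): $\mathrm{score}(d) = \sum_i \frac{1}{k + \mathrm{rank}_i(d)}$, where $\mathrm{rank}_i(d)$ is the 1-based position of d in ranked list i and k ≈ 60. A document missing from a list contributes nothing for that list. RRF is robust when BM25 scores and cosine similarity are not comparable, because it uses only ranks. Deduplicate chunks by content hash or parent document ID before reranking to avoid the same paragraph filling all k slots.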
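The fusion and deduplication steps can be sketched in a few lines. This assumes each retriever returns an ordered list of chunk IDs and that a `parent_of` mapping from chunk ID to parent document ID is available; both names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def reciprocal_rank_fusion(
    ranked_lists: Sequence[Sequence[str]], k: int = 60
) -> List[Tuple[str, float]]:
    """Sum 1 / (k + rank) over every list a chunk appears in (rank is 1-based)."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        counted = set()
        for rank, chunk_id in enumerate(ranking, start=1):
            if chunk_id in counted:
                continue  # count only the best position within one list
            counted.add(chunk_id)
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by ID so the order is deterministic.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


def dedupe_by_parent(chunk_ids: Sequence[str], parent_of: Dict[str, str]) -> List[str]:
    """Keep the best-ranked chunk per parent document, preserving order."""
    kept: List[str] = []
    seen_parents = set()
    for chunk_id in chunk_ids:
        parent = parent_of.get(chunk_id, chunk_id)
        if parent not in seen_parents:
            seen_parents.add(parent)
            kept.append(chunk_id)
    return kept
```

For example, with a dense list `["c1", "c2", "c3"]` and a BM25 list `["c3", "c1", "c4"]`, `c1` scores 1/61 + 1/62 and `c3` scores 1/63 + 1/61, so `c1` ranks first even though neither raw score scale was consulted.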
Worked example: Harbor Legal contract search
Harbor Legal indexes 18,000 MSA and SOW PDFs chunked at 600 tokens with heading breadcrumbs. Baseline: single-query dense retrieval (text-embedding-3-large) + BM25, top-8 chunks, no reranker. Context recall@8 = 54% on lawyer-authored eval.
Changes shipped:
- Multi-query: GPT-4o-mini generates 3 rewrites (temperature 0.3, max 40 tokens each).
- HyDE disabled for numeric/compliance questions (eval showed 7% recall regression).
- RRF across 4 query strings (original + 3 rewrites) × 2 search modes = 8 ranked lists.
- Cross-encoder reranker (bge-reranker-large) on fused top 40 → final top 8.
- Metadata filter: doc_type and effective_date from the UI.
Results: context recall@8 = 81%; p50 retrieval latency +180 ms (parallel expansion calls); answer faithfulness on grounded eval +12 points. Monthly LLM expansion cost ~$14 at their query volume vs ~$2,400/hour saved in associate search time (internal estimate).
What did not help: HyDE on “What is the liability cap in Exhibit B?” — hypothetical paragraphs invented dollar amounts that retrieved wrong exhibits. Step-back helped only on 9% of queries (governance overview questions).
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Multi-query paraphrase | Vocabulary mismatch, informal user language, multilingual users | Queries already match index terminology (SKU lookup) |
| HyDE | Abstract questions, conceptual docs, sparse indexes | Numeric precision, citations to exact clauses, regulated facts |
| Step-back | Narrow questions needing parent section context | FAQ-style atomic facts |
| Sub-query decomposition | Multi-part compare/contrast questions | Simple single-intent lookups |
| Hybrid search only | Keyword-heavy corpora, SKUs, error codes | Heavy paraphrase gap remains after BM25+dense |
| Agentic multi-hop RAG | Reasoning across many documents, dynamic tool use | Latency-sensitive chat; eval shows single-hop suffices |
| Bigger chunks / parent-child | Recall fails despite good query match (boundary issue) | Precision already low (noise in context window) |
Common pitfalls
- Expansion drift — rewrites introduce new entities not in the user question, retrieving off-topic docs. Constrain prompts: “do not add facts not implied by the question.”
- Latency stacking — serial LLM calls before retrieval kill UX. Run expansion calls in parallel; cache rewrites for repeated FAQ strings.
- Duplicate chunks in context: fusion without dedup wastes the generator’s token budget. Collapse by parent_id or a max-similarity threshold.
- Skipping eval on expansion: aggregate recall can rise while precision on edge cases falls. Measure recall and faithfulness per query type.
- HyDE on structured data — tables, APIs, and JSON logs need keyword or metadata retrieval, not synthetic prose.
- No reranker after fusion — RRF broadens recall but top ranks may still be noisy; a cross-encoder on 30–50 candidates is high ROI.
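The latency-stacking advice (parallel calls plus caching) can be sketched with `asyncio`. This assumes each rewrite comes from an async callable wrapping an LLM request; `rewrite_fns`, the cache layout, and the 300-second TTL are illustrative choices rather than recommended values.

```python
import asyncio
import time
from typing import Awaitable, Callable, Dict, List, Sequence, Tuple

# cache key -> (expiry timestamp from time.monotonic(), cached rewrites)
Cache = Dict[str, Tuple[float, List[str]]]


async def expand_parallel(
    question: str,
    rewrite_fns: Sequence[Callable[[str], Awaitable[str]]],  # hypothetical LLM wrappers
    cache: Cache,
    ttl_seconds: float = 300.0,
) -> List[str]:
    """Run all rewrite calls concurrently; reuse cached rewrites within the TTL."""
    key = " ".join(question.lower().split())  # case- and whitespace-insensitive
    hit = cache.get(key)
    if hit is not None and hit[0] > time.monotonic():
        return [question] + hit[1]

    # gather() issues every call at once, so added latency is roughly the
    # slowest call rather than the sum of all calls.
    results = await asyncio.gather(
        *(fn(question) for fn in rewrite_fns), return_exceptions=True
    )
    # A failed or empty rewrite is dropped instead of failing the request.
    rewrites = [r.strip() for r in results if isinstance(r, str) and r.strip()]
    cache[key] = (time.monotonic() + ttl_seconds, rewrites)
    return [question] + rewrites
```

Because failed calls are filtered out, the worst case degrades to the original single query rather than an error on the retrieval path.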
Production checklist
- Build a golden eval set with labeled relevant chunk IDs per question.
- Measure baseline single-query recall@k before adding expansion.
- Implement glossary-based normalization before LLM expansion.
- Start with multi-query (N=3); A/B HyDE on a held-out slice only.
- Parallelize expansion LLM calls; set tight max tokens and low temperature.
- Fuse with RRF; deduplicate chunks before reranking.
- Add cross-encoder reranker if not already present.
- Log each rewrite and retrieved IDs for debugging misfires.
- Cache expansion outputs for identical queries within TTL.
- Re-eval when embedding model or index schema changes.
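The first two checklist items and the per-query-type advice can be sketched as a small evaluation helper. It assumes recall@k is defined as the fraction of labeled relevant chunk IDs that appear in the top k retrieved IDs, and that each eval example carries a `type` label; both the definition and the field names are assumptions for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Set


def recall_at_k(retrieved: Sequence[str], relevant: Set[str], k: int = 8) -> float:
    """Fraction of labeled relevant chunk IDs found in the top k results."""
    if not relevant:
        raise ValueError("each eval question needs at least one labeled relevant chunk")
    return len(relevant & set(retrieved[:k])) / len(relevant)


def recall_by_query_type(examples: List[dict], k: int = 8) -> Dict[str, float]:
    """Mean recall@k per query type.

    Each example is a dict with keys "type", "retrieved", and "relevant".
    """
    buckets: Dict[str, List[float]] = defaultdict(list)
    for example in examples:
        buckets[example["type"]].append(
            recall_at_k(example["retrieved"], example["relevant"], k)
        )
    return {qtype: sum(values) / len(values) for qtype, values in buckets.items()}
```

Running the same helper on the baseline and on each expansion variant gives a like-for-like comparison, and the per-type breakdown surfaces cases where an aggregate gain hides a regression on numeric or SKU lookups.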
Key takeaways
- Query expansion fixes vocabulary and intent gaps on the search side of RAG — not the generator.
- Multi-query + RRF is the safest first upgrade; HyDE helps conceptual docs but hurts precision-heavy tasks.
- Harbor Legal raised context recall from 54% to 81% with three paraphrases and fusion, not a larger model.
- Always pair expansion with deduplication and reranking — raw union lists flood the context window.
- Measure per query type; expansion that helps legal prose can hurt SKU or numeric lookup.
Related reading
- RAG retrieval-augmented generation explained — end-to-end retrieve-then-generate architecture
- Hybrid search explained — combining dense vectors with BM25 keyword retrieval
- LLM reranking explained — cross-encoders that reorder fused candidate lists
- RAG chunking strategies explained — index-side recall when queries are already well formed