Guide

LLM contextual retrieval explained

Harbor Support indexed 2.1 million help-desk chunks with standard fixed-size splitting and BGE-large embeddings. On a 600-question eval set drawn from real tickets, recall@10 sat at 61% — agents kept answering from the wrong policy section because chunks like “Section 4.2 applies only when the account tier is Enterprise” lost the document title and surrounding headings at embed time. Swapping to a larger embedding model barely moved the needle (+2.4 points). The team instead adopted contextual retrieval: one cheap LLM call per chunk at index time to prepend situating context, plus a BM25 hybrid search layer on the enriched text. Recall@10 rose to 84%; end-to-end answer accuracy on human-graded tickets climbed from 71% to 79% without changing the generation model.

Contextual retrieval is an index-time enrichment pattern for RAG pipelines. Instead of embedding naked chunk text, you ask a model to write a short preamble that explains where the chunk lives inside the parent document — product name, section, date range, audience — then embed the combined string. The same enriched text feeds a lexical BM25 index, so keyword-heavy queries (SKU codes, error strings, regulation numbers) recover chunks that pure vector search misses. This guide covers the failure mode contextual retrieval fixes, the enrichment prompt pattern, hybrid fusion, cost and latency tradeoffs, a Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why isolated chunks fail retrieval

Most RAG systems split documents into passages of 256–512 tokens, embed each passage, and retrieve by cosine similarity to the query vector. That works when chunks are self-contained FAQs. It breaks on:

Cross-references — “See Appendix B” or “as defined in Section 1” carry no meaning without the parent doc.
Enumerated lists — item 7 of 12 looks identical to item 7 of a different list unless the list title is embedded with it.
Tables and appendices — a row of numbers embeds near unrelated numeric rows from other products.
Versioned policies — two chunks with nearly identical wording but different effective dates collapse in vector space.
Pronouns and shorthand — “This plan” or “the above exception” need upstream context the chunk never stores.

Parent-child indexes, sentence-window retrieval, and query expansion each patch part of this problem. Contextual retrieval attacks the root cause at index time: make every chunk locatable before it enters the vector and lexical indexes. Anthropic’s published benchmarks reported up to 49% fewer failed retrievals when contextual embeddings were combined with BM25 hybrid search. Your mileage depends on corpus structure, but the pattern is especially strong on long policy PDFs, API references, and legal contracts.

Contextual embeddings: the index-time enrichment loop

At ingestion, for each chunk c you pass the full parent document D (or a bounded prefix if D exceeds the model window) plus c to a prompt along these lines:

Here is a document and a chunk from it. Write a short situating preamble (50–100 tokens) that explains what document this is, which section the chunk belongs to, and any critical qualifiers (dates, tiers, regions). Output only the preamble — do not repeat the chunk.

The model returns a context block ctx. Your stored retrieval unit becomes ctx + "\n\n" + c. That string is what you:

Embed with your bi-encoder (same model as before — no retraining required).
Tokenize into the BM25 / inverted index (with the same analyzer you use for queries).
Persist in object storage alongside the raw chunk for citation display.
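
The enrichment step above can be sketched in a few lines. This is a minimal illustration, not a definitive implementation: the XML-style tags in the prompt, the EnrichedChunk record, and the complete callable (a stand-in for whatever LLM client you use) are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Callable

# Prompt wording follows the guide; the <document>/<chunk> tags are an assumption.
PROMPT_TEMPLATE = (
    "Here is a document and a chunk from it.\n\n"
    "<document>\n{document}\n</document>\n\n"
    "<chunk>\n{chunk}\n</chunk>\n\n"
    "Write a short situating preamble (50-100 tokens) that explains what "
    "document this is, which section the chunk belongs to, and any critical "
    "qualifiers (dates, tiers, regions). Output only the preamble; do not "
    "repeat the chunk."
)


@dataclass(frozen=True)
class EnrichedChunk:
    raw: str       # original chunk text, kept for the generator and citations
    ctx: str       # LLM-written situating preamble
    enriched: str  # ctx + blank line + raw; the string that is embedded and BM25-indexed


def build_prompt(document: str, chunk: str) -> str:
    return PROMPT_TEMPLATE.format(document=document, chunk=chunk)


def enrich_chunk(document: str, chunk: str,
                 complete: Callable[[str], str]) -> EnrichedChunk:
    """`complete` is a hypothetical callable: prompt in, model text out."""
    ctx = complete(build_prompt(document, chunk)).strip()
    return EnrichedChunk(raw=chunk, ctx=ctx, enriched=f"{ctx}\n\n{chunk}")
```

Passing the model call in as a function keeps the sketch independent of any particular SDK and makes it easy to substitute a stub in tests.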

At query time nothing changes for the user: you embed the query, run hybrid retrieval, optionally rerank, and pass top chunks to the generator. The generator should receive the original chunk text, not the enriched string, to avoid duplicating context tokens in the prompt. Store the raw chunk and the context block as separate metadata fields so both remain available.
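
A small sketch of that hand-off, assuming retrieval has already produced a ranked list of chunk IDs. The store layout and the field names raw, ctx, and source are hypothetical, chosen only to show that the generator prompt is built from the raw text.

```python
def generator_context(hit_ids: list[str],
                      store: dict[str, dict[str, str]],
                      top_n: int = 5) -> str:
    """Build the passage block for the generator from raw chunk text only."""
    blocks = []
    for i, chunk_id in enumerate(hit_ids[:top_n], start=1):
        record = store[chunk_id]
        # record["ctx"] is deliberately not used: it served retrieval, not generation.
        blocks.append(f"[{i}] (source: {record['source']})\n{record['raw']}")
    return "\n\n".join(blocks)
```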

Model choice for enrichment: a fast, cheap instruction model (Haiku-class or a local 7–8B) is sufficient. Enrichment is not reasoning; it is summarization of document structure. Use temperature 0, cap output at 120 tokens, and add a JSON schema wrapper if you batch via an async pipeline. Cache enrichment by hash(document_id + document_revision + chunk_offset + prompt_version) so re-index jobs skip unchanged chunks and edited documents never reuse stale context.
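
One way to build such a cache key, as a sketch. The function name and the document_revision argument are assumptions for illustration; the revision is included so that an edited document does not hit a stale entry.

```python
import hashlib


def enrichment_cache_key(document_id: str, document_revision: str,
                         chunk_offset: int, prompt_version: str) -> str:
    # Join with a separator that should not occur in IDs, so that
    # ("ab", "c") and ("a", "bc") cannot collapse into the same payload.
    payload = "\x1f".join(
        [document_id, document_revision, str(chunk_offset), prompt_version]
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

Plain string concatenation without a separator would also work most of the time, but it lets distinct field combinations hash to the same key.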

BM25 hybrid: why enrichment alone is not enough

Contextual embeddings improve semantic recall on ambiguous paraphrases. They do not help when the user pastes an exact error code, ticket ID prefix, or statutory citation that never appeared in the situating preamble. BM25 hybrid search closes that gap:

Dense channel — cosine similarity on embeddings of ctx + chunk.
Sparse channel — BM25 on the same enriched string (or on chunk + metadata fields like product_sku).
Fusion — reciprocal rank fusion (RRF) or weighted linear combination after per-channel min-max normalization.
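
Reciprocal rank fusion needs only the ranked ID lists from each channel. The sketch below uses the standard RRF formula, score = sum over channels of 1 / (k + rank); the function name and the tie-breaking by chunk ID are assumptions added for determinism.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked lists of chunk IDs (best first) with reciprocal rank fusion."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by chunk ID so output is stable.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


dense_hits = ["a", "b", "c"]   # from the vector index
sparse_hits = ["b", "d", "a"]  # from BM25
fused_ids = [chunk_id for chunk_id, _ in rrf_fuse([dense_hits, sparse_hits])]
# fused_ids == ["b", "a", "d", "c"]
```

Because RRF works on ranks rather than raw scores, it needs no normalization across channels.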

Typical starting weights: 0.65 dense / 0.35 sparse for prose-heavy corpora; flip toward 0.45 / 0.55 for API docs and log-message KBs. Re-tune on your labeled query set — contextual retrieval shifts the dense channel more than the sparse one, so old hybrid weights from pre-enrichment indexes are usually wrong.
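
For the weighted alternative, a minimal sketch under the assumption that each channel returns a dict of chunk ID to raw score. The function names are hypothetical, and the defaults mirror the 0.65 / 0.35 starting point; a chunk missing from one channel is treated as scoring 0 there.

```python
def min_max(scores: dict[str, float]) -> dict[str, float]:
    """Rescale one channel's scores to [0, 1]."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All candidates tied; treat them as equally strong.
        return {chunk_id: 1.0 for chunk_id in scores}
    return {chunk_id: (s - lo) / (hi - lo) for chunk_id, s in scores.items()}


def weighted_fuse(dense: dict[str, float], sparse: dict[str, float],
                  w_dense: float = 0.65,
                  w_sparse: float = 0.35) -> list[tuple[str, float]]:
    d, s = min_max(dense), min_max(sparse)
    fused = {
        chunk_id: w_dense * d.get(chunk_id, 0.0) + w_sparse * s.get(chunk_id, 0.0)
        for chunk_id in d.keys() | s.keys()
    }
    return sorted(fused.items(), key=lambda kv: (-kv[1], kv[0]))
```

Min-max normalization matters here because cosine similarities and BM25 scores live on different scales; without it the weights would not mean what they appear to mean.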

Enrichment also gives BM25 more lexical hooks: the preamble injects product names, section numbers, and effective dates that the raw chunk omitted. That is why the Anthropic recipe pairs both changes; either alone underperforms the combination on their public evals.

Cost, latency, and when to re-enrich

Contextual retrieval trades index-time compute for query-time recall. Rough math for Harbor Support:

2.1M chunks × (~800 input tokens on average, with the doc prefix capped at 8k, + ~80 output tokens) ≈ 1.85B tokens one-time at ingest.
At $0.25 / M input + $1.25 / M output (Haiku-tier pricing), that is about $420 for input plus $210 for output, so a full re-index costs ≈ $630, cheaper than one week of mis-routed escalations.
Incremental updates: only new or edited documents pay enrichment cost; chunk splits on unchanged text reuse cached ctx.
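
The arithmetic can be checked with a few lines. This is a rough estimator using the per-chunk token counts and prices quoted above; the function name is an assumption, and real costs will vary with document length and provider pricing.

```python
def enrichment_cost(chunks: int, input_tokens_per_chunk: int,
                    output_tokens_per_chunk: int,
                    input_price_per_m: float,
                    output_price_per_m: float) -> tuple[int, float]:
    """Return (total tokens, total cost in dollars) for one full enrichment pass."""
    input_tokens = chunks * input_tokens_per_chunk
    output_tokens = chunks * output_tokens_per_chunk
    cost = (input_tokens / 1_000_000 * input_price_per_m
            + output_tokens / 1_000_000 * output_price_per_m)
    return input_tokens + output_tokens, cost


total_tokens, dollars = enrichment_cost(2_100_000, 800, 80, 0.25, 1.25)
print(f"{total_tokens:,} tokens, about ${dollars:,.0f}")
# prints "1,848,000,000 tokens, about $630"
```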

Query latency is unchanged if enrichment stays offline. Avoid query-time contextualization (generating ctx live per request) unless documents are single-chunk and tiny: it puts an extra LLM call on the request path, which can roughly double TTFT, and it adds failure modes.

Re-enrich when the prompt template version bumps or the parent document structure changes (new headings). When you migrate embedding models, the cached ctx is still valid: re-embed the enriched string, not the raw chunk. Record the prompt version (for example enrichment_prompt_v3) in index metadata so incremental index updates can detect stale chunks.
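
A sketch of how an incremental indexer might apply these rules per chunk. The IndexMeta fields and the three action labels are hypothetical, and a changed document revision is used here as a proxy for "structure changed".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class IndexMeta:
    prompt_version: str   # e.g. "enrichment_prompt_v3"
    doc_revision: str     # revision hash of the parent document
    embedding_model: str  # identifier of the bi-encoder used


def refresh_action(stored: IndexMeta, current: IndexMeta) -> str:
    """Decide what an incremental index job should do for one chunk."""
    if (stored.prompt_version != current.prompt_version
            or stored.doc_revision != current.doc_revision):
        # New ctx is needed; the enriched string must then be re-embedded too.
        return "re-enrich"
    if stored.embedding_model != current.embedding_model:
        # Cached ctx is still valid; only the vector is stale.
        return "re-embed"
    return "skip"
```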

Harbor Support knowledge base refactor

Harbor Support’s corpus mixed Zendesk articles, Confluence exports, and PDF runbooks. Fixed 400-token chunks with 50-token overlap produced 2.1M rows. The refactor ran in five stages:

Baseline eval — 600 ticket-derived questions with human-labeled gold chunk IDs; measured recall@5, recall@10, MRR.
Enrichment batch — async JSONL through Claude Haiku; parent doc truncated to first 8k tokens + chunk; outputs cached in S3.
Dual index rebuild — HNSW on enriched embeddings; Elasticsearch BM25 on ctx + chunk with same ICU analyzer as before.
Hybrid tune — grid search on 80/20 query split; RRF with k=60 beat weighted sum on their skewed tail queries.
Generator unchanged — prompts still received raw chunk text + citation metadata; only retrieval moved.

Results on the held-out 120 queries: recall@10 61.2% → 84.1%; MRR 0.48 → 0.63. End-to-end answer accuracy (human graders, 200 tickets) 71% → 79%. Index storage grew 11% from longer embedded strings; one-time enrichment cost $640. They kept semantic chunking for new Confluence imports but dropped parent-child dual retrieval — contextual enrichment made the child index redundant for their doc shapes.
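
For readers reproducing this kind of eval, here is a minimal sketch of recall@k and MRR over gold chunk IDs. It uses one common definition (fraction of gold chunks found in the top k, averaged over queries); the source does not say which exact definition Harbor Support used, and the function names are assumptions.

```python
def recall_at_k(retrieved: list[str], gold: set[str], k: int) -> float:
    """Fraction of gold chunk IDs that appear in the top-k retrieved IDs."""
    if not gold:
        return 0.0
    return len(set(retrieved[:k]) & gold) / len(gold)


def reciprocal_rank(retrieved: list[str], gold: set[str]) -> float:
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def evaluate(runs: list[tuple[list[str], set[str]]], k: int = 10) -> dict[str, float]:
    """Average both metrics over (retrieved IDs, gold IDs) pairs, one per query."""
    if not runs:
        raise ValueError("evaluate() needs at least one query")
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
    }
```

Running the same function over dense-only, BM25-only, and hybrid result lists gives directly comparable baselines.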

Technique decision table

Goal	Prefer	Avoid
Long PDFs with cross-references	Contextual embeddings + BM25 hybrid	Naive fixed chunks + dense-only
Exact-ID lookup (SKUs, error codes)	BM25 hybrid on enriched text + metadata filters	Contextual embeddings alone
Strict ingest budget, small corpus	Parent-child or sentence-window indexes	Full-corpus LLM enrichment
Frequently edited wiki pages	Cached enrichment keyed by doc revision hash	Re-embedding entire corpus on every edit
Multilingual KB	Enrichment in source language; multilingual embedder	English-only context prompts on non-English docs
Regulated audit trail	Store ctx, prompt version, model ID per chunk	Ephemeral enrichment with no reproducibility log
Already using query expansion (HyDE)	Contextual index + lighter expansion	Stacking HyDE and enrichment without ablation

Common pitfalls

Feeding enriched text to the generator — duplicates tokens and can leak preamble phrasing into answers; retrieve enriched, generate on raw chunk.
Enrichment hallucination — models invent section numbers; validate ctx against parse tree headings or reject outliers.
Parent doc truncation blind spots — chunk from page 40 of a 200-page PDF gets wrong context if you only pass the first 8k tokens; pass a local window around the chunk instead.
Skipping hybrid after enrichment — dense recall improves but lexical tails still need BM25.
Reusing old hybrid weights — re-tune fusion after any index enrichment change.
Enriching boilerplate chunks — copyright footers and nav chrome waste LLM calls; filter before enrichment.
No prompt versioning — silent drift when you change the situating template without re-indexing.
Evaluating only on paraphrase queries — include exact-match and typo queries in the eval set or hybrid tuning lies.
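
Two of these pitfalls lend themselves to small guards, sketched below under stated assumptions. local_window selects tokens around the chunk instead of the document head, and unsupported_section_refs flags section numbers in ctx that the parsed headings do not contain. Both function names, the pre-tokenized input, and the "Section N.N" pattern are hypothetical choices for illustration.

```python
import re


def local_window(doc_tokens: list[str], chunk_start: int,
                 chunk_end: int, budget: int) -> list[str]:
    """Return up to `budget` tokens roughly centred on [chunk_start, chunk_end)."""
    size = max(budget, chunk_end - chunk_start)  # never cut into the chunk itself
    spare = size - (chunk_end - chunk_start)
    start = max(chunk_start - spare // 2, 0)
    end = min(start + size, len(doc_tokens))
    start = max(end - size, 0)  # near the document end, shift the window left
    return doc_tokens[start:end]


SECTION_REF = re.compile(r"\bSection\s+(\d+(?:\.\d+)*)", re.IGNORECASE)


def unsupported_section_refs(ctx: str, known_sections: set[str]) -> set[str]:
    """Section numbers mentioned in ctx that are absent from the parsed headings."""
    return {ref for ref in SECTION_REF.findall(ctx) if ref not in known_sections}
```

A non-empty result from unsupported_section_refs is a reasonable signal to regenerate or reject the preamble rather than index it.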

Production checklist

Build a labeled eval set (300+ queries) with gold chunk IDs before changing indexes.
Measure recall@5, recall@10, and MRR on dense-only, BM25-only, and hybrid baselines.
Design enrichment prompt with document title, section path, dates, and audience.
Pass a local document window around each chunk rather than always truncating to the document head.
Cache enrichment output keyed by document revision + chunk offset + prompt version.
Store raw chunk, context block, and enriched string as separate metadata fields.
Rebuild both vector and BM25 indexes on the enriched string.
Re-tune hybrid fusion weights on a held-out query split.
Run end-to-end answer accuracy eval, not retrieval metrics alone.
Log enrichment model, prompt version, and timestamp for audit reproducibility.

Key takeaways

Contextual retrieval fixes the “orphan chunk” failure mode by prepending LLM-generated situating context before embedding and indexing.
Pair contextual embeddings with BM25 hybrid search — dense gains on paraphrase, sparse gains on exact tokens and IDs.
Enrichment is an index-time cost; query latency stays flat as long as it runs offline, and caching plus batching keep the ingest bill down.
Generators should cite raw chunk text; enriched strings are for retrieval channels only.
Harbor Support lifted recall@10 from 61% to 84% and answer accuracy from 71% to 79% with Haiku enrichment plus RRF hybrid — without changing the answer model.