
LLM sentence window retrieval explained

Harbor Support’s billing FAQ bot used 512-token chunks across a 2,400-page knowledge base. A user asked “Can I downgrade mid-cycle and keep my annual discount?” Dense retrieval returned a chunk about upgrade proration because “annual discount” appeared in the same paragraph. The model answered confidently that downgrades forfeit the discount — the opposite of policy. On 95 billing-intent probes, wrong-section answers hit 29% even though the correct sentence existed in the index.

Engineers switched to sentence-window retrieval: embed each sentence individually for search precision, but attach a precomputed neighbor window (five sentences before and after) in chunk metadata. At query time the retriever ranks sentences; the generator receives the expanded window text, not the isolated hit. Wrong-section answers fell to 7%; recall@5 on billing probes rose from 64% to 88%. This guide covers window architecture, sizing and overlap, deduplication when adjacent sentences co-rank, integration with cross-encoder reranking, the Harbor Support refactor, a technique decision table versus parent-child chunking and fixed-size splits, pitfalls, and a production checklist.

Why large chunks dilute retrieval precision

Embedding models compress an entire passage into one vector. When a chunk spans three unrelated procedures — upgrade rules, downgrade exceptions, and refund timelines — the vector becomes an average of all three topics. A query about downgrades may score 0.79 against that chunk because “annual discount” co-occurs with upgrade language, beating a narrower sentence that actually states the downgrade exception at 0.77.
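
A toy illustration of the averaging effect, assuming made-up three-dimensional topic vectors rather than real embeddings. Real encoders do not literally take a mean of topics, so treat this as a simplification of the idea, not a model of any particular embedder:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical one-hot "topic" directions; not real embeddings.
upgrade = [1.0, 0.0, 0.0]
downgrade = [0.0, 1.0, 0.0]
refund = [0.0, 0.0, 1.0]

# A chunk covering all three procedures, modeled as the mean of the topics.
chunk = [sum(vals) / 3 for vals in zip(upgrade, downgrade, refund)]

chunk_vs_downgrade = cosine(downgrade, chunk)     # query about downgrades
chunk_vs_upgrade = cosine(upgrade, chunk)         # query about upgrades
sentence_vs_downgrade = cosine(downgrade, downgrade)  # single-topic sentence
```

In this sketch the mixed chunk scores the same against a downgrade query and an upgrade query, so its vector cannot tell the two intents apart, while the single-topic sentence matches its own query exactly.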

Shrinking chunks helps precision but starves the generator. A lone sentence such as “Downgrades retain promotional pricing through the current term” lacks the qualifying clause that appears two sentences later: “unless the plan tier crosses Enterprise.” Sentence-window retrieval splits the problem: search small, synthesize big.

Sentence-window architecture

At index time, the pipeline:

  1. Sentence-segment each document with a deterministic splitter (spaCy, NLTK, or layout-aware rules for PDFs). Preserve doc_id, sentence_index, and section headings.
  2. Embed the target sentence — the single sentence whose vector enters the ANN index.
  3. Precompute the window — concatenate window_before + target + window_after into a metadata field (commonly window_text). Store character offsets for citation highlighting.
  4. Index one vector per sentence with payload pointing to the window blob. Do not embed the full window unless you explicitly want dual signals.
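
The index-time steps above can be sketched as follows. This is a minimal version that assumes sentences are already segmented; the SentenceRecord class, the build_records helper, and the sample sentences are hypothetical names for illustration, and character offsets are left out for brevity (sentence index ranges stand in for them):

```python
from dataclasses import dataclass

@dataclass
class SentenceRecord:
    doc_id: str
    sentence_index: int
    text: str          # the only field that gets embedded
    window_text: str   # neighbors + target, stored as payload
    window_start: int  # index of the first sentence in the window
    window_end: int    # index of the last sentence in the window (inclusive)

def build_records(doc_id, sentences, radius=5):
    """Create one record per sentence with a precomputed neighbor window."""
    records = []
    for i, sentence in enumerate(sentences):
        start = max(0, i - radius)                  # clamp at document start
        end = min(len(sentences) - 1, i + radius)   # clamp at document end
        window_text = " ".join(sentences[start:end + 1])
        records.append(SentenceRecord(doc_id, i, sentence, window_text, start, end))
    return records

# Illustrative sentences, not taken from a real knowledge base.
sentences = [
    "Plans renew annually.",
    "Downgrades retain promotional pricing through the current term.",
    "Refunds are prorated.",
    "This applies unless the plan tier crosses Enterprise.",
]
records = build_records("billing-faq", sentences, radius=1)
```

Only records[i].text would be sent to the embedding model; window_text travels with the vector as payload.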

At query time:

  1. Retrieve top-k sentences by cosine similarity (or by a hybrid of BM25 and dense scores over the sentence text).
  2. Deduplicate overlapping windows — if sentences 42 and 43 both rank, merge their windows rather than sending duplicate paragraphs to the LLM.
  3. Pass deduplicated window_text blocks to the generator prompt (often 3–5 windows after merge).
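
One way to implement the merge in step 2 is to treat each hit's window as an inclusive sentence-index range and fuse ranges that overlap or touch. The sketch below assumes each hit carries doc_id, window_start, window_end, and score fields; those field names and the merge_hits helper are hypothetical:

```python
def merge_hits(hits):
    """Merge hits whose window ranges overlap or touch within one document.

    Returns merged spans sorted by their best score, highest first.
    """
    merged = []
    for hit in sorted(hits, key=lambda h: (h["doc_id"], h["window_start"])):
        last = merged[-1] if merged else None
        same_doc = last is not None and last["doc_id"] == hit["doc_id"]
        if same_doc and hit["window_start"] <= last["window_end"] + 1:
            # Extend the previous span and keep the best score seen so far.
            last["window_end"] = max(last["window_end"], hit["window_end"])
            last["score"] = max(last["score"], hit["score"])
        else:
            merged.append({
                "doc_id": hit["doc_id"],
                "window_start": hit["window_start"],
                "window_end": hit["window_end"],
                "score": hit["score"],
            })
    return sorted(merged, key=lambda span: span["score"], reverse=True)

hits = [
    {"doc_id": "billing-faq", "window_start": 37, "window_end": 47, "score": 0.81},  # sentence 42, radius 5
    {"doc_id": "billing-faq", "window_start": 38, "window_end": 48, "score": 0.78},  # sentence 43, radius 5
    {"doc_id": "refunds", "window_start": 0, "window_end": 6, "score": 0.74},
]
spans = merge_hits(hits)
```

Here the windows for sentences 42 and 43 collapse into one span covering sentences 37 to 48, so the generator sees that passage once.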

Frameworks like LlamaIndex expose this as SentenceWindowNodeParser, with MetadataReplacementPostProcessor swapping the node text from the sentence to its window after retrieval. The pattern is framework-agnostic. The invariant: the vector represents the single sentence (the needle), and the generator receives the surrounding context (the thread).

Window sizing and overlap trade-offs

Window radius is measured in sentences, not tokens — but token budget still caps what you ship to the model.

  • Small window (±1 sentence) — minimal token cost; risks missing qualifying clauses that sit just outside the radius. Good for glossary definitions and atomic facts.
  • Medium window (±3 to ±5 sentences) — Harbor Support’s default. Captures most procedural exceptions without blowing the context budget when three hits merge.
  • Large window (±10 sentences or full paragraph) — approaches parent-child behavior; diminishing precision gains unless documents are very dense legalese.
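
To see how radius interacts with the token budget, a rough upper bound can be computed before any tuning. The sketch assumes a flat average of 25 tokens per sentence, which is an invented figure; measure the real average on your own corpus:

```python
def worst_case_tokens(radius, hits, avg_tokens_per_sentence):
    """Upper bound on context tokens when no retrieved windows overlap."""
    sentences_per_window = 2 * radius + 1  # neighbors on both sides plus the target
    return hits * sentences_per_window * avg_tokens_per_sentence

AVG_TOKENS = 25  # assumed average; not a measured value

budget_by_radius = {
    radius: worst_case_tokens(radius, hits=5, avg_tokens_per_sentence=AVG_TOKENS)
    for radius in (1, 5, 10)
}
```

Under these assumptions, five non-overlapping hits cost 375 tokens at ±1, 1,375 at ±5, and 2,625 at ±10. Merging overlapping windows only lowers these numbers.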

Overlap between adjacent indexed sentences is implicit: sentence n and sentence n+1 share most of their windows. That redundancy helps recall (either sentence can surface the answer) but requires merge logic at retrieval. A simple rule: if the windows of two hits share more than 60% of their tokens, keep only the higher-scoring hit's window. When both hits come from neighboring sentences in the same document, merging their index ranges into one span is the alternative.
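
A minimal sketch of the 60% rule follows. The text does not define how overlap is measured, so this version assumes whitespace tokenization and divides the shared unique tokens by the smaller window's unique token count; both choices, and the helper names, are assumptions:

```python
def token_overlap(a, b):
    """Share of the smaller window's unique tokens that also appear in the other."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / min(len(tokens_a), len(tokens_b))

def dedupe_by_overlap(hits, threshold=0.6):
    """Keep the higher-scoring hit whenever two windows overlap above threshold."""
    kept = []
    for hit in sorted(hits, key=lambda h: h["score"], reverse=True):
        if all(token_overlap(hit["window_text"], k["window_text"]) <= threshold for k in kept):
            kept.append(hit)
    return kept

# Illustrative hits; window texts are shortened for readability.
hits = [
    {"score": 0.81, "window_text": "Downgrades retain promotional pricing through the current term"},
    {"score": 0.78, "window_text": "promotional pricing through the current term unless the tier changes"},
    {"score": 0.70, "window_text": "Refunds are issued within ten business days"},
]
kept = dedupe_by_overlap(hits)
```

The second hit shares six of the first window's eight unique tokens (0.75), so it is dropped; the unrelated refund window survives.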

For tables and bullet lists, sentence segmentation fails. Fall back to row-level or list-item units as the “sentence,” with the surrounding list block as window — or route table-heavy pages to table extraction instead of naive sentence splitting.
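
A sketch of the list-item fallback, assuming plain-text input where bullets start with "-", "*", "•", or a number. The regex sentence splitter here is deliberately naive and stands in for a real segmenter such as spaCy or NLTK; segment_units is a hypothetical helper:

```python
import re

# Matches a leading bullet marker or a numbered-list marker such as "1." or "2)".
BULLET = re.compile(r"^\s*(?:[-*•]|\d+[.)])\s+")

def segment_units(block):
    """Split a text block into retrieval units.

    Bullet or numbered lines become one unit each; other lines are split
    on sentence-ending punctuation with a naive regex.
    """
    units = []
    for line in block.splitlines():
        if not line.strip():
            continue
        if BULLET.match(line):
            units.append(BULLET.sub("", line).strip())
        else:
            units.extend(s for s in re.split(r"(?<=[.!?])\s+", line.strip()) if s)
    return units

block = """Refund timelines vary. Check the plan type first.
• Monthly plans: 5 business days
• Annual plans: 10 business days"""
units = segment_units(block)
```

Each list item becomes its own searchable unit, and the whole block can then serve as the window for every unit inside it.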

Harbor Support refactor (worked example)

Before: 512-token fixed chunks, single index, no reranker. Billing FAQ recall@5 was 64%; wrong-section rate 29%; average prompt context 4.2k tokens (often redundant).

After sentence windows (±5 sentences, hybrid BM25 + dense RRF, BGE reranker on window text):

  • Recall@5 on billing probes: 64% → 88%
  • Wrong-section answers: 29% → 7%
  • Indexed vectors: 1.1M chunks → 4.8M sentences (+336% storage, acceptable on pgvector with HNSW)
  • p95 retrieval latency: +19 ms, attributed to the larger sentence index
  • Generator input tokens: 4.2k → 2.1k average after window dedup
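
The relative changes behind these figures can be checked with a few lines of arithmetic. The inputs are the numbers reported above; pct_change is a helper introduced here for illustration:

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100

vector_growth = pct_change(1.1e6, 4.8e6)   # indexed vectors
wrong_section_change = pct_change(29, 7)   # wrong-section rate
token_change = pct_change(4200, 2100)      # generator input tokens
recall_gain_points = 88 - 64               # percentage points, not percent
```

This gives roughly +336% vectors, about a 76% relative drop in wrong-section answers, a 50% drop in generator input tokens, and a 24-point recall gain.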

The win was not a new embedding model — it was aligning retrieval granularity with how users phrase questions while preserving synthesis granularity the LLM needs.
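
The refactor fuses BM25 and dense rankings with reciprocal rank fusion (RRF). A minimal sketch of RRF is below; the constant k=60 is a commonly used default rather than a value stated for Harbor Support, and the sentence ids are invented:

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of ids by summing 1 / (k + rank) across lists."""
    scores = {}
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] = scores.get(item_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical sentence ids returned by each first-stage retriever.
bm25_ranking = ["s42", "s17", "s90"]
dense_ranking = ["s43", "s42", "s17"]
fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
```

Sentences that appear in both lists (s42, s17) rise above those found by only one retriever, which is the behavior that helps proper-noun-heavy queries.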

Technique decision table

Approach | Best when | Weak when
Fixed-size token chunks (512–1024) | Uniform prose, low section density, storage-sensitive | Multi-topic paragraphs, procedural docs with exceptions
Sentence-window retrieval | FAQ, policy, runbooks; questions target specific claims; dynamic context radius | Heavily structured tables; very short docs where windows exceed doc length
Parent-child (small-to-big) | Stable parent boundaries (sections, pages); explicit parent text differs from child | Parents drift when CMS restructures; need separate parent store
Contextual retrieval enrichment | Chunks lack situating headers; corpus-wide disambiguation needed at index time | Already sentence-precise; enrichment cost per re-index
Late chunking / long-context embedders | Single embed pass over full doc; boundary-aware models available | Legacy bi-encoders; strict latency budgets on ingest
Cross-encoder only (no window) | Tiny corpora; rerank budget covers full passages | Millions of sentences; p95 latency explodes without bi-encoder first stage

Sentence windows pair naturally with contextual retrieval: prepend a situating sentence to each indexed sentence before embedding, while still expanding the raw window for the generator.
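
A sketch of that pairing, assuming the situating sentence has already been produced elsewhere (for example by an LLM at index time). The make_index_entry helper, its field names, and the sample strings are hypothetical:

```python
def make_index_entry(situating_sentence, sentence, window_text):
    """Embed the situated sentence, but keep the raw window for generation."""
    return {
        # Text sent to the embedding model: context prefix plus target sentence.
        "embed_text": f"{situating_sentence} {sentence}",
        # Text sent to the generator: raw neighbors only, no synthetic prefix.
        "window_text": window_text,
    }

entry = make_index_entry(
    situating_sentence="From the billing FAQ, section on plan changes.",
    sentence="Downgrades retain promotional pricing through the current term.",
    window_text=(
        "Plans renew annually. "
        "Downgrades retain promotional pricing through the current term. "
        "Refunds are prorated."
    ),
)
```

Keeping the two fields separate means the synthetic prefix influences ranking without ever appearing in the text the model quotes from.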

Common pitfalls

  • Naive sentence splitting on PDFs — line breaks become false sentence boundaries; use layout-aware extraction first.
  • Embedding the window instead of the sentence — reintroduces averaging dilution; defeats the pattern.
  • No window deduplication — adjacent hits triple token cost and confuse the model with repeated paragraphs.
  • Window too small for legal qualifiers — “except as noted in Section 12” sits outside radius; tune per content class.
  • Ignoring section boundaries — windows bleed across unrelated H2 sections; clamp expansion at heading breaks.
  • Storing windows without offset metadata — citation highlighting breaks; keep start_char / end_char.
  • Skipping eval on merged windows — recall gains on sentences can hide context truncation regressions in the generator.
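
The section-boundary pitfall can be handled when the window is built. The sketch below assumes each sentence carries the id of the heading it belongs to; clamped_window and the section labels are hypothetical:

```python
def clamped_window(sentences, section_ids, index, radius):
    """Return an inclusive (start, end) range that never crosses a section.

    section_ids[i] names the heading that sentence i belongs to.
    """
    section = section_ids[index]
    start = index
    # Walk backwards while inside the radius and inside the same section.
    while start > 0 and index - (start - 1) <= radius and section_ids[start - 1] == section:
        start -= 1
    end = index
    # Walk forwards under the same two conditions.
    while end < len(sentences) - 1 and (end + 1) - index <= radius and section_ids[end + 1] == section:
        end += 1
    return start, end

sentences = ["a", "b", "c", "d", "e", "f"]
section_ids = ["upgrades", "upgrades", "downgrades", "downgrades", "downgrades", "refunds"]
span = clamped_window(sentences, section_ids, index=2, radius=3)
```

With a radius of 3, the window for sentence 2 would normally reach back into the upgrades section and forward into refunds; the clamp keeps it to sentences 2 through 4.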

Production checklist

  • Choose sentence splitter per format (HTML, Markdown, PDF); test on worst layouts.
  • Set window radius per content class; document token budget after merge.
  • Index sentence vectors only; store window_text and offsets in payload.
  • Implement overlap dedup before prompt assembly (token-overlap threshold or index range).
  • Clamp window expansion at section headings and table boundaries.
  • Use hybrid retrieval (BM25 on sentence text plus dense vectors) for proper-noun-heavy queries.
  • If you add a reranker, have it score window_text, not the isolated sentence.
  • Log sentence score, window token count, and merge decisions for debugging.
  • Build eval probes that require cross-sentence qualifiers, not single-sentence facts.
  • Monitor index size growth; plan an HNSW rebuild cadence for when the sentence count grows 3× or more.
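
For the eval items in the checklist, recall@k can be computed with a small helper. The probe format (a gold sentence id plus the ranked ids returned by the retriever) is an assumption for illustration, as are the sample ids:

```python
def recall_at_k(probes, k=5):
    """Fraction of probes whose gold sentence id appears in the top-k results."""
    if not probes:
        return 0.0
    found = sum(1 for p in probes if p["gold_id"] in p["retrieved_ids"][:k])
    return found / len(probes)

# Hypothetical probes; ids are invented.
probes = [
    {"gold_id": "s42", "retrieved_ids": ["s43", "s42", "s17"]},
    {"gold_id": "s90", "retrieved_ids": ["s12", "s13", "s14", "s15", "s16", "s90"]},  # gold at rank 6
    {"gold_id": "s07", "retrieved_ids": ["s07"]},
    {"gold_id": "s55", "retrieved_ids": []},
]
score = recall_at_k(probes, k=5)
```

For probes that depend on a cross-sentence qualifier, a stricter variant would count a hit only when the merged window also contains the qualifying sentence.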

Key takeaways

  • Large chunks average multiple topics into one vector; sentence-window retrieval searches needles and synthesizes threads.
  • Embed the target sentence, store neighbor context in metadata, deduplicate overlapping windows before the LLM prompt.
  • ±3 to ±5 sentences is the practical default for procedural docs; clamp at section boundaries and handle tables separately.
  • Harbor Support cut wrong-section answers from 29% to 7% with +19 ms p95 latency — a chunking fix, not a new model.
  • Compare against parent-child when parent boundaries are stable; use contextual enrichment when chunks need situating headers at index time.
