Guide
LLM RAG chunking strategies explained
Harbor Support indexed 18,400 help-center articles for a customer-facing refund and billing bot. The first pipeline used a naive 512-token fixed splitter with zero overlap: fast to implement, easy to reason about, and disastrous in production. Refund eligibility lived in three-sentence paragraphs that straddled chunk boundaries — the retriever surfaced chunk A (“Refunds are available for…”) without chunk B (“…within 30 days of delivery unless the item is digital”). Golden-set Recall@5 on billing questions was 61%; 34% of misses traced directly to mid-sentence splits. After migrating to heading-aware semantic chunks, 15% overlap on procedural articles, and a parent-child index where small child chunks retrieve but parent sections feed the generator, Recall@5 rose to 89% and retrieval misses fell to 8% — same embedding model, same reranker, no prompt rewrite.
Chunking is the step most RAG tutorials skip and most production incidents blame last. It decides what unit of text your embedding model sees, what metadata filters can target, and whether a user question that mentions “Pro plan downgrade” can ever land on the paragraph that defines proration rules. This guide covers fixed vs semantic splitting, token budgets and overlap, structure-aware parsers (headings, tables, code), parent-child and small-to-big retrieval, chunk metadata design, overlap with contextual compression, the Harbor Support refactor, a technique decision table, pitfalls, and a checklist.
What chunking is (and is not)
Chunking splits source documents into retrieval units before embedding and indexing. It is not summarization (you keep verbatim text), not deduplication (though overlap creates near-duplicates you must manage), and not the same as context window budgeting for the generator — though the two interact when parent sections are passed downstream.
A chunk should be the smallest self-contained passage that still answers a class of questions. Too small and embeddings lose topic signal (“See Section 4.2” retrieves without Section 4.2). Too large and irrelevant sentences dilute the vector; the right paragraph ranks below a vaguely related page. Chunking is the primary lever for recall before you spend money on bigger models or cross-encoder rerankers.
Fixed-size chunking
Fixed chunking splits on character or token count with optional
separator priority (paragraph break, newline, space). Libraries like
LangChain's RecursiveCharacterTextSplitter walk a
separator list until each piece fits the budget.
When fixed chunks work
- Homogeneous prose with short paragraphs (news, blog posts).
- Prototypes where speed matters more than recall.
- Corpora already segmented by the author (one idea per paragraph).
Token budget starting points
For English help docs with modern embedding models (8k+ context), a common starting band is 256–512 tokens per chunk with 10–20% overlap. Harbor Support's 512-token zero-overlap default failed because procedural articles averaged 380 tokens per numbered step — steps 3 and 4 often landed in adjacent chunks with neither chunk embedding the full rule.
Smaller chunks (128–256) improve precision for FAQ-style Q&A but hurt questions that need two adjacent sentences. Measure on a golden set; do not copy blog defaults without testing.
Semantic and structure-aware chunking
Semantic chunking groups sentences or paragraphs until adding the next unit would drop embedding similarity below a threshold. It adapts chunk length to content density: a dense legal definition stays tight; a narrative onboarding guide expands.
Structure beats raw similarity
In practice, document structure often beats pure semantic clustering:
- Heading hierarchy — split on H1/H2/H3; never merge unrelated sections under one parent heading.
- List integrity — keep ordered lists intact; split between list items, not inside them.
- Tables — serialize rows with headers per table extraction guidance; one pricing row per chunk for SKU lookups.
- Code blocks — chunk per function or example; include language tag in metadata.
Harbor Support's semantic pass used markdown heading boundaries first, then sentence similarity only within each section. That alone recovered 19 points of Recall@5 before overlap or parent-child changes.
Overlap, parent-child, and small-to-big retrieval
Overlap windows
Overlap duplicates tail sentences from chunk N at the head of chunk N+1 so boundary concepts appear in both vectors. Typical overlap is 50–100 tokens on 400-token chunks. Tradeoff: higher recall on boundary facts, larger index size, and more reranker work. Pair overlap with dedup hygiene if the same sentence appears in dozens of chunks.
Parent-child indexes
Parent-child (or small-to-big) indexing embeds
child chunks (128–256 tokens) for precise retrieval but
returns the parent section (full H2 subtree, 800–2k
tokens) to the generator. Children carry parent_id metadata;
on hit, the pipeline fetches the parent text (or concatenates siblings).
This pattern fixed Harbor's refund policy misses: the child chunk “digital goods are non-refundable” retrieved with high similarity to “Can I refund a download?”, and the parent section supplied surrounding exceptions the model needed for a complete answer. Recall@5 gains came from children; faithfulness gains came from parents.
Sentence-window retrieval
A lighter variant stores single-sentence (or single-paragraph) embeddings but expands to ±2 neighboring sentences at query time. Lower index duplication than full overlap, but requires fast neighbor lookup by document position.
Metadata and chunk headers
Raw chunks lose document context when embedded alone. Prepend a context header to each chunk before embedding:
Document: Billing FAQ | Section: Pro plan downgrades
Chunk 2 of 4
When you downgrade from Pro to Basic mid-cycle...
Store filterable metadata separately: product,
locale, doc_version, last_updated,
content_type. Headers improve embedding quality; metadata
enables pre-filtering before vector search. Harbor added
article_id and heading_path so support agents
could trace bad answers to the exact source paragraph in the CMS.
Version metadata pairs with freshness decay so stale chunks from superseded policies rank lower after a doc update without a full re-embed of the corpus.
Harbor Support refactor walkthrough
The team changed four things in sequence, measuring Recall@5 and faithfulness after each:
- Heading-aware splits on markdown source — Recall@5: 61% → 80%.
- 15% token overlap on procedural articles only — 80% → 85%.
- Parent-child index for articles over 1,200 tokens — 85% → 89%; faithfulness 72% → 88%.
- Context headers with product and locale — precision@3 +6 points on multi-product queries.
They deliberately did not change the embedding model or generator until chunking stabilized — proving the failure was upstream. Re-embedding 18k articles took under two hours on a batch job; tuning chunk size without re-embed would have been impossible.
Technique decision table
| Technique | Best for | Weak when | Harbor-style signal |
|---|---|---|---|
| Fixed token chunks | Homogeneous prose, fast prototypes | Procedures, tables, cross-paragraph rules | High recall miss at sentence boundaries |
| Semantic similarity chunks | Mixed-density essays, research PDFs | Already structured docs with clear headings | Chunks merge unrelated topics |
| Heading/structure splits | Docs, wikis, policies, API references | Scanned PDFs without OCR structure | Recall jumps when H2 boundaries respected |
| Parent-child index | Long sections, need precision + context | Tiny FAQs where whole page fits in context | Good child hit, incomplete generator answer |
| Whole-document embed | Short emails, tickets under 500 tokens | Handbooks, code repos, multi-topic pages | Right doc ranks, wrong paragraph ignored |
| High overlap (30%+) | Boundary-heavy legal clauses | Cost-sensitive index size | Index 3× larger with marginal recall gain |
Common pitfalls
- Zero overlap on procedural text — Harbor's root cause; numbered steps split across chunks.
- Chunking PDF layout text — columns and footers interleave; OCR without reading order destroys semantics.
- Embedding without context headers — “It costs $29” retrieves without which product.
- One global chunk size — FAQs want 200 tokens; policy manuals want 600+ parents.
- Ignoring tables and code — treat as first-class chunk types, not prose afterthoughts.
- Rechunk without re-embed — stale vectors point at deleted chunk boundaries.
- Massive overlap without dedup — reranker drowns in near-identical hits.
- Tuning chunk size before golden set — teams chase MTEB leaderboard defaults that do not match their corpus.
Engineer checklist
- Inventory document types: prose, procedures, tables, code, tickets.
- Build a 50–200 question golden set with known source passages.
- Baseline fixed chunks (256/512 tokens) with 10–15% overlap.
- Add structure-aware splits on headings and list boundaries.
- Evaluate Recall@k and MRR per doc type; do not average blindly.
- Introduce parent-child for sections over ~1,000 tokens.
- Prepend context headers before embedding; store filter metadata.
- Version chunks with
doc_versionfor incremental re-embed. - Log chunk_id on each retrieval for support escalation traces.
- Re-run evaluation after CMS template changes.
- Pair with hybrid BM25 when exact tokens (SKUs, error codes) matter.
- Document chunk params in the index manifest for reproducibility.
Key takeaways
- Chunking is a recall lever — Harbor cut retrieval misses 34% → 8% before changing models.
- Structure beats blind token splits on docs, wikis, and policies.
- Parent-child balances precision and context for long sections.
- Overlap fixes boundary facts but inflates index size; tune per content type.
- Measure per stage with a golden set — chunking errors look like bad embeddings until you split the metrics.
Related reading
- RAG evaluation and retrieval metrics explained — Recall@k, faithfulness, and golden-set design
- LLM embeddings explained — how vectors represent chunked text
- Contextual compression for RAG explained — shrinking retrieved context after chunk selection
- RAG document deduplication explained — managing overlap and near-duplicate chunks