Guide

LLM RAG chunking strategies explained

Harbor Support indexed 18,400 help-center articles for a customer-facing refund and billing bot. The first pipeline used a naive 512-token fixed splitter with zero overlap: fast to implement, easy to reason about, and disastrous in production. Refund eligibility lived in three-sentence paragraphs that straddled chunk boundaries — the retriever surfaced chunk A (“Refunds are available for…”) without chunk B (“…within 30 days of delivery unless the item is digital”). Golden-set Recall@5 on billing questions was 61%; 34% of misses traced directly to mid-sentence splits. After migrating to heading-aware semantic chunks, 15% overlap on procedural articles, and a parent-child index where small child chunks retrieve but parent sections feed the generator, Recall@5 rose to 89% and retrieval misses fell to 8% — same embedding model, same reranker, no prompt rewrite.

Chunking is the step most RAG tutorials skip and most production incidents blame last. It decides what unit of text your embedding model sees, what metadata filters can target, and whether a user question that mentions “Pro plan downgrade” can ever land on the paragraph that defines proration rules. This guide covers fixed vs semantic splitting, token budgets and overlap, structure-aware parsers (headings, tables, code), parent-child and small-to-big retrieval, chunk metadata design, overlap with contextual compression, the Harbor Support refactor, a technique decision table, pitfalls, and a checklist.

What chunking is (and is not)

Chunking splits source documents into retrieval units before embedding and indexing. It is not summarization (you keep verbatim text), not deduplication (though overlap creates near-duplicates you must manage), and not the same as context window budgeting for the generator — though the two interact when parent sections are passed downstream.

A chunk should be the smallest self-contained passage that still answers a class of questions. Too small and embeddings lose topic signal (“See Section 4.2” retrieves without Section 4.2). Too large and irrelevant sentences dilute the vector; the right paragraph ranks below a vaguely related page. Chunking is the primary lever for recall before you spend money on bigger models or cross-encoder rerankers.

Fixed-size chunking

Fixed chunking splits on character or token count with optional separator priority (paragraph break, newline, space). Libraries like LangChain's RecursiveCharacterTextSplitter walk a separator list until each piece fits the budget.

When fixed chunks work

  • Homogeneous prose with short paragraphs (news, blog posts).
  • Prototypes where speed matters more than recall.
  • Corpora already segmented by the author (one idea per paragraph).

Token budget starting points

For English help docs with modern embedding models (8k+ context), a common starting band is 256–512 tokens per chunk with 10–20% overlap. Harbor Support's 512-token zero-overlap default failed because procedural articles averaged 380 tokens per numbered step — steps 3 and 4 often landed in adjacent chunks with neither chunk embedding the full rule.

Smaller chunks (128–256) improve precision for FAQ-style Q&A but hurt questions that need two adjacent sentences. Measure on a golden set; do not copy blog defaults without testing.

Semantic and structure-aware chunking

Semantic chunking groups sentences or paragraphs until adding the next unit would drop embedding similarity below a threshold. It adapts chunk length to content density: a dense legal definition stays tight; a narrative onboarding guide expands.

Structure beats raw similarity

In practice, document structure often beats pure semantic clustering:

  • Heading hierarchy — split on H1/H2/H3; never merge unrelated sections under one parent heading.
  • List integrity — keep ordered lists intact; split between list items, not inside them.
  • Tables — serialize rows with headers per table extraction guidance; one pricing row per chunk for SKU lookups.
  • Code blocks — chunk per function or example; include language tag in metadata.

Harbor Support's semantic pass used markdown heading boundaries first, then sentence similarity only within each section. That alone recovered 19 points of Recall@5 before overlap or parent-child changes.

Overlap, parent-child, and small-to-big retrieval

Overlap windows

Overlap duplicates tail sentences from chunk N at the head of chunk N+1 so boundary concepts appear in both vectors. Typical overlap is 50–100 tokens on 400-token chunks. Tradeoff: higher recall on boundary facts, larger index size, and more reranker work. Pair overlap with dedup hygiene if the same sentence appears in dozens of chunks.

Parent-child indexes

Parent-child (or small-to-big) indexing embeds child chunks (128–256 tokens) for precise retrieval but returns the parent section (full H2 subtree, 800–2k tokens) to the generator. Children carry parent_id metadata; on hit, the pipeline fetches the parent text (or concatenates siblings).

This pattern fixed Harbor's refund policy misses: the child chunk “digital goods are non-refundable” retrieved with high similarity to “Can I refund a download?”, and the parent section supplied surrounding exceptions the model needed for a complete answer. Recall@5 gains came from children; faithfulness gains came from parents.

Sentence-window retrieval

A lighter variant stores single-sentence (or single-paragraph) embeddings but expands to ±2 neighboring sentences at query time. Lower index duplication than full overlap, but requires fast neighbor lookup by document position.

Metadata and chunk headers

Raw chunks lose document context when embedded alone. Prepend a context header to each chunk before embedding:

Document: Billing FAQ | Section: Pro plan downgrades
Chunk 2 of 4

When you downgrade from Pro to Basic mid-cycle...

Store filterable metadata separately: product, locale, doc_version, last_updated, content_type. Headers improve embedding quality; metadata enables pre-filtering before vector search. Harbor added article_id and heading_path so support agents could trace bad answers to the exact source paragraph in the CMS.

Version metadata pairs with freshness decay so stale chunks from superseded policies rank lower after a doc update without a full re-embed of the corpus.

Harbor Support refactor walkthrough

The team changed four things in sequence, measuring Recall@5 and faithfulness after each:

  1. Heading-aware splits on markdown source — Recall@5: 61% → 80%.
  2. 15% token overlap on procedural articles only — 80% → 85%.
  3. Parent-child index for articles over 1,200 tokens — 85% → 89%; faithfulness 72% → 88%.
  4. Context headers with product and locale — precision@3 +6 points on multi-product queries.

They deliberately did not change the embedding model or generator until chunking stabilized — proving the failure was upstream. Re-embedding 18k articles took under two hours on a batch job; tuning chunk size without re-embed would have been impossible.

Technique decision table

Technique Best for Weak when Harbor-style signal
Fixed token chunks Homogeneous prose, fast prototypes Procedures, tables, cross-paragraph rules High recall miss at sentence boundaries
Semantic similarity chunks Mixed-density essays, research PDFs Already structured docs with clear headings Chunks merge unrelated topics
Heading/structure splits Docs, wikis, policies, API references Scanned PDFs without OCR structure Recall jumps when H2 boundaries respected
Parent-child index Long sections, need precision + context Tiny FAQs where whole page fits in context Good child hit, incomplete generator answer
Whole-document embed Short emails, tickets under 500 tokens Handbooks, code repos, multi-topic pages Right doc ranks, wrong paragraph ignored
High overlap (30%+) Boundary-heavy legal clauses Cost-sensitive index size Index 3× larger with marginal recall gain

Common pitfalls

  • Zero overlap on procedural text — Harbor's root cause; numbered steps split across chunks.
  • Chunking PDF layout text — columns and footers interleave; OCR without reading order destroys semantics.
  • Embedding without context headers — “It costs $29” retrieves without which product.
  • One global chunk size — FAQs want 200 tokens; policy manuals want 600+ parents.
  • Ignoring tables and code — treat as first-class chunk types, not prose afterthoughts.
  • Rechunk without re-embed — stale vectors point at deleted chunk boundaries.
  • Massive overlap without dedup — reranker drowns in near-identical hits.
  • Tuning chunk size before golden set — teams chase MTEB leaderboard defaults that do not match their corpus.

Engineer checklist

  • Inventory document types: prose, procedures, tables, code, tickets.
  • Build a 50–200 question golden set with known source passages.
  • Baseline fixed chunks (256/512 tokens) with 10–15% overlap.
  • Add structure-aware splits on headings and list boundaries.
  • Evaluate Recall@k and MRR per doc type; do not average blindly.
  • Introduce parent-child for sections over ~1,000 tokens.
  • Prepend context headers before embedding; store filter metadata.
  • Version chunks with doc_version for incremental re-embed.
  • Log chunk_id on each retrieval for support escalation traces.
  • Re-run evaluation after CMS template changes.
  • Pair with hybrid BM25 when exact tokens (SKUs, error codes) matter.
  • Document chunk params in the index manifest for reproducibility.

Key takeaways

  • Chunking is a recall lever — Harbor cut retrieval misses 34% → 8% before changing models.
  • Structure beats blind token splits on docs, wikis, and policies.
  • Parent-child balances precision and context for long sections.
  • Overlap fixes boundary facts but inflates index size; tune per content type.
  • Measure per stage with a golden set — chunking errors look like bad embeddings until you split the metrics.

Related reading