RAG chunking strategies explained

In retrieval-augmented generation, the embedding model and vector database get most of the attention, but practitioners who ship production RAG systems learn a quieter lesson: chunking strategy often affects answer quality more than which LLM you call or which reranker you bolt on. Split a policy manual at arbitrary 512-token boundaries and you bury the deductible table inside a chunk about office hours; split it by section heading and the same query retrieves the right paragraph on the first hop. This guide explains why chunks exist, how fixed and recursive splits work, when semantic and heading-aware chunking earn their compute cost, and how parent-child and small-to-big retrieval patterns work. It also covers overlap and metadata design, document-type playbooks, and how to measure whether your ingestion pipeline actually improves retrieval-augmented generation before you tune prompts or swap models.

Why chunking exists: the retrieval unit problem

Language models have finite context windows, and embedding models have practical input limits (often 512–8,192 tokens depending on the model). You cannot stuff an entire knowledge base into every prompt. Instead, you index many small retrieval units — chunks — and fetch only the top-k matches per query.

Each chunk is a trade-off between two failure modes:

  • Chunks too large — embeddings average over heterogeneous content; a single vector poorly represents a ten-page PDF. Retrieved text also burns context budget, leaving less room for the answer and conversation history.
  • Chunks too small — local context disappears. A chunk that says "it applies only when condition X holds" without the preceding definition of X is useless even if the embedding match score is high.

Good chunking preserves semantic coherence (one idea per unit), keeps enough local context for the generator to reason, and attaches metadata (source, heading, page) so citations and access control survive retrieval. The goal is not minimal storage — it is maximal recall of the right evidence at reasonable latency and cost.

Fixed-size and recursive character splitting

The default in most frameworks (LangChain, LlamaIndex, Haystack) is a fixed-size splitter: break text every N characters or tokens, optionally with overlap so sentences split across boundaries appear in two adjacent chunks.

A typical starting point for English prose:

  • Chunk size: 300–800 tokens (roughly 1,200–3,200 characters)
  • Overlap: 10–20% of chunk size (roughly 30–160 tokens for the sizes above)
  • Tokenizer: match your embedding model's tokenizer when possible
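As a minimal sketch of a fixed-size splitter with overlap: the function name fixed_size_chunks and the whitespace "tokenizer" are assumptions for illustration. In a real pipeline the token list would come from the embedding model's own tokenizer.

```python
def fixed_size_chunks(tokens, chunk_size=512, overlap=64):
    """Slide a window of chunk_size tokens, stepping by chunk_size - overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    stride = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks


# Whitespace splitting stands in for a real tokenizer (assumption).
words = "the quick brown fox jumps over the lazy dog again".split()
for chunk in fixed_size_chunks(words, chunk_size=4, overlap=1):
    print(" ".join(chunk))
```

With a window of 4 and an overlap of 1, the ten words come out as three chunks, and the last word of each chunk reappears as the first word of the next.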

Recursive splitting improves on blind fixed windows by trying separators in order — double newline, single newline, sentence boundary, space — before hard-cutting mid-word. That keeps paragraphs and list items intact when they fit within the budget. For markdown and HTML, parse structure first (headings, list items, code fences) and only recurse inside oversized sections.
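The separator cascade can be sketched as follows. This is an illustrative implementation, not the code of any particular framework: recursive_split, the separator list, and the character budget are assumptions, and a production version would measure size in tokens.

```python
SEPARATORS = ("\n\n", "\n", ". ", " ")


def recursive_split(text, max_chars=200, separators=SEPARATORS):
    """Split on the coarsest separator that yields pieces within max_chars."""
    if len(text) <= max_chars:
        return [text] if text.strip() else []
    if not separators:
        # No separator left: hard-cut as a last resort.
        return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, max_chars, finer)

    chunks, buffer = [], ""
    for piece in text.split(sep):
        candidate = piece if not buffer else buffer + sep + piece
        if len(candidate) <= max_chars:
            buffer = candidate  # keep packing pieces into the current chunk
            continue
        if buffer:
            chunks.append(buffer)
        if len(piece) <= max_chars:
            buffer = piece
        else:
            # The piece alone is too big: recurse with finer separators.
            chunks.extend(recursive_split(piece, max_chars, finer))
            buffer = ""
    if buffer:
        chunks.append(buffer)
    return chunks
```

Pieces are packed greedily, so short paragraphs share a chunk while an oversized paragraph is broken down by the next separator in the list. The separator that falls on a chunk boundary is dropped in this sketch.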

Fixed/recursive chunking is fast, deterministic, and good enough for homogeneous docs (support tickets, chat logs, short articles). It fails on long structured documents where section boundaries carry meaning — legal contracts, API references, medical guidelines — unless you combine it with heading-aware logic below.

Document-aware and heading-aware chunking

Document-aware chunking respects the source format instead of treating everything as a flat string:

  • Markdown / HTML: split on h1–h3 boundaries; keep code blocks atomic; never split a table row across chunks.
  • PDF: use layout-aware parsers (not a raw text dump) to preserve columns, tables, and footnotes; attach page numbers to metadata.
  • JSON / YAML config: one chunk per logical object or key path, not per arbitrary token count.
  • Source code: chunk by function, class, or file; include import context in metadata or a parent summary.
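For the source-code case, one way to chunk by function or class is to walk the syntax tree. The sketch below uses Python's standard ast module on Python source; the function name chunk_python_source and the metadata keys are assumptions chosen for illustration.

```python
import ast


def chunk_python_source(source, path="<memory>"):
    """Return one chunk per top-level function or class, with import context."""
    tree = ast.parse(source)
    imports = [
        ast.get_source_segment(source, node)
        for node in tree.body
        if isinstance(node, (ast.Import, ast.ImportFrom))
    ]
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                # Decorators sit above node.lineno, so they are not in this segment.
                "text": ast.get_source_segment(source, node),
                "metadata": {
                    "path": path,
                    "symbol": node.name,
                    "kind": type(node).__name__,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    "imports": imports,  # import context travels as metadata
                },
            })
    return chunks
```

Keeping the imports in metadata rather than in the chunk text leaves the embedded text focused on the function body while still letting the generator see what the code depends on.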

Heading-aware chunking prepends ancestor headings to each child chunk: a paragraph in the "Eligibility" subsection of "Refunds" ships as ## Refunds > ### Eligibility > [body text]. That gives the embedding model hierarchical signal and helps the generator disambiguate generic phrases ("the maximum limit") that appear in multiple sections.

Store section_path, heading_level, and source_url on every chunk. At query time you can filter ("only HR policies"), boost recent versions, or render citations as deep links to the exact section.
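A heading-aware splitter for Markdown might look like the sketch below. It is a simplified illustration: heading_aware_chunks and the returned dictionary keys are assumptions, only "#"-style h1–h3 headings are recognized, and fenced code blocks are kept with their section.

```python
import re

HEADING = re.compile(r"^(#{1,3})\s+(.*\S)\s*$")
FENCE = "`" * 3  # Markdown code-fence marker


def heading_aware_chunks(markdown, source_url=""):
    """Split on h1-h3 headings and prefix each chunk with its heading path."""
    chunks, path, body, in_fence = [], [], [], False

    def flush():
        text = "\n".join(body).strip()
        if text:
            section_path = " > ".join(title for _, title in path)
            chunks.append({
                "text": f"{section_path}\n\n{text}" if section_path else text,
                "section_path": section_path,
                "heading_level": path[-1][0] if path else 0,
                "source_url": source_url,
            })
        body.clear()

    for line in markdown.splitlines():
        if line.lstrip().startswith(FENCE):
            in_fence = not in_fence  # never treat '#' inside code as a heading
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()  # the buffered body belongs to the previous heading path
            level = len(match.group(1))
            # Drop headings at the same or deeper level, then push the new one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2)))
        else:
            body.append(line)
    flush()
    return chunks
```

Each chunk's text begins with a line such as "Refunds > Eligibility", so the heading path is embedded together with the body, and the same path is stored separately for filtering and citations.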

Semantic chunking: split when meaning shifts

Semantic chunking detects topic boundaries by comparing sentence (or paragraph) embeddings: when cosine similarity between consecutive units drops below a threshold, start a new chunk. Prose that wanders from product overview to pricing to security compliance gets three coherent units instead of one muddy vector or three arbitrary cuts through the pricing table.
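The boundary rule can be sketched in a few lines. To stay self-contained, the example uses a toy bag-of-words vector (toy_embed) in place of a real sentence-embedding model; that function, the threshold value, and the name semantic_chunks are all assumptions for illustration.

```python
import math
import re
from collections import Counter


def toy_embed(sentence):
    """Stand-in embedding (assumption): a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def semantic_chunks(sentences, embed=toy_embed, threshold=0.2):
    """Start a new chunk when similarity to the previous sentence drops below threshold."""
    if not sentences:
        return []
    vectors = [embed(s) for s in sentences]
    chunks = [[sentences[0]]]
    for prev, cur, sentence in zip(vectors, vectors[1:], sentences[1:]):
        if cosine(prev, cur) < threshold:
            chunks.append([sentence])  # meaning shifted: open a new chunk
        else:
            chunks[-1].append(sentence)
    return [" ".join(chunk) for chunk in chunks]
```

Swapping toy_embed for a real model only requires a function that returns vectors the similarity function can compare; the threshold then needs retuning for that model and corpus.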

Trade-offs:

  • Pros: higher recall on long unstructured text (essays, transcripts, research papers); fewer "right topic, wrong paragraph" misses.
  • Cons: extra embedding calls at ingest time; threshold tuning per corpus; boundaries that shift whenever the embedding model or threshold changes, so chunks are not stable across re-ingests with a new model.

Practical pattern: use semantic chunking for ingest of messy long-form content, cap maximum chunk size (merge micro-chunks, split giants), and fall back to recursive split for sections that still exceed the budget. Pair with the same embedding model you use for retrieval so ingest and query live in one vector space — see LLM embeddings explained for model choice and dimension trade-offs.

Parent-child and small-to-big retrieval

A common production pattern indexes small child chunks for precise retrieval but returns larger parent chunks (or full sections) to the LLM for generation:

  1. Split each document into parent sections (1,000–2,000 tokens).
  2. Split each parent into child chunks (200–400 tokens) with overlap.
  3. Embed and index children in your vector database.
  4. On query: search children, map hits to parent IDs, deduplicate, pass parent text to the generator.

Children maximize recall@k — tight vectors match specific facts. Parents restore context — definitions, qualifiers, and tables adjacent to the hit. Some systems use a two-hop "small-to-big" expand: retrieve small, then fetch surrounding window (±N tokens) from the raw document store without re-embedding parents.

Metadata must link child_id → parent_id → document_id. Without that graph, you cannot expand or cite correctly after ANN search returns only child UUIDs.
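The four steps and the ID graph can be sketched as below. Everything here is illustrative: the Child record, build_index, and retrieve_parents are hypothetical names, word windows stand in for token-based splitting (without overlap, for brevity), and a shared-word count stands in for vector similarity search.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Child:
    child_id: str
    parent_id: str
    document_id: str
    text: str


def build_index(document_id, parents, child_words=8):
    """Split each parent section into word-window children that point back to it."""
    parent_store, children = {}, []
    for p, parent_text in enumerate(parents):
        parent_id = f"{document_id}:p{p}"
        parent_store[parent_id] = parent_text
        words = parent_text.split()
        for c, start in enumerate(range(0, len(words), child_words)):
            child_text = " ".join(words[start:start + child_words])
            children.append(Child(f"{parent_id}:c{c}", parent_id, document_id, child_text))
    return parent_store, children


def overlap_score(query, text):
    """Stand-in for vector similarity (assumption): count of shared lowercase words."""
    return len(set(query.lower().split()) & set(text.lower().split()))


def retrieve_parents(query, parent_store, children, k=3):
    """Search children, map hits to parent IDs, deduplicate, return parent text."""
    hits = sorted(children, key=lambda ch: overlap_score(query, ch.text), reverse=True)[:k]
    seen, results = set(), []
    for hit in hits:
        if hit.parent_id not in seen:
            seen.add(hit.parent_id)
            results.append((hit.parent_id, parent_store[hit.parent_id]))
    return results
```

Because several top-ranked children often share a parent, the number of parents returned can be smaller than k, which is the deduplication step doing its job.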

Overlap, metadata, and deduplication

Overlap tuning

Overlap duplicates text across chunk boundaries so a query that aligns with the last sentence of chunk n still retrieves chunk n+1 containing the payoff. Too little overlap loses boundary facts; too much bloats the index and increases duplicate hits in top-k, wasting context slots. Start at 15% overlap; measure recall on a labeled set of questions whose answers straddle known boundaries.
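The index-size cost of overlap is simple arithmetic: each window advances by chunk size minus overlap. The helper below is a small sketch (chunk_count is a hypothetical name) for estimating how many chunks a corpus of a given token length produces.

```python
import math


def chunk_count(n_tokens, chunk_size, overlap):
    """Number of sliding windows needed to cover n_tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    if n_tokens <= chunk_size:
        return 1 if n_tokens > 0 else 0
    stride = chunk_size - overlap
    # One full window, then one more per stride needed to reach the end.
    return 1 + math.ceil((n_tokens - chunk_size) / stride)
```

For a 100,000-token corpus with 512-token chunks, this gives 196 chunks with no overlap, 230 with a 77-token (about 15%) overlap, and 390 with a 256-token (50%) overlap, so index size grows by roughly chunk size divided by stride.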

Metadata that survives retrieval

Minimum useful fields per chunk:

  • document_id, chunk_index, token_count
  • title, section_path, source_url
  • created_at / updated_at for freshness filtering
  • acl_tags or tenant ID for permission filtering before the LLM sees text

Optionally prepend a one-line context header to the embedded text: "From: API Reference / Authentication / OAuth scopes". That header is embedded with the body, improving match quality for ambiguous terms.
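One possible shape for a chunk record that carries these fields and builds the context header is sketched below. ChunkRecord, embedding_text, and visible_to are hypothetical names, and the whitespace token count is only a rough stand-in for the embedding model's tokenizer.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class ChunkRecord:
    document_id: str
    chunk_index: int
    text: str
    title: str
    section_path: list          # e.g. ["Authentication", "OAuth scopes"]
    source_url: str
    updated_at: datetime
    acl_tags: set = field(default_factory=set)

    @property
    def token_count(self):
        # Whitespace count is a rough stand-in for a real tokenizer (assumption).
        return len(self.text.split())

    def embedding_text(self):
        """Body prefixed with a one-line context header, as embedded at ingest."""
        header = "From: " + " / ".join([self.title, *self.section_path])
        return f"{header}\n{self.text}"


def visible_to(chunks, user_tags):
    """Keep only chunks whose ACL tags intersect the caller's tags."""
    allowed = set(user_tags)
    return [c for c in chunks if c.acl_tags & allowed]
```

Filtering with visible_to (or the equivalent metadata filter in the vector store) before prompt assembly keeps restricted text out of the LLM call entirely, rather than relying on the model to withhold it.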

Deduplication

Overlap and parent-child expansion can return near-identical passages. Deduplicate by document ID + character span before stuffing context, or merge chunks whose token sets have Jaccard similarity above 0.85. Duplicate chunks are a common cause of "the model ignored my context": the prompt fills with repeated boilerplate, which crowds out the passages that actually answer the question.
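Both deduplication passes can be sketched together. The chunk dictionary layout (document_id, start, end, text) and the function names are assumptions; this version drops the later near-duplicate rather than merging it, which is the simpler of the two options.

```python
def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def deduplicate(chunks, threshold=0.85):
    """Drop exact span repeats, then near-duplicates above the Jaccard threshold.

    The first occurrence wins, so pass chunks in ranked order.
    """
    seen_spans, kept, kept_tokens = set(), [], []
    for chunk in chunks:
        span = (chunk["document_id"], chunk["start"], chunk["end"])
        if span in seen_spans:
            continue  # same document and character span: exact repeat
        tokens = set(chunk["text"].lower().split())
        if any(jaccard(tokens, other) > threshold for other in kept_tokens):
            continue  # near-identical wording from another span or document
        seen_spans.add(span)
        kept.append(chunk)
        kept_tokens.append(tokens)
    return kept
```

The pairwise comparison is quadratic in the number of chunks, which is fine for the handful of passages in a single prompt but would need a different approach at index scale.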

Chunking by document type

Corpus type | Recommended approach | Typical size
FAQ / support articles | One chunk per Q&A pair or heading section | 100–400 tokens
Technical docs (API, SDK) | Heading-aware + code blocks atomic | 300–600 tokens
Legal / compliance PDFs | Layout parser + clause boundaries; parent-child | Parents 1,500+ tokens
Chat / ticket logs | One chunk per conversation turn or resolved thread | 200–800 tokens
Research papers | Semantic + section headings (abstract, methods, results) | 400–1,000 tokens
Tables / spreadsheets | Row groups or one chunk per table with schema summary | Varies; never split rows

When lexical keywords matter (SKUs, error codes, person names), combine chunking with hybrid search — BM25 over the same chunks plus vector ANN — so exact-token queries are not lost to embedding smoothing.
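The text above does not say how to combine the two result lists. One common choice, used here purely as an illustration, is reciprocal rank fusion (RRF), which needs only the ranked chunk IDs from each retriever. The chunk IDs below are made up, and k=60 is the conventional default for the RRF constant.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk IDs; a higher fused score ranks earlier."""
    scores = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


bm25_ranking = ["sku-table", "returns", "warranty"]     # hypothetical lexical hits
vector_ranking = ["warranty", "sku-table", "shipping"]  # hypothetical ANN hits
print(reciprocal_rank_fusion([bm25_ranking, vector_ranking]))
```

Chunks that appear in both lists accumulate score from each, so an exact-match hit from BM25 is not lost when the vector ranking places it lower.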

Evaluating chunk quality before you tune the LLM

Measure retrieval in isolation. Build 50–200 (question, gold_document_id, gold_span) triples from real user questions or SME labeling. For each question:

  • Recall@k — is the gold chunk in the top k results?
  • MRR (mean reciprocal rank) — how high does the first correct chunk rank?
  • Context precision — of the chunks you pass to the LLM, what fraction are relevant?

Swap chunk size, overlap, and strategy; hold embedding model and k constant. If recall@5 jumps from 0.55 to 0.82 by switching from 128-token fixed splits to heading-aware 512-token chunks, your bottleneck was ingestion — not GPT-4 vs a smaller model. Log chunk IDs in production; when users thumbs-down an answer, check whether the right chunk was ever retrieved.
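The three metrics are short enough to compute directly. In this sketch, each run is a pair of (retrieved chunk IDs in rank order, set of gold chunk IDs); the function names and that data layout are assumptions, and context precision is computed over the top k chunks as if those are what gets passed to the LLM.

```python
def recall_at_k(retrieved, gold, k):
    """1.0 if any gold chunk ID appears in the top k results, else 0.0."""
    return 1.0 if set(retrieved[:k]) & set(gold) else 0.0


def reciprocal_rank(retrieved, gold):
    """1 / rank of the first gold chunk, or 0.0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in gold:
            return 1.0 / rank
    return 0.0


def context_precision(passed, relevant):
    """Fraction of chunks passed to the LLM that are labeled relevant."""
    return sum(c in relevant for c in passed) / len(passed) if passed else 0.0


def evaluate(runs, k=5):
    """Average the per-question metrics over (retrieved_ids, gold_ids) pairs."""
    if not runs:
        raise ValueError("need at least one labeled question")
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, g, k) for r, g in runs) / n,
        "mrr": sum(reciprocal_rank(r, g) for r, g in runs) / n,
        "context_precision": sum(context_precision(r[:k], g) for r, g in runs) / n,
    }
```

Running evaluate once per chunking configuration, with the embedding model and k held fixed, gives directly comparable numbers for each ingestion change.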

Strategy decision table

Signal | Likely issue | Try this
Right doc, wrong paragraph | Chunks too large or poorly bounded | Smaller chunks, semantic or heading splits
Retrieved text lacks definitions | Chunks too small | Parent-child, overlap, heading prefixes
Tables garbled in answers | PDF text dump splitting rows | Layout-aware parser; one chunk per table
Exact SKU/code never retrieved | Embedding-only search | Hybrid BM25 + same chunk index
Duplicate paragraphs in context | High overlap or parent-child without dedup | Deduplicate by span; lower overlap
Answers fine on short docs, fail on PDFs | Format-blind splitting | Document-aware pipeline per MIME type

Common mistakes

  • One global chunk size for every corpus: FAQs want per-question units; papers want section-aware splits.
  • Embedding with model A, querying with model B: vectors from different models are not comparable, so re-embed everything when the embedding model changes.
  • Ignoring tokenizer mismatch: character counts do not equal token counts; depending on the embedding model or API, oversize chunks are either rejected or silently truncated at embed time.
  • No re-ingest on document update: stale chunks produce confident wrong answers; version metadata and incremental jobs are mandatory.
  • Skipping retrieval eval: prompt engineering cannot fix chunks that never surface in top-k.

Production checklist

  • Pick chunk strategy per document type (fixed, heading, semantic, parent-child).
  • Set size and overlap in tokens using the embedding model's tokenizer.
  • Attach document_id, section_path, URL, and timestamps to every chunk.
  • Index children, expand to parents (or windows) at query time if needed.
  • Deduplicate expanded context before LLM call; respect ACL filters on metadata.
  • Label 50+ QA pairs; track recall@k when you change chunk parameters.
  • Re-chunk and re-embed on source updates; log chunk IDs with each answer.
