Guide

LLM parent-child chunking explained

Harbor Legal’s contract handbook assistant answered “Can we terminate for convenience during the pilot phase?” with a paragraph about termination for cause in Section 14.2. The correct clause lived in Section 14.5 — three subsections away. The index used uniform 512-token chunks: the pilot carve-out sat at the end of one chunk and the convenience language at the start of the next. Dense retrieval surfaced the wrong half. On 120 attorney-curated questions, recall@5 was 71% but grounded answer accuracy was only 58% because retrieved passages lacked surrounding definitions.

After switching to parent-child chunking — embed small 200-token child units for search, then expand to 1,200-token parent sections before synthesis — recall@5 rose to 84% and accuracy to 79% with the same embedding model. This guide explains the precision-vs-context trade-off, index architecture, child and parent sizing, retrieval and deduplication flow, storage patterns, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist. It complements general chunking strategies, late chunking, and contextual retrieval for teams choosing how to split long documents.

Why one chunk size cannot win

Retrieval and generation want opposite things from text boundaries:

Small chunks (100–300 tokens) produce tight embedding vectors. A question about a specific fee, date, or exception matches one focused passage. Precision is high; noise is low.
Large chunks (800–2,000 tokens) preserve definitions, exceptions, cross-references, and numbered steps that span paragraphs. The model sees enough context to answer “under what conditions” questions without inventing qualifiers.

A single 512-token window forces a compromise: either you split mid-thought (hurting generation) or you keep paragraphs whole (diluting vectors with unrelated sentences). Parent-child chunking decouples the two jobs — also called small-to-big or hierarchical retrieval.

Architecture: two layers, one pointer

Every document is split twice at ingest:

Parents — logical sections (H2/H3 in a handbook, article sections in a wiki, clause groups in a contract). Typically 800–1,500 tokens. Parents are not embedded in the vector index (or stored as optional secondary embeddings).
Children — overlapping windows inside each parent. Typically 150–300 tokens with 10–20% overlap. Each child carries metadata: parent_id, child_index, source_uri, section_title, doc_version.

At query time: embed the question, search the child index, take top-k child hits, map to unique parent_id values, fetch parent text from object storage or a document store, deduplicate, rank parents by best child score, and pass parent bodies to the LLM. The model never sees the tiny child slice alone unless you explicitly want citations at sub-paragraph granularity.

Query → child ANN search (top 20)
      → parent_id lookup + dedup
      → fetch parent text (top 4–6 unique parents)
      → optional BM25 rerank on parents
      → LLM synthesis with citations

Designing child chunks

Children should be as small as precision allows without orphaning sentences:

Respect structure first — split on paragraph or list-item boundaries inside the parent; only use fixed token windows when prose has no headings.
Overlap — 32–64 tokens of overlap between adjacent children catches questions that sit on a boundary. More overlap increases index size linearly; sweep on eval data.
Enrich child text for embedding — prepend [Section: Termination > Pilot phase] or use contextual retrieval preambles on children only; parents stay clean for display.
Atomic units — tables, code blocks, and numbered procedures should not be split across children unless the parent itself is the atomic unit.

Harbor Legal used 220-token children with 40-token overlap inside parents bounded by handbook headings (never crossing an H2). Child count per parent averaged 4.2; the largest parent (indemnification) had eleven children.

Designing parent chunks

Parents define what the model reads after a hit:

Section-aligned boundaries — match how authors structured the source. A parent that merges unrelated H2s reintroduces noise.
Hard caps — if a section exceeds ~1,500 tokens, subdivide into sub-parents (H3-level) rather than embedding oversized parents at generation time.
Include local context headers — store breadcrumb: Handbook > Commercial terms > Termination in parent metadata for citation UI, not necessarily in the LLM prompt if token budget is tight.
Version pinning — parent blobs keyed by doc_id + version + parent_id so re-ingest does not serve stale children pointing at deleted parents.

Parents can also be indexed in a sparse BM25 collection for hybrid fusion: children win dense recall; parents catch exact statutory phrases that span multiple children. See hybrid search for fusion patterns.

Retrieval, deduplication, and ranking

A naive top-5 child search often returns five children from the same parent — wasting context slots. Production pipelines:

Retrieve 3×k children (e.g., 15 for k=5).
Group by parent_id; keep the max child score per parent.
Sort parents by that score; take top 4–6 parents within token budget.
Optionally rerank parent full text with a cross-encoder or reranker before synthesis.
Pass child_id of the best hit per parent for inline citation anchors.

Multi-parent answers (comparing two policy sections) naturally fetch several parents without agentic loops — as long as the question requires at most 3–4 sections.

Storage patterns

Pattern	Child store	Parent store	Notes
Dual collection	Vector DB (embeddings + parent_id)	Postgres / S3 JSON	Most common; clear separation of concerns
Payload-only parents	Vector DB with parent text in payload (not embedded)	Same row, flagged `is_child`	Fewer moving parts; watch payload size limits
Embedded parents (optional)	Child vectors	Parent vectors in second namespace	Fallback coarse search when child recall fails

Re-ingest must be atomic per document version: write parents first, then children, then flip a version pointer. Partial writes leave orphan children pointing at missing parents.

Harbor Legal handbook refactor

Corpus: 840-page commercial handbook, 2,400 pages of playbooks, quarterly version bumps. Previous pipeline: 512-token fixed chunks, single index, no overlap.

Metric	Before	After parent-child
recall@5 (child search)	71%	84%
Grounded answer accuracy	58%	79%
Wrong-section citations	23%	9%
Indexed vectors	48,200	62,100
Median context tokens (p50)	2,100	3,800
p95 retrieval latency	190 ms	260 ms

Changes: H2-aligned parents (~1,100 tokens median), 220-token children with 40-token overlap, contextual one-line preamble on children only, parent dedup before synthesis, cross-encoder rerank on top 8 parents. Accuracy gains outweighed the 37% vector count increase and modest latency bump.

Technique decision table

Approach	Best when	Weak when
Fixed-size single chunks	Short FAQs, <500 tokens per doc, uniform Q&A	Long handbooks, multi-step procedures, cross-references
Parent-child (small-to-big)	Structured long docs with clear sections; precision + context both matter	Chat logs, tweets, tiny snippets with no hierarchy
Semantic chunking only	Unstructured narrative without headings	Regulatory text where section numbers are legally meaningful
Late chunking	Long-context encoders; boundary context in embeddings	Cost-sensitive indexes, models without long embed windows
Contextual retrieval (enriched embed)	Children need situating text; parents stay human-readable	Extra LLM cost at index time is unacceptable
Large chunks only	Very small corpora, synthesis-heavy tasks	Fact lookup, SKU/policy ID search, high precision needs
Parent-child + hybrid BM25	Exact clause numbers, SKUs, error codes in long docs	Two indexes to maintain; tuning fusion weights

Common pitfalls

Children without parent backfill — searching children but passing child text to the LLM loses definitions; always expand to parent.
Parent boundaries that cross topics — merging two H2 sections into one parent because it fits 1,200 tokens reintroduces dilution.
No deduplication — five children from one parent consume the entire context window redundantly.
Stale parent_id after re-ingest — children updated without parent blob migration causes 404 context; version atomically.
Embedding parent and child at same size — defeats the purpose; children must be materially smaller.
Skipping eval on section-boundary questions — synthetic Q&A that never straddles chunk edges hides the failure mode parent-child fixes.
Overlapping parents — unlike children, parents should not overlap; use nested H3 sub-parents instead.

Production checklist

Define parent boundaries from document structure (headings, clauses, page sections) with a hard token cap per parent.
Generate overlapping children inside each parent; store parent_id on every child vector.
Store parent full text outside the vector index (or as non-embedded payload).
Implement child search → parent dedup → parent fetch → synthesis pipeline.
Optionally add contextual preamble on children at index time.
Version parents and children together on each ingest run.
Measure recall@k on children and grounded accuracy on parents separately.
Log which parent_ids were expanded for debugging wrong citations.
Cap parents per query to fit token budget; rerank when over limit.
Document when to escalate to agentic multi-hop retrieval for cross-doc synthesis.

Key takeaways

Small child chunks maximize retrieval precision; large parent passages preserve the context the LLM needs to answer correctly.
Parent-child chunking decouples embedding granularity from generation context — the standard fix when fixed 512-token windows miss boundaries or dilute vectors.
Always deduplicate child hits to unique parents before synthesis; otherwise redundant slices waste the context window.
Harbor Legal raised grounded accuracy from 58% to 79% with H2-aligned parents, 220-token children, and parent expansion — without changing the embedding model.
Pair parent-child with contextual child enrichment or hybrid BM25 when exact identifiers and paraphrase recall both matter.