Guide
LLM parent-child chunking explained
Harbor Legal’s contract handbook assistant answered “Can we terminate for convenience during the pilot phase?” with a paragraph about termination for cause in Section 14.2. The correct clause lived in Section 14.5 — three subsections away. The index used uniform 512-token chunks: the pilot carve-out sat at the end of one chunk and the convenience language at the start of the next. Dense retrieval surfaced the wrong half. On 120 attorney-curated questions, recall@5 was 71% but grounded answer accuracy was only 58% because retrieved passages lacked surrounding definitions.
After switching to parent-child chunking — embed small 200-token child units for search, then expand to 1,200-token parent sections before synthesis — recall@5 rose to 84% and accuracy to 79% with the same embedding model. This guide explains the precision-vs-context trade-off, index architecture, child and parent sizing, retrieval and deduplication flow, storage patterns, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist. It complements general chunking strategies, late chunking, and contextual retrieval for teams choosing how to split long documents.
Why one chunk size cannot win
Retrieval and generation want opposite things from text boundaries:
- Small chunks (100–300 tokens) produce tight embedding vectors. A question about a specific fee, date, or exception matches one focused passage. Precision is high; noise is low.
- Large chunks (800–2,000 tokens) preserve definitions, exceptions, cross-references, and numbered steps that span paragraphs. The model sees enough context to answer “under what conditions” questions without inventing qualifiers.
A single 512-token window forces a compromise: either you split mid-thought (hurting generation) or you keep paragraphs whole (diluting vectors with unrelated sentences). Parent-child chunking decouples the two jobs — also called small-to-big or hierarchical retrieval.
Architecture: two layers, one pointer
Every document is split twice at ingest:
- Parents — logical sections (H2/H3 in a handbook, article sections in a wiki, clause groups in a contract). Typically 800–1,500 tokens. Parents are not embedded in the vector index (or stored as optional secondary embeddings).
- Children — overlapping windows inside each parent.
Typically 150–300 tokens with 10–20% overlap. Each child
carries metadata:
parent_id,child_index,source_uri,section_title,doc_version.
At query time: embed the question, search the child index,
take top-k child hits, map to unique parent_id values, fetch
parent text from object storage or a document store, deduplicate, rank parents
by best child score, and pass parent bodies to the LLM. The model never sees
the tiny child slice alone unless you explicitly want citations at
sub-paragraph granularity.
Query → child ANN search (top 20)
→ parent_id lookup + dedup
→ fetch parent text (top 4–6 unique parents)
→ optional BM25 rerank on parents
→ LLM synthesis with citations
Designing child chunks
Children should be as small as precision allows without orphaning sentences:
- Respect structure first — split on paragraph or list-item boundaries inside the parent; only use fixed token windows when prose has no headings.
- Overlap — 32–64 tokens of overlap between adjacent children catches questions that sit on a boundary. More overlap increases index size linearly; sweep on eval data.
- Enrich child text for embedding — prepend
[Section: Termination > Pilot phase]or use contextual retrieval preambles on children only; parents stay clean for display. - Atomic units — tables, code blocks, and numbered procedures should not be split across children unless the parent itself is the atomic unit.
Harbor Legal used 220-token children with 40-token overlap inside parents bounded by handbook headings (never crossing an H2). Child count per parent averaged 4.2; the largest parent (indemnification) had eleven children.
Designing parent chunks
Parents define what the model reads after a hit:
- Section-aligned boundaries — match how authors structured the source. A parent that merges unrelated H2s reintroduces noise.
- Hard caps — if a section exceeds ~1,500 tokens, subdivide into sub-parents (H3-level) rather than embedding oversized parents at generation time.
- Include local context headers — store
breadcrumb: Handbook > Commercial terms > Terminationin parent metadata for citation UI, not necessarily in the LLM prompt if token budget is tight. - Version pinning — parent blobs keyed by
doc_id + version + parent_idso re-ingest does not serve stale children pointing at deleted parents.
Parents can also be indexed in a sparse BM25 collection for hybrid fusion: children win dense recall; parents catch exact statutory phrases that span multiple children. See hybrid search for fusion patterns.
Retrieval, deduplication, and ranking
A naive top-5 child search often returns five children from the same parent — wasting context slots. Production pipelines:
- Retrieve
3×kchildren (e.g., 15 for k=5). - Group by
parent_id; keep the max child score per parent. - Sort parents by that score; take top 4–6 parents within token budget.
- Optionally rerank parent full text with a cross-encoder or reranker before synthesis.
- Pass
child_idof the best hit per parent for inline citation anchors.
Multi-parent answers (comparing two policy sections) naturally fetch several parents without agentic loops — as long as the question requires at most 3–4 sections.
Storage patterns
| Pattern | Child store | Parent store | Notes |
|---|---|---|---|
| Dual collection | Vector DB (embeddings + parent_id) | Postgres / S3 JSON | Most common; clear separation of concerns |
| Payload-only parents | Vector DB with parent text in payload (not embedded) | Same row, flagged is_child |
Fewer moving parts; watch payload size limits |
| Embedded parents (optional) | Child vectors | Parent vectors in second namespace | Fallback coarse search when child recall fails |
Re-ingest must be atomic per document version: write parents first, then children, then flip a version pointer. Partial writes leave orphan children pointing at missing parents.
Harbor Legal handbook refactor
Corpus: 840-page commercial handbook, 2,400 pages of playbooks, quarterly version bumps. Previous pipeline: 512-token fixed chunks, single index, no overlap.
| Metric | Before | After parent-child |
|---|---|---|
| recall@5 (child search) | 71% | 84% |
| Grounded answer accuracy | 58% | 79% |
| Wrong-section citations | 23% | 9% |
| Indexed vectors | 48,200 | 62,100 |
| Median context tokens (p50) | 2,100 | 3,800 |
| p95 retrieval latency | 190 ms | 260 ms |
Changes: H2-aligned parents (~1,100 tokens median), 220-token children with 40-token overlap, contextual one-line preamble on children only, parent dedup before synthesis, cross-encoder rerank on top 8 parents. Accuracy gains outweighed the 37% vector count increase and modest latency bump.
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Fixed-size single chunks | Short FAQs, <500 tokens per doc, uniform Q&A | Long handbooks, multi-step procedures, cross-references |
| Parent-child (small-to-big) | Structured long docs with clear sections; precision + context both matter | Chat logs, tweets, tiny snippets with no hierarchy |
| Semantic chunking only | Unstructured narrative without headings | Regulatory text where section numbers are legally meaningful |
| Late chunking | Long-context encoders; boundary context in embeddings | Cost-sensitive indexes, models without long embed windows |
| Contextual retrieval (enriched embed) | Children need situating text; parents stay human-readable | Extra LLM cost at index time is unacceptable |
| Large chunks only | Very small corpora, synthesis-heavy tasks | Fact lookup, SKU/policy ID search, high precision needs |
| Parent-child + hybrid BM25 | Exact clause numbers, SKUs, error codes in long docs | Two indexes to maintain; tuning fusion weights |
Common pitfalls
- Children without parent backfill — searching children but passing child text to the LLM loses definitions; always expand to parent.
- Parent boundaries that cross topics — merging two H2 sections into one parent because it fits 1,200 tokens reintroduces dilution.
- No deduplication — five children from one parent consume the entire context window redundantly.
- Stale parent_id after re-ingest — children updated without parent blob migration causes 404 context; version atomically.
- Embedding parent and child at same size — defeats the purpose; children must be materially smaller.
- Skipping eval on section-boundary questions — synthetic Q&A that never straddles chunk edges hides the failure mode parent-child fixes.
- Overlapping parents — unlike children, parents should not overlap; use nested H3 sub-parents instead.
Production checklist
- Define parent boundaries from document structure (headings, clauses, page sections) with a hard token cap per parent.
- Generate overlapping children inside each parent; store
parent_idon every child vector. - Store parent full text outside the vector index (or as non-embedded payload).
- Implement child search → parent dedup → parent fetch → synthesis pipeline.
- Optionally add contextual preamble on children at index time.
- Version parents and children together on each ingest run.
- Measure recall@k on children and grounded accuracy on parents separately.
- Log which parent_ids were expanded for debugging wrong citations.
- Cap parents per query to fit token budget; rerank when over limit.
- Document when to escalate to agentic multi-hop retrieval for cross-doc synthesis.
Key takeaways
- Small child chunks maximize retrieval precision; large parent passages preserve the context the LLM needs to answer correctly.
- Parent-child chunking decouples embedding granularity from generation context — the standard fix when fixed 512-token windows miss boundaries or dilute vectors.
- Always deduplicate child hits to unique parents before synthesis; otherwise redundant slices waste the context window.
- Harbor Legal raised grounded accuracy from 58% to 79% with H2-aligned parents, 220-token children, and parent expansion — without changing the embedding model.
- Pair parent-child with contextual child enrichment or hybrid BM25 when exact identifiers and paraphrase recall both matter.
Related reading
- RAG chunking strategies explained — fixed, semantic, and structure-aware splitting overview
- LLM late chunking explained — alternative that bakes boundary context into embeddings
- LLM contextual retrieval explained — index-time enrichment often applied to child chunks
- Hybrid search explained — BM25 plus dense fusion on child or parent indexes