LLM agentic chunking explained

Harbor Legal's employee handbook RAG index held 1,240 pages of policies, benefits tables, and jurisdiction-specific addenda. Engineers used fixed 512-token windows with 64-token overlap — the standard recipe from most vector DB tutorials. Retrieval recall looked fine on keyword probes, but wrong-section answers hit 29% on a 180-question golden set: a chunk about “parental leave in California” often started mid-sentence with “except as noted in Section 4.2” while the actual exception lived in the previous chunk. Embeddings could not fix boundaries that split atomic policy statements in half.

The team replaced blind token splitting with agentic chunking: a small LLM reads each document section and proposes self-contained proposition chunks (one claim or rule per unit), and a validator agent rejects splits that break referential integrity. Combined with contextual headers and a parent-child index, this cut wrong-section retrieval from 29% to 7% and raised answer faithfulness on the same holdout from 71% to 86%. This guide covers the agentic chunking pipeline, proposition extraction prompts, boundary validation, overlap policy, the Harbor Legal refactor, a technique decision table versus fixed and semantic chunking and parent-child indexes, pitfalls, and a production checklist.

What agentic chunking changes in the ingest stack

Traditional chunking is a deterministic function of tokens or characters. Agentic chunking inserts an LLM boundary agent between parsed text and the embedding step:

  1. Structure parse — headings, lists, tables, and footnotes from layout-aware extraction (see document ingestion).
  2. Section scoping — feed one logical section (e.g. H2 subtree) per agent call; never whole PDFs in one prompt.
  3. Proposition emit — model outputs JSON array of chunks, each with text, summary title, and dependency hints (“requires Section 4.2”).
  4. Validator pass — second prompt or rule engine rejects chunks that start with dangling pronouns, split conditionals, or orphan table rows.
  5. Overlap stitch — optional 1–2 sentence bridge copied from adjacent chunks when cross-references demand it.
  6. Embed + index — same bi-encoder pipeline as any RAG stack; chunk quality is upstream.

The agent does not retrieve at query time. It runs once per document version during ingest (or on incremental index updates), which makes cost predictable if you batch by section.

Proposition chunks vs token windows

A proposition chunk is the smallest text unit that can answer one factual question without external context. Examples from Harbor Legal:

  • Good: “California employees accrue 12 weeks parental leave per birth or adoption, paid at 60% base salary after a 7-day waiting period.”
  • Bad (split mid-rule): “California employees accrue 12 weeks parental leave per birth or adoption, paid at 60% base salary [chunk ends] after a 7-day waiting period unless the employee is part-time as defined in Section 4.2.”

Fixed-size chunking cannot distinguish these cases without expensive overlap (which duplicates embeddings and confuses ranking). Embedding-based semantic clustering groups similar sentences but still merges unrelated bullets when cosine distance is low. The boundary agent encodes linguistic rules: complete conditionals, intact list items, table rows as atomic units, and explicit cross-ref metadata instead of hoping overlap catches orphans.
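To make the mid-rule split concrete, here is a minimal sketch of a fixed-window splitter applied to the sentence from the "Bad" example. It assumes whitespace-separated words stand in for model tokens, and the window size of 16 with overlap 2 is scaled down purely for illustration; `fixed_windows` is a hypothetical helper, not Harbor Legal's code.

```python
def fixed_windows(tokens, size, overlap):
    """Split a token list into fixed-size windows that share `overlap` tokens."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    windows = []
    for start in range(0, len(tokens), step):
        windows.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # the last window already reaches the end of the text
    return windows


rule = (
    "California employees accrue 12 weeks parental leave per birth or adoption, "
    "paid at 60% base salary after a 7-day waiting period unless the employee "
    "is part-time as defined in Section 4.2."
)
# Assumption: whitespace words approximate tokens for this illustration.
tokens = rule.split()
windows = fixed_windows(tokens, size=16, overlap=2)

for window in windows:
    print(" ".join(window))
```

The first window stops at "salary" and the waiting-period condition lands in the next one, which is the failure the boundary agent is meant to prevent. Raising the overlap narrows the gap but duplicates more text in the index.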

Typical agent output schema

{
  "chunks": [
    {
      "id": "handbook-ca-leave-01",
      "title": "CA parental leave accrual",
      "text": "California employees accrue 12 weeks...",
      "refs": ["section-4.2-part-time"],
      "token_estimate": 48
    }
  ]
}

Store refs as metadata for pre-filter routing or for a synthesis step that pulls linked parent sections when a chunk lists unresolved dependencies.
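One way to consume that output is sketched below: parse the JSON, reject chunks with missing fields, and carry refs into index metadata. The `Chunk` dataclass, `parse_agent_output`, and `to_index_record` are hypothetical names, and the 256-token cap check is an assumption borrowed from the budget described later in this guide.

```python
import json
from dataclasses import dataclass, field


@dataclass
class Chunk:
    id: str
    title: str
    text: str
    refs: list = field(default_factory=list)
    token_estimate: int = 0


def parse_agent_output(raw, hard_cap=256):
    """Parse the agent's JSON and reject chunks that violate the schema."""
    payload = json.loads(raw)
    chunks = []
    for item in payload["chunks"]:
        missing = [key for key in ("id", "title", "text") if not item.get(key)]
        if missing:
            raise ValueError(f"chunk is missing fields: {missing}")
        chunk = Chunk(
            id=item["id"],
            title=item["title"],
            text=item["text"],
            refs=list(item.get("refs", [])),
            token_estimate=int(item.get("token_estimate", 0)),
        )
        if chunk.token_estimate > hard_cap:
            raise ValueError(f"{chunk.id} exceeds the {hard_cap}-token cap")
        chunks.append(chunk)
    return chunks


def to_index_record(chunk):
    """Build the record handed to the embedder, keeping refs as metadata."""
    return {
        "id": chunk.id,
        "text": chunk.text,
        "metadata": {"title": chunk.title, "refs": chunk.refs},
    }


raw = json.dumps({"chunks": [{
    "id": "handbook-ca-leave-01",
    "title": "CA parental leave accrual",
    "text": "California employees accrue 12 weeks...",
    "refs": ["section-4.2-part-time"],
    "token_estimate": 48,
}]})
records = [to_index_record(c) for c in parse_agent_output(raw)]
print(records[0]["metadata"]["refs"])
```

Failing loudly on malformed output keeps bad chunks out of the index; a real pipeline might instead route the failure back to the draft step.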

Boundary-agent loop design

Production agentic chunking is not one-shot “split this PDF.” Harbor Legal used a three-step loop per section:

1. Draft split

The system prompt sets the role (“policy corpus chunker”), the output JSON schema, the chunk token budget (Harbor targeted 120–180 tokens with a hard cap of 256), and examples of valid and invalid splits from the same domain. The user message is the section text in markdown, with heading breadcrumbs supplied as metadata.

2. Validator critique

A second call (same or smaller model) receives draft chunks and flags: dangling “this/that/except”, split enumerations, table headers separated from rows, numeric thresholds cut from their units. Flagged chunks return to step 1 with critique appended — usually one retry suffices.
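Some of these flags can be approximated with plain rules before (or instead of) a second model call. The sketch below is a rough rule-based critic under that assumption; the regular expressions and flag names are illustrative guesses, not Harbor Legal's validator, and they will miss cases an LLM critic would catch.

```python
import re

# Openers that usually point back at text outside the chunk.
DANGLING_OPENER = re.compile(
    r"(this|that|these|those|it|they|except|however)\b", re.IGNORECASE
)
# A chunk that ends on a connective has lost the rest of its condition or list.
TRAILING_CONNECTIVE = re.compile(r"\b(unless|if|except|and|or)[\s,]*$", re.IGNORECASE)
# A bare number at the very end suggests the unit was cut off ("accrue 12").
TRAILING_NUMBER = re.compile(r"\b\d+(?:[.,]\d+)*\s*$")


def critique(text):
    """Return a list of flag names; an empty list means the chunk passes."""
    stripped = text.strip()
    flags = []
    if DANGLING_OPENER.match(stripped):
        flags.append("dangling_opener")
    if TRAILING_CONNECTIVE.search(stripped):
        flags.append("trailing_connective")
    if TRAILING_NUMBER.search(stripped):
        flags.append("number_without_unit")
    return flags


good = (
    "California employees accrue 12 weeks parental leave per birth or adoption, "
    "paid at 60% base salary after a 7-day waiting period."
)
print(critique(good))
print(critique("except as noted in Section 4.2, part-time staff are excluded."))
print(critique("Leave is paid at 60% base salary unless"))
```

The returned flags can be appended to the retry prompt as the critique text. Split enumerations and separated table headers need structural information from the layout parse, so they are left out of this sketch.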

3. Merge micro-chunks

If the draft emits fragments under 40 tokens (“See Section 4.2.”), a deterministic merge step attaches them to the preceding chunk or expands refs without re-running the full agent. This prevents index noise from stub chunks.
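The merge step can be sketched in a few lines. This version only covers the "attach to the preceding chunk" branch and assumes whitespace word counts as a stand-in for token counts; `merge_stubs` and `count_tokens` are hypothetical helpers, and the example uses a floor of 5 rather than 40 to keep the sample text short.

```python
def count_tokens(text):
    # Assumption: whitespace words approximate tokens well enough for a floor check.
    return len(text.split())


def merge_stubs(chunks, min_tokens=40):
    """Attach any chunk under the token floor to the chunk before it."""
    merged = []
    for chunk in chunks:
        if merged and count_tokens(chunk) < min_tokens:
            merged[-1] = merged[-1] + " " + chunk
        else:
            # A stub in first position has no predecessor, so it is kept as-is.
            merged.append(chunk)
    return merged


draft = [
    "California employees accrue 12 weeks parental leave per birth or adoption.",
    "See Section 4.2.",
    # Hypothetical sentence, included only to show that normal chunks pass through.
    "Requests must be submitted in writing before the leave begins.",
]
result = merge_stubs(draft, min_tokens=5)
print(result)
```

Because the step is deterministic, re-running it on the same draft yields the same chunks, which keeps it out of the model-retry path.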

Token cost at Harbor: ~$0.04 per handbook page on an 8B instruct model with JSON mode. That is acceptable for a 1,240-page corpus that updates quarterly, because chunking runs once per document version at ingest rather than on each of the daily queries.
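As a back-of-the-envelope check, the per-page figure can be turned into a corpus-level estimate. The calculation below assumes a full re-chunk on every quarterly update, which is an upper bound if only changed sections are re-chunked; the variable names are illustrative.

```python
pages = 1240            # handbook size from the case study
cost_per_page = 0.04    # approximate USD per page, as quoted above
runs_per_year = 4       # assumption: one full re-chunk per quarterly update

full_rechunk_cost = pages * cost_per_page
annual_upper_bound = full_rechunk_cost * runs_per_year

print(f"full re-chunk: ${full_rechunk_cost:.2f}")
print(f"annual upper bound: ${annual_upper_bound:.2f}")
```

That works out to roughly $50 per full pass and about $200 per year under these assumptions, before validator retries.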

Overlap, context enrichment, and parent passages

Agentic chunks are intentionally smaller than fixed 512-token windows. Precision improves; synthesis sometimes needs wider context. Harbor stacked three patterns (not mutually exclusive):

  • Bridge overlap — copy the final sentence of chunk n into the start of chunk n+1 only when chunk n+1 begins with “However,” “Except,” or a pronoun referring backward.
  • Contextual retrieval headers — prepend Document > Section > Subsection to each chunk before embedding (see contextual retrieval).
  • Parent-child index — embed small proposition children; store full H2 section as parent for generator context after retrieval hits a child (pairs naturally with parent-child chunking).

Sentence-window retrieval is an alternative when propositions are already single-sentence granular; agentic chunking generalizes to multi-sentence rules that must stay together.
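The bridge-overlap and breadcrumb patterns above are both small string transforms. The sketch below assumes a naive sentence splitter (it breaks on ".", "!" or "?" followed by whitespace, so abbreviations such as "e.g." would confuse it) and a short opener list; `add_bridges` and `with_breadcrumb` are hypothetical helper names.

```python
import re

# Openers that signal the chunk depends on the sentence before it.
BACKWARD_OPENER = re.compile(
    r"(however|except|this|that|these|those|it|they)\b", re.IGNORECASE
)
SENTENCE_BREAK = re.compile(r"(?<=[.!?])\s+")


def last_sentence(text):
    """Return the final sentence of a chunk using a naive splitter."""
    return SENTENCE_BREAK.split(text.strip())[-1]


def add_bridges(chunks):
    """Prefix a chunk with the previous chunk's last sentence when it opens backward."""
    bridged = []
    for i, chunk in enumerate(chunks):
        body = chunk.strip()
        if i > 0 and BACKWARD_OPENER.match(body):
            # Bridge from the original neighbour, not the already-bridged one.
            body = last_sentence(chunks[i - 1]) + " " + body
        bridged.append(body)
    return bridged


def with_breadcrumb(chunk, path):
    """Prepend 'Document > Section > Subsection' before embedding."""
    return " > ".join(path) + "\n" + chunk


# The chunk texts and heading path are hypothetical examples.
chunks = [
    "Leave requests go to HR. California employees accrue 12 weeks parental "
    "leave per birth or adoption.",
    "However, part-time employees are excluded.",
]
bridged = add_bridges(chunks)
print(with_breadcrumb(bridged[1], ["Employee Handbook", "Leave", "California"]))
```

Bridging only on backward-looking openers keeps duplication low compared with blanket overlap, at the cost of missing references that do not start the chunk.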

Harbor Legal refactor (worked example)

Before: 512/64 fixed chunks, bi-encoder e5-large, hybrid BM25 fusion, no reranker. Wrong-section top-1 rate 29%; faithfulness 71%.

After agentic chunking only (same embedder, same index): wrong-section 11%; faithfulness 79%.

After agentic + contextual headers + parent-child: wrong-section 7%; faithfulness 86%; index size +18% (more, smaller chunks).

What did not help: running the boundary agent on raw PDF text without layout parse — tables were mis-read and produced nonsense propositions. Ingest quality gates matter more than prompt tuning.

Latency note: full handbook re-chunk took 4.2 hours batch on one GPU worker; incremental updates re-chunk only changed sections via content hash diff.
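A content-hash diff of this kind can be sketched with the standard library. The manifest layout (section id mapped to a SHA-256 digest), the whitespace normalization, and the section ids are assumptions for illustration, not details from the Harbor pipeline.

```python
import hashlib


def section_hash(text):
    """Hash a section after collapsing whitespace so reflowed text is not a change."""
    normalized = " ".join(text.split())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def sections_to_rechunk(previous_manifest, current_sections):
    """Compare stored hashes with current text.

    Returns (changed_or_new_ids, removed_ids).
    """
    changed = [
        section_id
        for section_id, text in current_sections.items()
        if previous_manifest.get(section_id) != section_hash(text)
    ]
    removed = [sid for sid in previous_manifest if sid not in current_sections]
    return changed, removed


# Hypothetical section ids and text.
previous = {
    "leave/ca": section_hash("A  b"),
    "leave/ny": section_hash("old text"),
    "leave/retired": section_hash("x"),
}
current = {"leave/ca": "A b", "leave/ny": "new text", "leave/tx": "added"}

changed, removed = sections_to_rechunk(previous, current)
print(changed, removed)
```

Only the ids in `changed` need another pass through the boundary agent; chunks belonging to ids in `removed` can be deleted from the index.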

Technique decision table

| Your corpus looks like | Prefer | Why not agentic alone |
|---|---|---|
| Uniform prose blogs, news | Fixed or semantic clustering | Agent cost adds little over good overlap |
| Policies, contracts, compliance manuals | Agentic proposition chunking | n/a (agentic is the recommended choice) |
| Source code repositories | AST / function chunking | LLM boundaries miss symbol scope; use codebase RAG |
| Tables and financial filings | Table-aware row chunking + agent for prose | Agent on raw tables hallucinates cell groupings |
| Very large sections (>8k tokens) | Structure split first, agent per subsection | Single-call context limits and quality drift |
| Sub-second ingest SLA, billions of docs | Fixed chunking + reranker at query time | Agentic is offline-batch friendly, not real-time firehose |

Stack order that worked at Harbor: layout parse → agentic chunk → contextual headers → parent-child index → hybrid retrieval → cross-encoder rerank.

Common pitfalls

  • Whole-document prompts — quality collapses past ~6k tokens; scope by heading tree.
  • No validator pass — draft splits look plausible in JSON but fail on dangling references; always audit a sample manually.
  • Stub chunks — “See above” fragments pollute ANN space; merge or drop under token floor.
  • Skipping layout parse — multi-column PDFs and tables produce garbage propositions.
  • Non-deterministic ingest — temperature >0 on boundary agent changes chunk IDs between runs; use temperature 0 and version chunk manifests.
  • Agentic without eval — smaller chunks can hurt recall on broad questions; measure section accuracy and recall@k together.
  • Copying headings as chunk titles — titles must describe the chunk's content, not just repeat the section heading; near-identical titles hurt embedding discrimination.
  • Ignoring cross-doc refs — store refs metadata; otherwise synthesis invents bridge text between policies.
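For the evaluation pitfall, the two numbers can be computed from the same golden-set run. The sketch below assumes each result is a pair of the gold section id and the ranked section ids of the retrieved chunks, and it defines wrong-section rate as a top-1 miss; both the data shape and that definition are assumptions, and the sample results are made up.

```python
def wrong_section_rate(results):
    """Fraction of questions whose top-1 chunk comes from the wrong section."""
    if not results:
        raise ValueError("results must not be empty")
    wrong = sum(1 for gold, ranked in results if not ranked or ranked[0] != gold)
    return wrong / len(results)


def recall_at_k(results, k):
    """Fraction of questions whose gold section appears in the top k chunks."""
    if not results:
        raise ValueError("results must not be empty")
    hits = sum(1 for gold, ranked in results if gold in ranked[:k])
    return hits / len(results)


# Hypothetical golden-set results: (gold section id, ranked retrieved section ids).
results = [
    ("4.1", ["4.1", "4.2"]),
    ("4.2", ["4.1", "4.2"]),
    ("5.0", ["3.1", "2.2"]),
    ("6.3", ["6.3"]),
]
print(wrong_section_rate(results))
print(recall_at_k(results, k=2))
```

Tracking both together shows whether smaller chunks improved top-1 precision at the expense of coverage on broad questions.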

Production checklist

  • Parse structure (headings, tables, lists) before any LLM chunk call.
  • Scope agent input to one H2/H3 subtree or <4k tokens.
  • Define JSON schema with title, text, refs, token_estimate fields.
  • Run validator pass on dangling references and split conditionals.
  • Merge or drop chunks under minimum token threshold (e.g. 40 tokens).
  • Add contextual breadcrumbs or parent-child for synthesis context.
  • Version chunk manifests with document content-hash for incremental re-ingest.
  • Benchmark wrong-section rate and faithfulness, not just MRR.
  • Keep temperature 0 on boundary agents; log prompts for regression tests.
  • Fall back to structure-aware fixed chunks if agent fails validation twice.
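The draft, validate, and fallback items can be wired together as a small control loop. In the sketch below the three callables are injected so that no particular model API is assumed; `chunk_section` and the stub functions are hypothetical and only demonstrate the retry-then-fallback flow.

```python
def chunk_section(section_text, draft, validate, fallback, max_attempts=2):
    """Draft chunks, retry with the critique, and fall back after repeated failure.

    draft(text, critique) -> list of chunk strings
    validate(chunks)      -> list of critique strings (empty means pass)
    fallback(text)        -> list of chunk strings from a non-LLM splitter
    """
    critique = []
    for _ in range(max_attempts):
        chunks = draft(section_text, critique)
        critique = validate(chunks)
        if not critique:
            return chunks, "agent"
    return fallback(section_text), "fallback"


# Stubs standing in for the model calls and the structure-aware splitter.
def stub_draft(text, critique):
    # Pretend the model keeps the rule whole once it has seen a critique.
    return [text] if critique else text.split(" unless ")


def stub_validate(chunks):
    return ["split conditional"] if len(chunks) > 1 else []


def stub_fallback(text):
    return [text]


chunks, source = chunk_section(
    "Leave is paid unless the employee is part-time.",
    stub_draft, stub_validate, stub_fallback,
)
print(source, chunks)
```

Returning the source label alongside the chunks makes it easy to log how often the fallback fires, which is a useful regression signal after prompt changes.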

Key takeaways

  • Agentic chunking uses an LLM offline to propose semantically complete chunk boundaries, not at query time.
  • Proposition chunks keep conditionals, list items, and table rows intact — the main win over token windows.
  • A validator pass catches dangling references that embeddings cannot repair.
  • Harbor Legal cut wrong-section retrieval from 29% to 7% without changing the embedder.
  • Pair agentic children with parent passages or contextual headers when chunks are intentionally small.
