Guide

LLM late chunking explained

Harbor Legal indexed 14,200 vendor contracts with a standard fixed-size chunker and a strong bi-encoder. Lawyers searched for “indemnification cap mutual carve-out” and got paragraph 7.4 — correct section, wrong meaning, because the definitions in section 1 never entered the embedding. Recall@5 on the golden query set sat at 71%. The team switched to late chunking: embed each full contract section in one forward pass, then mean-pool token vectors at paragraph boundaries. Recall@5 rose to 89% with the same chunk sizes, reranker, and vector database. Index time per document increased 2.3×, but legal search quality crossed the bar for production.

Late chunking (also called contextual chunk embeddings or embed-then-split) reverses the usual RAG ingest order. Classic early chunking splits text first and embeds each chunk in isolation. Late chunking runs a long-context encoder over a parent span — a section, chapter, or full page — and derives per-chunk vectors from token-level representations that already saw surrounding context. This guide explains why isolated chunks lose semantic signal, how late chunking works mechanically, encoder and context-window requirements, implementation patterns with modern embedding APIs, the Harbor Legal refactor, a technique decision table versus contextual retrieval and parent-child indexes, pitfalls, and a production checklist — building on embedding fundamentals and batch inference.

Early chunking vs late chunking

Most RAG tutorials teach early chunking: split, embed, upsert. Each chunk string is tokenized and passed through a bi-encoder independently. Attention inside the encoder is bidirectional, but only within that chunk. A liability cap paragraph that says “subject to Section 1.3” encodes those words without the definition of “Losses” three pages earlier.

Late chunking keeps the same chunk boundaries for retrieval granularity but changes when embeddings are computed:

Stage	Early chunking	Late chunking
1. Boundary detection	Split document into chunks	Split document into chunks (same step)
2. Encoder input	Each chunk alone	Parent span containing multiple chunks (section/page)
3. Vector output	One vector per chunk from isolated forward pass	One vector per chunk by pooling token embeddings at boundaries
Context in vector	Only text inside chunk	Full parent span attended before pooling

Late chunking is not a replacement for good chunking strategy — garbage boundaries still produce garbage units. It fixes a different failure mode: vectors that are locally coherent but globally ambiguous.

How late chunking works mechanically

Modern embedding models (Jina embeddings v3, Nomic Embed v1.5, some E5 and BGE variants) expose token-level hidden states or a late_chunking flag in their API. The pipeline:

Choose a parent span — typically a logical section under the encoder’s context limit (2k–8k tokens for many models).
Tokenize the full parent — one forward pass produces a hidden vector per token (or per subword).
Map chunk boundaries to token ranges — record start/end token indices for each chunk inside the parent.
Pool within each range — mean-pooling is standard; some implementations use last-token or CLS-style pooling at chunk end.
L2-normalize — match your index metric (cosine vs dot).

Pooling strategies

Method	Behavior	When to use
Mean pool	Average all token vectors in chunk range	Default; stable for variable-length chunks
Last-token pool	Use final subword vector in range	Models trained with EOS-style pooling
Weighted pool	Down-weight padding and special tokens	Long chunks with heavy truncation risk

Query-time embedding stays unchanged: the user question is short and does not need late chunking. Only the index path uses parent-span encoding. This asymmetry is fine — bi-encoders are already asymmetric when models use query: / passage: prefixes.

Encoder and context-window requirements

Late chunking only helps when the parent span actually fits in one forward pass. A 512-token embedding model forced to truncate a 4,000-token section behaves like early chunking on the truncated tail.

Context length — parent span must be ≤ model max_length. Size parents to sections, not whole 80-page PDFs.
Token-level output support — sentence-transformers output_hidden_states=True, TEI with late-chunking endpoint, or vendor APIs with dimensions + pooling controls.
Deterministic tokenization — boundary indices must use the same tokenizer as the encoder; off-by-one misalignment smears vectors.
Memory — one long forward pass uses more activation memory than many short ones; see batch sizing.

If sections routinely exceed 8k tokens, combine late chunking with hierarchical parents: late-chunk within each section, store section metadata for filtering, or fall back to contextual retrieval preambles on early-chunked vectors.

Harbor Legal contract ingest refactor

Harbor Legal’s corpus is Markdown exported from Word: nested headings, cross-references, and defined terms. Early chunking at 400 tokens with 50-token overlap produced 312,000 index rows. Cross-reference phrases (“as defined in Article I”) appeared in chunks whose definitions lived elsewhere.

The refactor:

Split on h2/h3 boundaries first (structure-aware parents).
Sub-split parents into 350–450 token chunks for retrieval granularity.
Ran Jina embeddings v3 with late_chunking=True per parent (max 8,192 tokens).
Mean-pooled token vectors at sub-chunk boundaries; upserted to existing Qdrant collection (same 1,024 dimensions).
Reused the same cross-encoder reranker; only index vectors changed.

Results on 240 lawyer-annotated queries:

Metric	Early chunking	Late chunking
Recall@5	71%	89%
MRR@10	0.54	0.72
Ingest time / doc	1.0× baseline	2.3× baseline
GPU memory peak (ingest)	4.1 GB	6.8 GB

The team kept early chunking for FAQ snippets under 200 tokens where context dependency is minimal — a hybrid ingest path is common.

Technique decision table

Your situation	Prefer	Why not late chunking alone
Chunks reference distant definitions (legal, policy, specs)	Late chunking within sections	—
Parents exceed encoder context	Contextual retrieval preambles or parent-child index	Truncation drops the context you need
Need sub-100ms ingest on huge corpus	Early chunking + batch inference	Late chunking costs more per parent forward pass
Retrieval unit is whole page, not paragraph	Early chunking at page level	Pooling adds complexity without benefit
Model lacks token-level API	Contextual retrieval or query expansion (HyDE)	Cannot pool without hidden states
Code with import-dependent chunks	Late chunking per file or function block	Early chunks miss symbol definitions

Late chunking and contextual retrieval solve overlapping problems differently: late chunking bakes context into the vector via attention; contextual retrieval prepends generated text before a standard embed. Stacking both without ablation often yields diminishing returns and doubles ingest cost.

Pitfalls

Parent span too large — silent truncation at max_length drops tail chunks back to context-free embeddings.
Boundary tokenizer mismatch — character offsets from a Markdown splitter mapped with the wrong tokenizer misaligns pools.
Pooling padding tokens — including pad positions in mean pool dilutes vectors; mask pads explicitly.
Applying late chunking to queries — wastes compute; queries are already short and self-contained.
Ignoring overlap semantics — overlapping chunks in the same parent share token representations; dedupe at retrieval if needed.
Skipping ablation — late chunking costs more; measure recall@k on a golden set before full reindex.
Wrong parent granularity — whole-document parents on 50-page PDFs fail; section-level parents work.

Production checklist

Audit golden queries where chunks reference out-of-span definitions.
Confirm embedding model supports token-level output or late-chunking API.
Define parent spans by document structure (heading, page, slide).
Verify parent token count ≤ max_length with headroom.
Map chunk boundaries to token indices with the encoder’s tokenizer.
Mean-pool with padding masked; L2-normalize consistently with index metric.
Keep query embedding path standard (no late chunking at search time).
Benchmark recall@k vs early chunking before full corpus reindex.
Monitor ingest GPU memory; reduce parent size or batch width if OOM.
Log truncation events per parent span.
Hybrid path: early chunk for short standalone snippets, late chunk for dense sections.

Key takeaways

Early chunking embeds isolated text; late chunking pools token vectors after the encoder sees the full parent span.
Cross-references and defined terms are the classic failure mode late chunking fixes.
Parent spans must fit the encoder context window — section-level parents beat whole-document passes.
Harbor Legal recall@5 rose from 71% to 89% with the same chunks and reranker; ingest time rose 2.3×.
Compare against contextual retrieval and parent-child indexes before stacking techniques.