
LLM RAPTOR hierarchical retrieval explained

Harbor Analytics indexed 1,800 equity-research PDFs for a buy-side desk. Analysts asked two kinds of questions: narrow ones (“What was Acme’s Q3 gross margin guidance?”) and thematic ones (“How did semiconductor capex commentary shift across the sector after the August export controls?”). Flat RAG over 512-token chunks hit 71% recall@10 on entity-specific queries but only 54% on cross-report themes — embeddings of individual paragraphs rarely matched a question phrased at portfolio level.

RAPTOR (Recursive Abstractive Processing for Tree-Organized Retrieval, Sarthi et al., 2024) builds a retrieval tree at index time: leaf nodes are embedded chunks, and each internal node is an LLM-written summary of a cluster of semantically similar nodes from the level below. At query time, search runs across all levels of the tree, so a thematic question can match a high-level summary while a detail question still lands on raw text. Harbor Analytics shipped RAPTOR alongside their existing dense index; thematic recall@10 rose from 54% to 83%, while entity-specific recall dipped from 71% to 69% and recovered to 72% once BM25 hybrid fusion was added. This guide covers recursive clustering, abstractive summarization, collapsed-tree retrieval, the Harbor refactor, a technique decision table versus Graph RAG and parent-child chunking, pitfalls, and a production checklist.

What RAPTOR adds beyond flat chunk retrieval

Standard vector RAG embeds fixed-size chunks and retrieves the top-k by cosine similarity. That works when the answer lives in one or two chunks with overlapping vocabulary. It struggles when:

Questions are abstract — “What macro risks did analysts flag?” does not lexically overlap with any single paragraph.
Evidence spans many chunks — sector themes appear as small mentions across dozens of reports.
Detail and theme coexist — the same pipeline must answer both “list every mention of inventory write-downs” and “summarize the inventory narrative.”
Chunk boundaries split concepts — a thesis paragraph cut in half loses embedding signal.

RAPTOR precomputes a hierarchy of summaries so retrieval can jump to the right abstraction level. Unlike query-time map-reduce, which re-summarizes on every question, RAPTOR amortizes summarization at index time and reuses the tree for all queries.

Index-time pipeline: from chunks to a retrieval tree

Leaf layer: chunk and embed

Split documents with your normal chunking strategy (often 256–512 tokens with overlap). Embed each chunk with the same model you will use at query time. These leaves are the finest retrieval granularity.
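
As a rough illustration of the leaf layer, the sketch below splits a token list into overlapping windows and assigns each chunk a stable ID. The function name, the ID format, and the default sizes are assumptions chosen for illustration; a real pipeline would use the tokenizer that matches its embedding model rather than a plain list of tokens.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into windows of `size` tokens that overlap by `overlap`."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        window = tokens[start:start + size]
        # Hypothetical ID scheme: position of the chunk within the document.
        chunks.append({"chunk_id": f"c{len(chunks)}", "start": start, "tokens": window})
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks
```

Each returned chunk would then be embedded with the same model used at query time; the embedding call is left out here because it depends on the provider.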

Recursive clustering

The original RAPTOR paper clusters leaf embeddings with a Gaussian mixture model (GMM), optionally after UMAP dimensionality reduction for stability on large corpora. Each cluster should be small enough that an LLM can summarize its members in one pass — typically 4–10 chunks per cluster at the first level. Clusters are not required to respect document boundaries; thematic clusters across reports are a feature, not a bug.
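
To make the grouping step concrete, here is a minimal dependency-free sketch. Note the substitution: it uses a small k-means loop with deterministic farthest-point initialization as a stand-in for the GMM (and optional UMAP) described above, so it shows only the shape of the step, not the paper's method. The function name and the target_size parameter are assumptions.

```python
import math

def _dist2(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def cluster_embeddings(embeddings, target_size=6, iters=20):
    """Group embeddings into about len/target_size clusters; returns lists of indices.

    k-means stand-in for the GMM step; cluster sizes are approximate, not capped.
    """
    n = len(embeddings)
    if n == 0:
        return []
    k = max(1, math.ceil(n / target_size))
    # Farthest-point initialization keeps the sketch deterministic.
    centers = [embeddings[0]]
    while len(centers) < k:
        far = max(range(n), key=lambda i: min(_dist2(embeddings[i], c) for c in centers))
        centers.append(embeddings[far])
    assign = [0] * n
    for _ in range(iters):
        assign = [min(range(k), key=lambda j: _dist2(e, centers[j])) for e in embeddings]
        new_centers = []
        for j in range(k):
            members = [embeddings[i] for i in range(n) if assign[i] == j]
            new_centers.append(_mean(members) if members else centers[j])
        if new_centers == centers:
            break  # assignments have stabilized
        centers = new_centers
    clusters = [[i for i in range(n) if assign[i] == j] for j in range(k)]
    return [c for c in clusters if c]
```

Because the input is just a list of vectors, nothing ties a cluster to a single document, which matches the cross-report behavior described above. A production build would more likely call a library GMM implementation and split any cluster that still exceeds the summarizer's input budget.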

Abstractive summarization per cluster

For each cluster, prompt an LLM to write a concise summary that preserves facts, numbers, and named entities from member chunks. Embed the summary text. Each summary node becomes the parent of the nodes in its cluster, and together the summaries form a new layer. Repeat: cluster the summary embeddings, summarize again, until one root node remains or cluster count falls below a threshold. The result is a tree (often bushy rather than strictly binary) with leaves at the bottom and progressively more abstract nodes toward the root.
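
The cluster-then-summarize loop can be sketched as follows. Everything here is illustrative: cluster_fn and summarize_fn are hypothetical callables you would supply, and the two stubs at the bottom (fixed-size grouping and string joining) only stand in for real clustering and a real LLM call. Embedding each new summary before the next clustering pass is omitted for brevity.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    node_id: str
    text: str
    depth: int
    child_ids: list = field(default_factory=list)

def build_tree(chunks, cluster_fn, summarize_fn, min_nodes=1):
    """Build layers bottom-up until at most `min_nodes` nodes remain at the top."""
    layer = [Node(f"n0-{i}", text, 0) for i, text in enumerate(chunks)]
    all_nodes = list(layer)
    depth = 0
    while len(layer) > min_nodes:
        depth += 1
        groups = cluster_fn(layer)  # list of lists of Node
        if len(groups) >= len(layer):
            break  # clustering no longer shrinks the layer; stop instead of looping forever
        layer = [
            Node(f"n{depth}-{j}", summarize_fn([n.text for n in g]), depth,
                 [n.node_id for n in g])
            for j, g in enumerate(groups)
        ]
        all_nodes.extend(layer)
    return all_nodes

# Stand-in stubs, for illustration only.
def group_in_threes(nodes):
    return [nodes[i:i + 3] for i in range(0, len(nodes), 3)]

def join_summary(texts):
    return " | ".join(texts)
```

With nine input chunks and these stubs, the loop produces three summaries at depth 1 and one root at depth 2. The early exit guards against a clustering function that returns one group per node, which would otherwise never terminate.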

Storage and provenance

Store each node with: node ID, parent ID, child IDs, text (chunk or summary), embedding, depth level, and source chunk citations for leaves. Summary nodes should record which leaf IDs they cover so answers can drill down for verification.
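
One possible shape for the stored record, plus a helper that resolves a summary node to the leaves it covers, is sketched below. The field names and the source format are assumptions, not a schema prescribed by RAPTOR.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class TreeNode:
    node_id: str
    text: str            # chunk text for leaves, summary text otherwise
    depth: int           # 0 for leaves
    embedding: list
    parent_id: Optional[str] = None
    child_ids: list = field(default_factory=list)
    source: Optional[str] = None  # leaves only; hypothetical "doc ID + page" citation

def covered_leaves(node_id, nodes):
    """Return the IDs of all leaves under `node_id` (the node itself if it is a leaf)."""
    node = nodes[node_id]
    if not node.child_ids:
        return [node_id]
    leaves = []
    for child_id in node.child_ids:
        leaves.extend(covered_leaves(child_id, nodes))
    return leaves
```

Precomputing the result of covered_leaves and storing it on each summary node avoids the recursive walk at answer time, at the cost of some redundancy in storage.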

Query-time retrieval: collapsed-tree search

The RAPTOR paper describes two query strategies: layer-by-layer tree traversal and collapsed-tree retrieval. Most implementations use collapsed-tree retrieval rather than walking the tree structurally: embed the query, score similarity against every node in the tree (leaves and all summary levels), and take the top-k nodes. Those nodes are concatenated (with deduplication and token budgeting) into the LLM context window.

Why this works: a thematic query aligns with a high-level summary node; a specific query aligns with a leaf. A mixed query may retrieve one summary plus several supporting leaves. Some teams add a lightweight reranker or filter to drop redundant parent-child pairs when both appear in top-k (keeping the more specific leaf when the parent summary adds no new facts).
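
A minimal sketch of collapsed-tree search follows, assuming nodes are plain dicts with node_id, embedding, text, and child_ids keys (an illustrative layout). The token estimate is a crude word count, and the deduplication rule is a simplification: it drops a summary whenever one of its direct children was also selected, without checking whether the summary adds new facts.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def collapsed_search(query_vec, nodes, k=6, token_budget=2000):
    """Score every node regardless of depth, keep the top-k that fit the budget."""
    ranked = sorted(nodes, key=lambda n: cosine(query_vec, n["embedding"]), reverse=True)
    picked, used = [], 0
    for node in ranked[:k]:
        cost = len(node["text"].split())  # crude token estimate (assumption)
        if used + cost > token_budget:
            continue  # skip nodes that would overflow the context budget
        picked.append(node)
        used += cost
    # Simplified dedup: drop a summary if one of its direct children was also picked.
    picked_ids = {n["node_id"] for n in picked}
    return [n for n in picked if not (picked_ids & set(n["child_ids"]))]
```

In a real system the brute-force sort would be replaced by a vector index lookup; the point of the sketch is only that leaves and summaries compete in one ranking.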

Tuning retrieval depth

Top-k per level — cap how many nodes from each depth enter the context to avoid stuffing only summaries or only leaves.
Similarity thresholds — discard summary nodes below a floor so generic root summaries do not dominate every query.
Hybrid fusion — combine collapsed-tree scores with BM25 on leaves for ticker symbols and exact figures.
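
The three knobs above could be sketched as two small helpers. The per-depth cap and the summary floor follow the list directly; for hybrid fusion the sketch uses reciprocal rank fusion, which is one common way to merge a dense ranking with a BM25 ranking and is an assumption here, since the text does not say which fusion method is intended. The default values are placeholders to tune.

```python
from collections import defaultdict

def cap_per_depth(scored, per_depth=3, summary_floor=0.35):
    """scored: list of (node_id, depth, similarity) in any order."""
    kept, counts = [], defaultdict(int)
    for node_id, depth, sim in sorted(scored, key=lambda s: s[2], reverse=True):
        if depth > 0 and sim < summary_floor:
            continue  # weak summary match, likely a generic high-level node
        if counts[depth] >= per_depth:
            continue  # this depth has already used its share of the context
        counts[depth] += 1
        kept.append((node_id, depth, sim))
    return kept

def rrf_fuse(dense_ranking, bm25_ranking, k=60):
    """Reciprocal rank fusion over two ranked lists of node IDs."""
    scores = defaultdict(float)
    for ranking in (dense_ranking, bm25_ranking):
        for rank, node_id in enumerate(ranking, start=1):
            scores[node_id] += 1.0 / (k + rank)
    # Ties are broken by node ID so the output is deterministic.
    return sorted(scores, key=lambda n: (-scores[n], n))
```

Rank-based fusion sidesteps the problem that cosine similarities and BM25 scores live on different scales, which is why it is a convenient default before trying weighted score blending.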

Harbor Analytics refactor: research QA across abstraction levels

Harbor’s baseline was dense RAG with 400-token chunks and cross-encoder reranking. Entity-specific questions (ticker + metric + quarter) already performed well. The gap was thematic baskets: semiconductors, regional banks, and energy transition names each had 40–120 reports with scattered commentary.

The team built a RAPTOR tree with three summary levels above the leaves over the research corpus: ~180k leaf chunks, ~22k level-1 summaries, ~2.8k level-2 summaries, and a single root sector overview per rebuild. Index jobs ran weekly after the research feed updated; new reports triggered partial subtree rebuilds for affected tickers only. Collapsed retrieval used top-6 nodes with reranking; thematic recall@10 rose from 54% to 83% while entity-specific recall dipped slightly (71% to 69%) until they added BM25 hybrid fusion, which restored entity recall to 72%. Analyst time per thematic scan dropped an estimated 35%.
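
How Harbor implemented partial rebuilds is not described, so the following is only a sketch of one way to find which summaries need regenerating after some leaves change. It assumes cluster membership stays fixed and that parent_of and depth_of lookups (hypothetical names) are available from the stored tree.

```python
def dirty_ancestors(changed_leaf_ids, parent_of, depth_of):
    """Return summary node IDs whose subtree contains a changed leaf, lowest depth first."""
    seen = set()
    for leaf_id in changed_leaf_ids:
        node_id = parent_of.get(leaf_id)
        # Walk up until the root or an ancestor already marked dirty.
        while node_id is not None and node_id not in seen:
            seen.add(node_id)
            node_id = parent_of.get(node_id)
    # Lower summaries must be regenerated before the summaries built on top of them.
    return sorted(seen, key=lambda n: (depth_of[n], n))
```

Newly added chunks are a harder case than edited ones, because they first have to be assigned to an existing cluster (or trigger a re-cluster of that neighborhood) before this walk applies.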

Technique decision table

Approach	Strength	Weakness	Best when
Flat chunk RAG	Simple, cheap index	Weak on thematic / multi-doc synthesis	FAQ, single-doc QA, exact lookup
RAPTOR tree	Multi-level retrieval without query-time summarization	Index cost; summary hallucination risk	Mixed detail + theme questions over stable corpora
Graph RAG	Explicit entities and relations	Heavy extraction ontology; graph maintenance	Contract portfolios, supply chains, legal graphs
Map-reduce at query time	Exhaustive per-question coverage	High latency and cost per query	One-off diligence on a single huge document
Parent-child chunking	Better local context windows	No cross-chunk thematic summaries	Long single documents; limited cross-doc themes

Common pitfalls

Summary drift — abstractive summaries invent or smooth numbers; enforce extractive anchors or citation back to leaf chunks.
Clusters too large — summarizing 30 chunks in one pass loses detail; cap cluster size and split with tighter GMM components.
Stale trees — updated reports without partial rebuild leave wrong themes in summary nodes; version trees with document timestamps.
Root summary dominates — the top-level node matches every vague query; apply depth penalties or minimum similarity cuts.
Redundant parent-child context — stuffing both summary and all its leaves wastes tokens; deduplicate or prefer leaves when scores tie.
Ignoring hybrid search — tickers, ISINs, and dollar amounts still need lexical retrieval on leaves.
Full rebuild on every ingest — large corpora need incremental clustering for affected subtrees only.
Evaluating only on leaf questions — RAPTOR's value is thematic; maintain a separate abstract-question gold set.
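
For the last pitfall, a sketch of scoring thematic and entity-specific questions separately is shown below. It uses one common definition of recall@k (the share of gold-relevant node IDs found in the top-k results); the example dict layout and the type labels are assumptions.

```python
from collections import defaultdict

def recall_at_k(retrieved_ids, relevant_ids, k=10):
    """Fraction of relevant IDs that appear among the top-k retrieved IDs."""
    if not relevant_ids:
        return 0.0
    top = set(retrieved_ids[:k])
    return len(top & set(relevant_ids)) / len(relevant_ids)

def recall_by_type(examples, k=10):
    """examples: dicts with 'type', 'retrieved', 'relevant'. Returns mean recall@k per type."""
    buckets = defaultdict(list)
    for ex in examples:
        buckets[ex["type"]].append(recall_at_k(ex["retrieved"], ex["relevant"], k))
    return {qtype: sum(vals) / len(vals) for qtype, vals in buckets.items()}
```

Reporting the two numbers side by side makes a regression on one question type visible even when the blended average improves.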

Production checklist

Chunk corpus with stable IDs; embed leaves with production retrieval model.
Choose cluster size (4–10 chunks) and max tree depth for your corpus scale.
Prompt summaries to preserve numbers, dates, and entity names; log summary inputs.
Store parent/child links and leaf provenance on every summary node.
Build collapsed-tree vector index over all nodes (a single FAISS index or pgvector table, with depth stored as metadata alongside each vector).
Add BM25 or sparse retrieval on leaves for exact-match queries.
Implement parent-child deduplication in context assembly.
Eval thematic and entity-specific question sets separately; track recall@k and citation accuracy.
Schedule incremental rebuilds when source documents change.
Monitor summary token cost; cache cluster summaries across rebuilds when membership is unchanged.
Expose retrieved node depth in the UI so users can tell summary hits from verbatim ones.
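
The summary-caching item can be sketched with a content-addressed key, so that a cluster whose members are unchanged reuses its previous summary. The function names, the in-memory dict cache, and the choice of SHA-256 are assumptions; summarize_fn stands in for the LLM call.

```python
import hashlib

def membership_key(member_texts):
    """Stable key for a cluster: hash of its member texts, independent of member order."""
    digest = hashlib.sha256()
    member_hashes = sorted(
        hashlib.sha256(text.encode("utf-8")).hexdigest() for text in member_texts
    )
    for member_hash in member_hashes:
        digest.update(member_hash.encode("ascii"))
    return digest.hexdigest()

def summarize_cached(member_texts, summarize_fn, cache):
    """Call summarize_fn only when this exact membership has not been summarized before."""
    key = membership_key(member_texts)
    if key not in cache:
        cache[key] = summarize_fn(member_texts)
    return cache[key]
```

Keying on member text rather than member IDs means an edited chunk invalidates the cached summary automatically, which also helps with the stale-tree pitfall.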

Key takeaways

RAPTOR builds a tree of LLM summaries over clustered chunks so retrieval can match both specific details and thematic abstractions.
Collapsed-tree search scores the query against every node level, not just leaves.
Index-time summarization amortizes cost versus per-query map-reduce.
Harbor Analytics raised thematic recall@10 from 54% to 83% on equity research with a three-level RAPTOR index plus hybrid BM25.
Graph RAG fits explicit relationship queries; RAPTOR fits soft thematic clustering without a hand-built ontology.
Guard against summary hallucination with leaf citations and separate eval sets for abstract questions.