LLM map-reduce document processing explained

Harbor Legal ran due-diligence Q&A over 400-page merger agreements by stuffing the first 120k tokens into a 128k-window model. Counsel checklists asked about indemnity caps, change-of-control triggers, and earn-out cliffs, clauses that often sit past page 200. On a 50-deal eval set, the bot answered only 68% of checklist items correctly; three missed indemnity caps would have left the client with $12M in uncapped exposure. Swapping to a map-reduce pipeline (parallel per-chunk extraction in the map phase, then hierarchical synthesis in the reduce phase) lifted full-document recall to 91% without upgrading the generation model. This guide covers when map-reduce beats single-pass prompts, how to design map and reduce prompts, how recursive tree collapse handles book-length corpora, and how to control cost, followed by the Harbor Legal refactor, a technique decision table, common pitfalls, and a production checklist.

Map-reduce for LLMs borrows the classic distributed computing pattern: split a document into chunks small enough for reliable attention, run an identical map function on each chunk in parallel, then reduce partial outputs into a final answer or summary. Unlike RAG, which retrieves only top-k chunks per query, map-reduce can touch every section of a fixed corpus — ideal for exhaustive diligence, legislative review, and contract comparison where omissions are costly.

Why single-pass long context fails

Modern models advertise 128k–1M token windows, but three constraints still push teams toward map-reduce:

Lost-in-the-middle recall: models overweight the beginning and end of a prompt, so facts in the middle of an 80k-token paste are missed more often than facts at the edges. See our lost in the middle guide for the U-shaped attention curve.
Cost and latency: prefilling 200k tokens on every question burns input tokens even when only one section matters; map-reduce maps once per document revision and reuses cached partials.
Failure blast radius: one malformed table or OCR glitch in a giant paste can derail the entire answer; per-chunk maps isolate errors to a single section.

Map-reduce does not replace giant context windows; it composes with them. Each map call fits comfortably inside an 8–16k token budget where attention is reliable; reduce steps combine summaries and may themselves use a 32k+ window. The pattern trades more API calls for higher recall on exhaustive tasks.

Map phase: per-chunk extraction

Split the source document using the same discipline as RAG chunking: respect headings, keep 10–15% token overlap between adjacent chunks, and attach metadata (doc_id, page_range, section_path). For each chunk c_i, run a map prompt. Two common map modes:

Extraction map (QA and diligence)

Pass the user question (or a fixed checklist template) plus chunk text. Ask the model to return structured JSON: findings relevant to the question, direct quotes with page anchors, and null if nothing applies. Temperature 0; enforce a schema so downstream reduce can merge without parsing prose.
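A minimal sketch of what schema enforcement on an extraction map output could look like. The field names (claim, quote, page) and the parse_map_output helper are illustrative assumptions, not a schema the pipeline prescribes; the verbatim-quote check is one simple guard against invented citations.

```python
import json
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Finding:
    claim: str  # what the chunk says about the checklist item
    quote: str  # text copied verbatim from the chunk
    page: int   # page anchor for the quote


def parse_map_output(raw: str, chunk_text: str) -> Optional[List[Finding]]:
    """Validate one map response; return None when nothing in the chunk applies."""
    data = json.loads(raw)
    if data["findings"] is None:
        return None
    # Finding(**item) raises TypeError on missing or unexpected fields,
    # which acts as a basic schema check.
    findings = [Finding(**item) for item in data["findings"]]
    for finding in findings:
        # Reject quotes that do not appear verbatim in the source chunk.
        if finding.quote not in chunk_text:
            raise ValueError(f"quote not found in chunk: {finding.quote!r}")
    return findings


if __name__ == "__main__":
    chunk = "8.2 Indemnity. Seller's aggregate liability shall not exceed $5,000,000."
    raw = (
        '{"findings": [{"claim": "Indemnity cap of $5M", '
        '"quote": "shall not exceed $5,000,000", "page": 212}]}'
    )
    print(parse_map_output(raw, chunk))
    print(parse_map_output('{"findings": null}', chunk))
```

Returning None for an empty chunk keeps "nothing found" distinct from "map failed", which matters later when checking coverage.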

Summarization map (narrative collapse)

Ask for a dense partial summary of the chunk only — no cross-chunk inference. Cap output at 150–300 tokens using chain-of-density style instructions: entity-dense, no filler. Store map_output_i keyed by hash(doc_revision + chunk_offset + prompt_version).
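The cache key above can be sketched with the standard library. The function name map_cache_key and the choice of SHA-256 are assumptions for illustration; any stable hash would do.

```python
import hashlib


def map_cache_key(doc_revision: str, chunk_offset: int, prompt_version: str) -> str:
    """Build a stable cache key for one map output."""
    # Join the parts with a separator character so that, for example,
    # ("rev1", 23) and ("rev12", 3) do not concatenate to the same string.
    # Assumes the separator never appears inside the fields themselves.
    payload = "\x1f".join([doc_revision, str(chunk_offset), prompt_version])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    print(map_cache_key("rev1", 23, "v1"))
```

Including the prompt version in the key means a prompt edit invalidates old partials automatically instead of silently mixing outputs from two prompt generations.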

Run maps concurrently with a worker pool (async batch API or 20–50 parallel chat completions). Typical chunk size is 2k–4k tokens of source text: small enough that the model attends to every sentence, large enough to keep API overhead manageable. A 400-page contract at ~500 tokens/page is about 200k tokens, or ~80 map calls at roughly 2.5k tokens each. At $0.15/M input on a mid-tier model, one pass over the document costs about $0.03 in input tokens, which leaves ample headroom under $2 for overlap, prompt overhead, and output tokens.
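One way to bound concurrency and retry failed chunks is an asyncio semaphore with exponential backoff. This is a sketch under stated assumptions: fake_model is a hypothetical stand-in for a real chat-completion call, and the retry count and delays are placeholder values.

```python
import asyncio
import random


async def fake_model(chunk: str) -> str:
    """Hypothetical stand-in for a real chat-completion request."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return f"partial for {len(chunk)} chars"


async def map_one(chunk, call, sem, retries=3, base_delay=1.0):
    """Run one map call, retrying with exponential backoff plus jitter."""
    for attempt in range(retries):
        try:
            # Hold the semaphore only while the request is in flight,
            # so a chunk that is backing off does not block other workers.
            async with sem:
                return await call(chunk)
        except Exception:
            if attempt == retries - 1:
                raise
            delay = base_delay * 2 ** attempt + random.random() * base_delay
            await asyncio.sleep(delay)


async def map_all(chunks, call, concurrency=32, base_delay=1.0):
    """Map every chunk with at most `concurrency` requests in flight."""
    sem = asyncio.Semaphore(concurrency)
    tasks = [map_one(c, call, sem, base_delay=base_delay) for c in chunks]
    # gather returns results in the same order as the input chunks.
    return await asyncio.gather(*tasks)


if __name__ == "__main__":
    print(asyncio.run(map_all(["alpha", "beta beta"], fake_model, concurrency=2)))
```

Because gather preserves input order, the result list lines up with the chunk list, so chunk metadata can be reattached by index.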

Reduce phase: synthesis and recursive collapse

The reduce step consumes all map outputs and produces the deliverable. Three reduce patterns cover most production needs:

Single-shot reduce

Concatenate map JSON or summaries (with section headers) into one prompt and ask for a final answer or executive summary. Works when combined map outputs stay under ~60% of the model’s safe context budget after accounting for the question and system prompt.
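The 60% rule can be checked before choosing a reduce strategy. The sketch below reads the rule as "60% of the budget left after subtracting the question and system prompt", which is one interpretation; the function and parameter names are assumptions.

```python
def fits_single_shot(partial_tokens, safe_budget, question_tokens,
                     system_tokens, fraction=0.6):
    """Return True when all map outputs fit in one reduce prompt."""
    # Budget left for map outputs once the fixed prompt parts are counted.
    available = safe_budget - question_tokens - system_tokens
    return sum(partial_tokens) <= fraction * available


if __name__ == "__main__":
    partials = [250] * 80  # 80 map outputs of about 250 tokens each
    print(fits_single_shot(partials, 32_000, 200, 800))   # False: 20,000 > 18,600
    print(fits_single_shot(partials, 128_000, 200, 800))  # True: 20,000 <= 76,200
```

When the check fails, the same partials can be routed to a recursive reduce instead of being truncated.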

Recursive tree reduce

When map outputs exceed one window, batch them into groups of 8–12 partials and run an intermediate reduce that synthesizes each batch. Repeat until one root summary remains. With a fan-in of 10, a 200-chunk book needs ⌈log₁₀ 200⌉ = 3 reduce tiers (200 → 20 → 2 → 1). Each tier preserves citations by requiring the reduce prompt to carry forward source_chunk_id references rather than paraphrasing away provenance.
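The tiering can be sketched as a loop that batches partials until one remains. Partial, fake_reduce, and tree_reduce are hypothetical names; fake_reduce stands in for the LLM reduce call and only demonstrates how source_chunk_id references are carried forward.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class Partial:
    text: str
    source_chunk_ids: List[int]


def fake_reduce(batch: List[Partial]) -> Partial:
    """Stand-in for an LLM reduce call; merges citations from the batch."""
    ids = [cid for p in batch for cid in p.source_chunk_ids]
    return Partial(text=f"synthesis of {len(batch)} partials", source_chunk_ids=ids)


def tree_reduce(partials: List[Partial],
                reduce_fn: Callable[[List[Partial]], Partial],
                fan_in: int = 10) -> Tuple[Partial, int]:
    """Collapse partials tier by tier; return the root and the tier count."""
    if not partials:
        raise ValueError("nothing to reduce")
    level = list(partials)
    tiers = 0
    while len(level) > 1:
        # Each tier reduces consecutive groups of `fan_in` partials.
        level = [reduce_fn(level[i:i + fan_in])
                 for i in range(0, len(level), fan_in)]
        tiers += 1
    return level[0], tiers


if __name__ == "__main__":
    leaves = [Partial(f"chunk {i}", [i]) for i in range(200)]
    root, tiers = tree_reduce(leaves, fake_reduce, fan_in=10)
    print(tiers, len(root.source_chunk_ids))  # 3 200
```

With 200 leaves and a fan-in of 10, the levels shrink 200 → 20 → 2 → 1, and the root still references all 200 source chunks.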

Map-reduce-refine

After the first reduce, run a lightweight refine pass: show the draft answer plus only the map chunks whose embeddings are nearest the draft (top 5). The refine model corrects omissions without re-reading the entire corpus. Useful when recursive reduce over-smooths numeric details.
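Selecting the chunks for the refine pass amounts to a nearest-neighbor lookup. The sketch below uses plain cosine similarity over in-memory vectors; the embeddings themselves are assumed to come from whatever embedding model the pipeline already uses, and nearest_chunks is a hypothetical helper.

```python
import math
from typing import Dict, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 when either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def nearest_chunks(draft_vec: Sequence[float],
                   chunk_vecs: Dict[str, Sequence[float]],
                   k: int = 5) -> List[str]:
    """Return the ids of the k chunks most similar to the draft answer."""
    ranked = sorted(chunk_vecs,
                    key=lambda cid: cosine(draft_vec, chunk_vecs[cid]),
                    reverse=True)
    return ranked[:k]


if __name__ == "__main__":
    vecs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
    print(nearest_chunks([1.0, 0.0], vecs, k=2))  # ['a', 'c']
```

For a few hundred chunks a linear scan like this is enough; a vector index only becomes worthwhile at much larger scale.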

For question answering, reduce merges extraction JSON with de-duplication rules: prefer the highest-confidence quote when two chunks cite the same clause; surface conflicts explicitly (“Section 8.2 says $5M; Exhibit C says $12M”) instead of averaging.
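The merge rules can be sketched as two steps: keep the highest-confidence finding per clause, then flag a conflict if the surviving clauses disagree. The dict keys (clause, value, confidence, chunk_id) are assumed field names for illustration.

```python
from typing import Dict, List, Optional, Tuple


def merge_findings(findings: List[Dict]) -> Tuple[List[Dict], Optional[str]]:
    """Dedupe findings per clause and report conflicting values, if any."""
    best: Dict[str, Dict] = {}
    for f in findings:
        current = best.get(f["clause"])
        # Keep only the highest-confidence finding for each clause.
        if current is None or f["confidence"] > current["confidence"]:
            best[f["clause"]] = f
    kept = list(best.values())
    conflict = None
    # Different clauses reporting different values is surfaced, not averaged.
    if len({f["value"] for f in kept}) > 1:
        conflict = "; ".join(f'{f["clause"]} says {f["value"]}' for f in kept)
    return kept, conflict


if __name__ == "__main__":
    findings = [
        {"clause": "Section 8.2", "value": "$5M", "confidence": 0.7, "chunk_id": 41},
        {"clause": "Section 8.2", "value": "$5M", "confidence": 0.9, "chunk_id": 42},
        {"clause": "Exhibit C", "value": "$12M", "confidence": 0.8, "chunk_id": 97},
    ]
    kept, conflict = merge_findings(findings)
    print(len(kept))  # 2
    print(conflict)   # Section 8.2 says $5M; Exhibit C says $12M
```

Doing this step in code rather than in the reduce prompt keeps the conflict message deterministic; the LLM reduce then only has to phrase the result.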

Map-reduce vs RAG vs full-document paste

Scenario	Best fit	Why
Ad-hoc questions over a 10k-token FAQ	Single-pass or small RAG	Corpus fits one window; map overhead not justified
Open-ended search across 50k wiki pages	RAG + rerank	Cannot map every page per query; retrieval selects candidates
Exhaustive checklist on one 300-page contract	Map-reduce extraction	Must read every section; omissions are legal risk
Executive summary of quarterly 10-K + exhibits	Map-reduce summarization	Hierarchical collapse preserves structure across filings
Real-time chat while user edits a doc	RAG on dirty draft + cached map partials	Re-map only changed chunks; retrieve for topical questions

Hybrid architectures are common: map once at ingest, store partial summaries in a vector index, and use RAG for exploratory questions while map-reduce runs on demand for audit checklists. Cached map outputs amortize cost across many reduce queries on the same document version.

Harbor Legal merger-review refactor

Harbor Legal’s M&A desk processed 40–60 active deals. Each data room dropped a purchase agreement, disclosure schedules, and material contracts into a shared S3 prefix. The refactor replaced single-pass Q&A with a five-stage map-reduce pipeline:

Structure-aware chunking — PDF layout parser split on headings and exhibit boundaries; 3.5k-token chunks with 400-token overlap; page anchors stored in metadata.
Checklist map — 22 diligence prompts (indemnity, MAC clauses, non-compete, IP assignment, etc.) × every chunk; async batch via JSON schema outputs; maps cached by deal ID + doc hash.
Per-question reduce — merge extraction JSON for one checklist item across all chunks; dedupe by clause reference; flag conflicts.
Counsel review UI — each answer linked to source quotes and page thumbnails; attorneys override false positives without re-running maps.
Incremental re-map — amended pages trigger map only for affected chunks; reduce recomputes from cached partials.
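The chunk size and overlap from the first stage can be sketched as a sliding window over tokens. This simplified version ignores heading and exhibit boundaries, which the real pipeline respects, and treats any list of tokens as input; chunk_tokens is a hypothetical helper.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int = 3500, overlap: int = 400) -> List[Dict]:
    """Split tokens into windows of `size` that share `overlap` tokens."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than chunk size")
    stride = size - overlap
    chunks = []
    start = 0
    while start < len(tokens):
        end = min(start + size, len(tokens))
        # Offsets are kept as metadata so findings can be traced to a position.
        chunks.append({"start": start, "end": end, "tokens": tokens[start:end]})
        if end == len(tokens):
            break
        start += stride
    return chunks


if __name__ == "__main__":
    # About 200k tokens, roughly a 400-page agreement at ~500 tokens per page.
    doc = ["tok"] * 200_000
    chunks = chunk_tokens(doc)
    print(len(chunks))                            # 65
    print(chunks[0]["end"] - chunks[1]["start"])  # 400
```

At these settings each new chunk advances 3,100 tokens, so a 200k-token agreement yields 65 chunks per checklist prompt.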

On the 50-deal eval set, checklist recall rose from 68% to 91%, and false-negative indemnity misses dropped from 6 deals to zero. End-to-end latency for a fresh 400-page agreement was 4.2 minutes (maps parallelized at 64 workers) versus 38 seconds for the old single-pass path, which is acceptable because diligence runs overnight, not in chat. Map cache hit rate on amended deals averaged 87%, so only about 13% of chunks were re-mapped, cutting re-run map cost roughly 8×.
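A rough sketch of how incremental re-map and the cache arithmetic fit together. It assumes map cost scales with the number of chunks mapped and ignores reduce cost; chunks_to_remap and rerun_cost_factor are hypothetical helpers.

```python
import hashlib
from typing import Iterable, List, Set


def content_hash(text: str) -> str:
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_remap(chunks: Iterable[str], cached_hashes: Set[str]) -> List[int]:
    """Return the indices of chunks whose content is not already cached."""
    return [i for i, text in enumerate(chunks)
            if content_hash(text) not in cached_hashes]


def rerun_cost_factor(hit_rate: float) -> float:
    """Cost reduction when only cache misses are re-mapped."""
    return 1 / (1 - hit_rate)


if __name__ == "__main__":
    old = ["recitals", "indemnity cap $5M", "governing law"]
    cache = {content_hash(c) for c in old}
    amended = ["recitals", "indemnity cap $12M", "governing law"]
    print(chunks_to_remap(amended, cache))     # [1]
    print(round(rerun_cost_factor(0.87), 1))   # 7.7
```

An 87% hit rate means 13% of chunks are re-mapped, which works out to about a 7.7× reduction in map cost.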

Technique decision table

Goal	Prefer	Avoid
Must-not-miss exhaustive review	Map-reduce extraction with structured JSON maps	Single-pass truncation of tail pages
Book-length narrative summary	Recursive tree reduce on summarization maps	One-shot “summarize entire PDF”
Low-latency user chat	RAG on pre-mapped chunk summaries	Fresh full map-reduce per message
Tight ingest budget	Larger chunks + fewer maps; validate recall on eval set	500-token micro-chunks on clean prose
Numeric tables and exhibits	Table-aware chunking; extraction map with cell coordinates	Plain-text strip that destroys column alignment
Frequently amended documents	Content-hash cache per chunk map output	Full re-map on any byte change
Cross-document comparison	Map each doc independently; reduce joins on clause taxonomy	Concatenating two 200-page PDFs into one prompt

Common pitfalls

Map prompts that reason across chunks — maps must be strictly local; cross-chunk inference belongs in reduce or a dedicated compare step.
Reduce hallucination on conflicts — when two chunks disagree, reduce must surface both quotes, not blend into a compromise number.
Overlapping chunks without dedupe — the same indemnity paragraph in two overlaps yields duplicate findings; dedupe on normalized clause ID.
Ignoring map failures — one timed-out chunk silently drops a section; retry maps and block reduce until coverage hits 100%.
Prose-only map outputs — unstructured summaries are expensive to merge; JSON schemas pay for themselves in reduce reliability.
Recursive reduce that drops citations — each tier must carry source_chunk_id forward or audit trails break.
Re-mapping on every question — cache maps at document revision; run different reduce prompts against the same partials.
Chunk boundaries through tables — split mid-row and maps invent cell values; use layout-aware ingestion.
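The coverage rule in "Ignoring map failures" can be enforced with a small gate in front of the reduce step. This is a sketch; require_full_coverage is a hypothetical helper, and it checks key presence only, since an extraction map may legitimately store None for a chunk with no findings.

```python
from typing import Dict, Iterable


def require_full_coverage(expected_ids: Iterable[int], map_outputs: Dict[int, object]) -> bool:
    """Raise if any expected chunk has no stored map output."""
    # A chunk counts as covered when its id is present, even if the stored
    # output is None (meaning the map ran and found nothing relevant).
    missing = sorted(set(expected_ids) - set(map_outputs))
    if missing:
        raise RuntimeError(f"map coverage incomplete, missing chunks: {missing}")
    return True


if __name__ == "__main__":
    print(require_full_coverage([0, 1, 2], {0: "a", 1: None, 2: "c"}))  # True
    try:
        require_full_coverage([0, 1, 2], {0: "a", 2: "c"})
    except RuntimeError as err:
        print(err)  # map coverage incomplete, missing chunks: [1]
```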

Production checklist

Define whether the task is exhaustive (map all chunks) or selective (RAG + map on demand).
Chunk with heading boundaries, overlap, and page/section metadata.
Design map prompts with JSON schema outputs and temperature 0.
Size chunks so each map call uses <50% of the model’s reliable attention band.
Parallelize maps with concurrency limits and per-chunk retry/backoff.
Cache map outputs keyed by document revision hash + chunk offset + prompt version.
Implement recursive reduce when combined map outputs exceed ~60% of context budget.
Dedupe and conflict-detect in reduce before presenting answers to users.
Attach source quotes and page anchors to every reduce finding.
Eval on a labeled set with tail-page and middle-page gold facts before production.
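For the last item, recall can be broken down by where each gold fact sits in the document. The sketch below splits the document into thirds by page; the bucket boundaries and the recall_by_position helper are assumptions, not a fixed methodology.

```python
from typing import Dict, Iterable, Optional, Set, Tuple


def recall_by_position(gold: Iterable[Tuple[str, int]],
                       found: Set[str],
                       total_pages: int) -> Dict[str, Optional[float]]:
    """Recall per document third; None when a bucket has no gold facts."""
    counts = {"head": [0, 0], "middle": [0, 0], "tail": [0, 0]}  # [hits, total]
    for fact_id, page in gold:
        position = page / total_pages
        if position <= 1 / 3:
            bucket = "head"
        elif position <= 2 / 3:
            bucket = "middle"
        else:
            bucket = "tail"
        counts[bucket][1] += 1
        if fact_id in found:
            counts[bucket][0] += 1
    return {name: (hits / total if total else None)
            for name, (hits, total) in counts.items()}


if __name__ == "__main__":
    gold = [("cap", 10), ("mac", 200), ("earnout", 210), ("noncompete", 390)]
    found = {"cap", "mac", "noncompete"}
    print(recall_by_position(gold, found, total_pages=400))
    # {'head': 1.0, 'middle': 0.5, 'tail': 1.0}
```

A single aggregate recall number can hide a weak middle or tail bucket, which is exactly the failure mode map-reduce is meant to fix.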

Key takeaways

Map-reduce lets LLMs process documents longer than a single reliable attention pass by extracting locally, then synthesizing globally.
Use structured extraction maps for diligence checklists; use dense summarization maps for narrative collapse.
Recursive tree reduce handles book-length corpora; cache map partials so reduce queries are cheap.
Map-reduce complements RAG: map for exhaustive coverage on fixed docs, RAG for open search across large libraries.
Harbor Legal lifted diligence recall from 68% to 91% on 400-page agreements with parallel checklist maps and cached incremental re-map.