Guide

LLM lost in the middle explained

Harbor Legal's contract review assistant advertised a 200K-token context window. Counsel pasted a 90-page vendor MSA, asked whether the indemnity cap exceeded $2 million, and the model confidently answered “no cap found” — while the $5 million ceiling sat in Section 14.3 on page 47, squarely in the middle of the document. Legal approved three deals on bad reads before an associate spotted the pattern: facts at the start and end were cited accurately; anything buried mid-prompt was effectively invisible.

Lost in the middle is the empirical finding that large language models recall information from the beginning and end of a long context more reliably than from the center — even when the total prompt fits inside the advertised window. It is distinct from hitting a context limit (you ran out of tokens) and from poor chunking (you never retrieved the right passage). This guide covers the U-shaped recall curve, placement strategies for RAG and agents, the Harbor Legal refactor, a technique decision table vs giant-context models and compression, pitfalls, and a production checklist.

The U-shaped recall curve

Researchers measuring “needle in a haystack” retrieval across context lengths found a consistent pattern: when a single fact is inserted at varying positions in an otherwise irrelevant document, models answer correctly most often when the fact appears near the beginning or end of the prompt, and least often when it sits in the middle third. The accuracy curve is U-shaped, not flat — more context does not automatically mean equal attention to every token.

Two related biases drive the effect:

  • Primacy — early tokens (system instructions, first retrieved chunks, document headers) receive strong attention in prefill and anchor the model's framing.
  • Recency — tokens closest to the generation point (the latest user message, final summary, trailing citations) dominate what the decoder attends to when producing the answer.

Middle positions compete with thousands of surrounding tokens. Softmax attention spreads probability mass; training data skews toward answers that appear soon after a question; and positional encodings (even RoPE extensions on 128K+ models) do not fully equalize recall across depth. The gap has narrowed on newer models but has not disappeared in production benchmarks.

Where lost-in-the-middle shows up in production

Scenario Why middle content fails Symptom
Full-document paste User dumps PDF text; critical clause is page 40 of 80 Model cites intro and signature block; misses mid-body terms
RAG with naive top-k Retriever returns 20 chunks in score order; answer chunk ranks #11 Irrelevant high-TF-IDF chunks bookend the prompt; answer chunk drowned
Long chat history 50-turn thread; key constraint stated on turn 12 Model follows recent turns; forgets mid-conversation policy
Multi-file agent context Agent reads 8 files sequentially into one scratchpad File #4 findings never surface; files #1 and #8 dominate the summary
Tool observation logs ReAct loop appends 30 observations; bug evidence in observation #15 Agent retries wrong fix; cites first and last tool outputs only

The failure mode looks like hallucination or laziness but is often positional: the model never weighted the middle tokens enough to surface them in the answer.

Context placement strategies

1. Put the question and constraints at both ends

Repeat the user's task after retrieved content, not only before it. A trailing block — “Given the documents above, answer: [question]. Cite section numbers.” — pulls recency bias toward the actual ask. For system prompts, keep non-negotiable policies in the header and a one-line reminder before the user message.

2. Rerank, then reorder by relevance not score

After retrieval, run a cross-encoder reranker and place the top-3 chunks immediately before the user question. Demote lower-ranked chunks to an appendix section or drop them. Never concatenate chunks in arbitrary document order when the task is pinpoint QA.

3. Retrieve instead of stuffing

A 200K window is not a substitute for search. For contracts, logs, and codebases, use section-aware retrieval (headings, clause IDs, function names) so the answer material lands adjacent to the question. See context engineering for layout patterns that treat the prompt as scarce attention budget, not free storage.

4. Hierarchical and map-reduce pipelines

For documents too long to place optimally in one shot, summarize per section first, then answer from section summaries plus the top full-text excerpts. Map-reduce avoids a single middle-buried needle by ensuring each map step sees a short local window. Pair with compression when summaries must feed a second pass.

5. Conversation memory with explicit recall

Do not rely on turn 12 staying salient in turn 50. Persist constraints in a facts block or vector store and re-inject them each turn — the pattern used in agent memory systems. Treat mid-history as archived, not actively attended.

Harbor Legal contract review refactor

Harbor Legal's pre-refactor pipeline: OCR the MSA, concatenate pages in order, prepend a generic “review for risks” system prompt, append the lawyer's question. Average prompt: 62K tokens. Indemnity-cap questions failed 41% of the time when the cap clause sat between pages 30 and 60; 9% when on page 1 or the final schedules.

The refactor:

  1. Clause-indexed retrieval — parse sections by numbered headings; embed each clause separately; retrieve top 8 by question embedding plus keyword match on “indemnity”, “cap”, “liability”.
  2. Rerank + sandwich layout — cross-encoder rerank; place top 3 clauses directly above the user question; move remaining hits to a labeled appendix after the question.
  3. Dual instruction — system header sets review policy; identical one-sentence task restatement immediately before generation.
  4. Abstention gate — if no retrieved clause scores above threshold, respond “no cap language in retrieved sections” instead of scanning the full paste hallucination-free.

Post-refactor indemnity-cap accuracy on the same test set: 94%. Median prompt dropped to 11K tokens because full-document paste was removed. Legal escalations on missed caps fell from 3 per week to near zero over six weeks.

Technique decision table

Approach Best when Lost-in-middle risk
Giant context + full paste Single short doc, exploratory skim, model with strong long-context evals High for pinpoint facts in 50+ page bodies
RAG top-k in score order Large corpus, question-specific lookup Medium — fix with rerank and sandwich placement
Map-reduce / section summaries Books, audit logs, multi-hour transcripts Low if each map window is short; watch summary loss
Context compression Repeated long prefixes, token cost pressure Medium — compression can drop mid-document nuance
Agent memory + facts block Multi-turn tasks, evolving constraints Low when facts are explicitly re-injected each turn

Common pitfalls

  • Assuming window size equals recall quality. Fitting 100K tokens does not mean uniform attention across them.
  • Chronological chunk ordering in RAG. Document order is not relevance order; always rerank for the question.
  • One-shot full-repo context for coding agents. Middle files in the tree listing never get edited; retrieve by symbol reference instead.
  • Evaluating only on short prompts. Needle benchmarks at 4K context do not predict 64K production behavior; test at your real length.
  • Over-compressing before placement. Aggressive summarization removes the exact numbers (caps, dates, IDs) that mid-position retrieval would have preserved.
  • Ignoring recency in tool loops. The last observation steers the next action; surface critical mid-loop findings in a pinned facts block.

Production checklist

  • Needle-in-haystack eval at your production context length and model version.
  • Retrieval reranker in place before chunk concatenation.
  • Top relevant chunks placed immediately above the user question (sandwich layout).
  • Task and constraints restated after long context blocks.
  • Full-document paste disabled or gated behind section retrieval for QA tasks.
  • Chat history summarized or facts-blocked; mid-thread constraints not assumed sticky.
  • Agent scratchpads pin critical findings outside the append-only observation tail.
  • Abstention when retrieval confidence is low instead of guessing from weak middle signal.
  • Metrics: accuracy vs clause position, prompt token length, retrieval hit rank.
  • Regression tests include deliberately middle-buried gold answers.
  • Re-evaluate when switching models — long-context marketing != flat recall.
  • Document layout patterns in context engineering runbooks for the team.

Key takeaways

  • Recall is U-shaped, not uniform, across long prompts.
  • Place what matters next to the question — beginning and end win.
  • RAG without rerank and reorder repeats the middle-loss failure mode.
  • Giant context windows solve capacity, not attention fairness.
  • Test with middle-buried needles at production length before you ship.

Related reading