Guide

LLM long context vs RAG explained

Harbor Analytics built an internal policy assistant over 2,400 compliance PDFs totaling 18 million tokens. The first prototype concatenated the top 50 embedding-ranked chunks into a 128K context window and called it done. Demo questions about headline retention rules scored well. Production probes that buried the answer in chunk 27 — middle of the prompt — failed 41% of the time. Worse, when analysts asked about a subsidiary policy updated last week, the model cited a superseded version still sitting in the static paste because nobody re-ingested the corpus.

The team replaced full-context stuffing with scoped RAG: hybrid retrieval, cross-encoder rerank, freshness metadata, and a 12-chunk synthesis budget. Wrong-policy answers fell from 34% to 9%; p95 cost per query dropped 58% despite an extra retrieval hop. This guide explains what long context actually buys you, what RAG buys you, how lost-in-the-middle breaks naive stuffing, the Harbor Analytics refactor, a technique decision table versus map-reduce and compression, pitfalls, and a production checklist.

Two different problems: capacity vs selection

Teams conflate “our model has a 1M-token window” with “we can feed the model everything it needs.” Those are different capabilities:

  • Capacity — how many tokens fit in one forward pass without truncation. Governed by context window limits and KV-cache memory at serving time.
  • Selection — which tokens deserve attention for this specific question. Governed by retrieval, reranking, and prompt layout.
  • Freshness — whether the pasted text reflects the corpus version the user expects. Governed by index updates, not window size.

Long context solves capacity. RAG solves selection and freshness at scale. Stuffing an entire repo into the prompt gives you capacity without selection — the model still attends U-shaped ( primacy and recency bias ), so middle chunks are effectively invisible on hard queries.

The right question is not “RAG or long context?” but “what belongs in the prompt for this query, and what stays in the index?” Most production systems use both: retrieve a small, relevant set; place it in a long enough window to include citations and conversation history.

Long-context stuffing patterns

“Stuffing” means putting raw source text in the prompt without a retrieval layer filtering per query. Common variants:

Full-document paste

One contract, one paper, one ticket thread — entire file in context. Works when the document fits comfortably (under ~30% of the window), the task requires cross-references across sections, and the doc changes rarely. Breaks when users ask needle questions about clause 14.3 in a 400-page agreement; middle sections lose recall even if they fit technically.

Multi-document concatenation

Rank by embedding similarity, paste top-k until the budget fills. This is what Harbor Analytics shipped first. Problems: k is a guess; irrelevant chunks crowd out marginally useful ones; order follows similarity rank, not attention-optimal layout; no per-chunk freshness signal.

Repository-wide context (coding agents)

Tree listings plus “read these files” tool loops, or vendors that advertise whole-repo context. Files in the middle of the listing get edited less often in studies; symbol-targeted retrieval beats directory dumps for bugfix tasks.

When stuffing wins

  • Single artifact, holistic summary or redline across the whole doc.
  • Total tokens well under window with margin for instructions and output.
  • No index infrastructure yet; one-off analysis with human review.
  • Strong need for verbatim quotes spanning distant sections (pair with section-aware chunking, not naive paste).

RAG patterns and what they add

RAG externalizes selection: embed or sparse-index the corpus, retrieve top-k per query, optionally rerank, synthesize with citations. Benefits at Harbor scale:

  • Corpus larger than any window — 18M tokens indexed; 4K-token synthesis context per query.
  • Incremental updates — re-embed changed PDFs without rebuilding a mega-prompt template.
  • Metadata filters — subsidiary, effective date, jurisdiction before vector search.
  • Audit trail — log which chunk IDs grounded each answer for compliance review.

Costs: index storage, embedding pipeline, retrieval latency (typically 50–200 ms), and engineering for chunking quality. RAG does not eliminate hallucination; it reduces wrong-source answers when retrieval is good. Bad chunks in a small window hurt as much as bad chunks in a large one.

Cost, latency, and the hidden prefill tax

API pricing is usually per input token plus per output token. A 100K-token stuff costs ~25× a 4K-token RAG context on the same model tier. Prefill time scales roughly linearly with input length; p95 latency for Harbor’s 50-chunk paste was 8.4 s versus 2.6 s for 12-chunk RAG (same model).

Prompt caching changes the math for repeated prefixes: if every user query shares the same 80K-token document prefix, cache hits amortize cost. Harbor’s queries hit different policy subsets; cache hit rate was under 12%, so stuffing did not pay off. Caching favors single-doc assistants, not multi-corpus search portals.

Rule of thumb: compute cost_per_query = input_tokens × input_rate + output_tokens × output_rate for stuff vs RAG at your real k and conversation depth. If stuff wins on cost, still run needle evals at that length before shipping.

Recall, faithfulness, and hybrid layouts

Long context does not guarantee the model uses every token equally. Lost-in-the-middle effects mean a relevant paragraph at position 40% of the prompt may be ignored while the model answers from parametric memory or the first chunk. RAG mitigates by keeping k small and placing the highest-ranked chunks adjacent to the question (sandwich layout).

Hybrid approaches Harbor Analytics uses today:

  • RAG-first, paste-second — retrieve 8 chunks; if the user pins one full policy, append that doc only when under a token cap.
  • Map-reduce for holistics — per-section summaries via map-reduce for executive overview; RAG for clause-level QA on the same corpus.
  • Compression bridgecontext compression when a single doc must fit but exceeds budget; not a substitute for retrieval across thousands of files.

Harbor Analytics refactor (worked example)

Corpus: 2,400 policies, 18M tokens, weekly updates on ~40 files. Golden set: 120 questions with gold chunk IDs and effective dates.

Baseline: long-context stuff (top-50 chunks, 128K window)

  • Chunk recall@50 (gold in pasted set): 88%
  • Answer accuracy when gold was in set: 66% (middle-placement failures)
  • Wrong-policy / superseded citations: 34%
  • Mean input tokens: 94,000
  • p95 latency: 8.4 s; mean cost per query: $0.71 (frontier model)

After: hybrid RAG (BM25 + dense, rerank top 12, freshness boost)

  • Chunk recall@12: 91% (better ranker, smaller k)
  • Answer accuracy: 66% → 87%
  • Wrong-policy citations: 34% → 9%
  • Mean input tokens: 6,200
  • p95 latency: 2.6 s; mean cost: $0.30 (−58%)
  • Index rebuild for weekly delta: ~4 min automated vs manual re-paste

They kept full-document paste for one workflow: merger due-diligence redlines on a single 80-page agreement under 32K tokens with lawyer-supervised output. Everything else routes through RAG.

Technique decision table

Approach Best when Weak when
Full-doc stuff One artifact; holistic read; fits in <30% of window; rare updates Needle QA mid-document; corpus > window; frequent revisions
Top-k stuff (no index) Quick prototype; tiny corpus; demos only Production QA; middle-buried answers; compliance audit trails
RAG (retrieve + rerank) Large corpus; per-query relevance; freshness metadata; citation logging Single-doc holistic summary spanning entire narrative arc
Map-reduce Executive summary over one very long doc; hierarchical collapse Low-latency chat; precise clause citation without chunk alignment
Context compression Must fit one doc; budget tight; accept summarization risk Numeric caps, dates, IDs that compression drops
RAG + cached doc prefix Same long doc; many questions; cache hit rate high Diverse corpus queries; low prefix reuse

Default for multi-document knowledge bases: RAG with rerank. Add stuffing only for pinned artifacts or sub-window holistics. Re-evaluate when model families change — longer windows shift capacity, not selection.

Common pitfalls

  • Window size as architecture — choosing models by context length alone without retrieval for corpora > 2× budget.
  • Embedding rank equals attention rank — pasting top-50 without cross-encoder rerank; middle chunks still wrong order.
  • Static mega-prompt — weekly policy changes invisible until someone manually rebuilds the paste.
  • RAG with giant k — retrieving 100 chunks into a 200K window recreates stuffing failures with extra latency.
  • No needle eval at production length — 4K demos do not predict 128K behavior.
  • Ignoring output budget — 120K input leaves little room for chain-of-thought and citations on some APIs.
  • One-size-fits-all routing — merger redlines and FAQ search need different pipelines under one product.

Production checklist

  • Measure corpus token count vs median window; if corpus > 3× window, plan RAG.
  • Build golden QA with gold chunk IDs and effective-date labels.
  • Run needle-in-haystack tests at your actual stuff length before choosing paste.
  • Compare cost per query: stuff tokens vs RAG tokens + retrieval infra.
  • Implement freshness metadata (effective_date, doc_version) on index rows.
  • Rerank before synthesis; sandwich highest chunks near the user question.
  • Log chunk IDs per answer for audit and debug.
  • Route single-doc holistics to paste or map-reduce; multi-doc search to RAG.
  • Track wrong-source citation rate separately from fluency scores.
  • Re-benchmark when upgrading models — marketing context length != flat recall.

Key takeaways

  • Long context solves capacity; RAG solves selection and freshness — most products need both concepts, not either/or.
  • Stuffing top-k chunks recreates lost-in-the-middle failures at higher cost.
  • Harbor Analytics cut wrong-policy answers 34%→9% and cost 58% with reranked RAG over 12 chunks.
  • Prompt caching favors repeated doc prefixes; diverse search rarely amortizes mega-pastes.
  • Route by workflow: holistic single-doc vs multi-corpus QA need different pipelines.

Related reading