Guide

LLM context compression explained

Harbor Legal's contract-review assistant ingested a 380-page SaaS master agreement — 412,000 tokens after OCR cleanup, nearly triple the 128K window of the production model. Naively truncating from page one dropped the indemnification and data-processing exhibits where liability actually lived. Stuffing the full PDF via a 1M-token preview model worked once but cost $4.80 per review and added 38 seconds of prefill latency. The refactor introduced a context compression pipeline: clause-aware chunking, embedding retrieval for the reviewer's question, abstractive summaries of non-matching sections, and a learned prompt compressor on the final bundle. Median input dropped to 18,400 tokens while attorney-rated recall on 40 red-flag clauses rose from 71% to 88%. Context compression is not “make the model smarter” — it is curating what enters the window so limited attention budget lands on signal.

Every production LLM app eventually hits the same wall: system prompts, tool schemas, chat history, retrieved documents, and the user message all compete for the same context window. Longer models help but raise cost and often degrade quality in the “lost in the middle” zone. Compression techniques — selection, summarization, pruning, and token-level squeezing — let you serve long corpora on standard windows. This guide covers the compression menu, token-budget accounting, hierarchical map-reduce for books and codebases, the Harbor Legal refactor, a technique decision table vs RAG and giant-context models, pitfalls, and a production checklist.

What context compression is

Context compression reduces the token footprint of information passed to an LLM while preserving task-relevant content. It operates at three layers:

  1. Selection — keep some chunks verbatim, drop the rest (retrieval, keyword filters, heading heuristics).
  2. Transformation — replace long passages with shorter representations (extractive highlights, abstractive summaries, structured outlines, tables of entities).
  3. Token pruning — remove low-information tokens from prompts without changing meaning (learned compressors like LLMLingua, whitespace normalization, deduplication of repeated boilerplate).

Compression differs from model training: weights stay fixed. You change the packaging of external knowledge and conversation state. The goal is higher information density per token — more correct answers per dollar and per second of prefill.

Token budget accounting

Before compressing, allocate the window explicitly. A typical 128K budget for a support agent might look like:

  • System + tools — 4–12K (persona, JSON schemas, few-shot examples)
  • Retrieved / compressed documents — 40–80K (largest flexible slice)
  • Conversation history — 8–24K (recent turns verbatim, older turns summarized)
  • User message — 1–8K
  • Reserved for output — 4–16K (never spend this on input)

Measure with the same tokenizer the provider bills on (often cl100k_base or model-specific BPE). Log input_tokens per stage in your pipeline so compression savings are auditable. If retrieved docs routinely exceed their slice, compression is mandatory — not an optimization.

Compression techniques

Extractive selection

Rank sentences or chunks by relevance to the query (BM25, dense embeddings, cross-encoder reranking) and pass the top-k verbatim. Preserves exact wording — critical for legal clauses, API signatures, and stack traces. Risk: misses context that only appears in discarded sections unless retrieval recall is high.

Abstractive summarization

An LLM (often a smaller, fast model) condenses each chunk or section into a bullet outline. See text summarization for extractive vs abstractive tradeoffs. Good for narrative reports, meeting notes, and background chapters. Risk: hallucinated details in summaries propagate to the final answer unless summaries are labeled as non-authoritative.

Hierarchical map-reduce

For corpora longer than one window: (1) chunk document into overlapping segments; (2) map — summarize each chunk independently; (3) reduce — merge chunk summaries into a global outline or answer; (4) optional refine pass with retrieve-and-read on cited sections. LangChain and LlamaIndex call this “map-reduce” or “refine” chains. Cost scales linearly with document length but stays within per-call limits.

Conversation and memory pruning

Multi-turn chats accumulate tokens. Strategies: sliding window (last N turns), summarization of older turns into a “state so far” block, or structured memory slots (user preferences, open tickets) instead of raw transcripts. Pair with agent memory tiers so compression does not erase facts the user already stated.

Learned prompt compression

Methods like LLMLingua and LongLLMLingua score token importance with a small language model and drop low-value tokens from prompts while keeping perplexity low on the task. Typical 2–5× compression on bloated prompts with modest quality loss on QA benchmarks. Apply after semantic selection — compressing irrelevant text still wastes attention.

Structured extraction instead of prose

Replace pages of logs with JSON event lists, tables of obligations, or AST snippets. A 2,000-line stack trace becomes 40 lines around the faulting frame. Domain-specific parsers often beat generic summarization for machine-generated input.

Compression vs RAG vs longer context

Approach Mechanism Best when Weak when
RAG (top-k retrieval) Embed query, fetch similar chunks verbatim Large corpus, question-specific lookup Answer requires synthesizing many distant sections
Hierarchical compression Map-reduce summaries + selective re-read Book-length docs, holistic summaries Need exact quoted language throughout
Prompt token pruning Drop low-importance tokens mechanically Already-relevant but verbose prompts Highly structured code or legal text
1M+ context models Stuff full document, pay prefill cost One-off deep reads, budget allows High volume, latency-sensitive, middle lost
Hybrid (RAG + compress) Retrieve, then summarize non-top hits Production Q&A over mixed corpora Team lacks eval harness for recall

Most mature systems combine layers: RAG for precision, summarization for breadth, pruning for boilerplate, and prompt caching on stable compressed prefixes to amortize prefill.

Harbor Legal contract review refactor

The MSA review flow before refactor: OCR entire PDF, attempt single-shot Q&A or truncate. After refactor:

  • Structure parse — detect headings, exhibits, and clause numbers; chunk on legal boundaries, not fixed token counts.
  • Index — embed each clause; store metadata (section type: liability, IP, termination, DPA).
  • Query routing — attorney question triggers hybrid retrieval (BM25 on defined terms + dense on semantic intent).
  • Bundle assembly — top-12 clauses verbatim (~9K tokens); next-30 as 2-sentence abstractive summaries (~4K); exhibit list as one-line stubs (~1K).
  • LLMLingua pass — 1.8× shrink on summaries only; never compress indemnity or limitation-of-liability verbatim blocks.
  • Answer + citations — model must cite clause IDs; UI links open full text from object storage, not from the compressed prompt.

Median input tokens fell 412K → 18.4K; p95 latency 38 s → 6.1 s. Attorney recall on the 40-clause red-flag set rose 71% → 88%; false omission rate on liability caps dropped from 19% to 6%. Cost per review fell roughly 85% vs the giant-context path.

Quality evaluation for compressed context

Compression that saves tokens but drops the one paragraph containing the answer is worse than no compression. Build eval sets with:

  • Needle-in-haystack probes — hide a unique fact in a long doc; measure whether compression retains it.
  • Expert-labeled Q&A — attorneys, analysts, or engineers score answers with and without compression.
  • Citation integrity — does the model quote clauses or lines that still exist in source files?
  • Ablation per stage — disable summarization, then pruning, then retrieval; attribute regressions to the right step.

Track compression ratio (original tokens / prompt tokens) and task F1 jointly. A 10× ratio with 5-point F1 loss may be acceptable; 2× ratio with 15-point loss is not.

Technique decision table

Approach Best when Skip when
Top-k RAG only FAQ, docs, code search with clear query intent Holistic doc understanding (executive summary)
Map-reduce summarization Reports, books, deposition transcripts Sub-20-page docs where RAG suffices
History summarization Long chat sessions, support tickets Short stateless turns
Learned token pruning Verbose retrieved bundles, repeated templates Code, JSON, legal definitions requiring exact tokens
Giant context + no compression Low-volume forensic review with budget Production scale or strict latency SLOs

Common pitfalls

  • Compress before retrieve — summarizing the whole doc once loses detail; retrieve first, compress only what did not rank.
  • Fixed-size chunking on structured docs — splits clauses mid-sentence; use domain boundaries (headings, functions, log sessions).
  • Summary treated as ground truth — model cites hallucinated summary facts; label summaries and require verbatim quotes for binding claims.
  • No output reservation — input fills 100% of window; completion truncates mid-sentence.
  • Compressing tool results — aggressive pruning on API JSON breaks field names; compress narrative, keep schemas intact.
  • Ignoring “lost in the middle” — even compressed prompts should place the highest-value chunks near the start and end of the context, not buried centrally.
  • Stale compressed caches — document updates but cached summary does not; version summaries with source etag or hash.

Production checklist

  • Define per-request token budget slices (system, docs, history, output).
  • Instrument token counts at ingest, post-retrieval, post-compression, and final prompt.
  • Chunk on semantic boundaries; store chunk metadata for citation links.
  • Combine retrieval (precision) with summarization (coverage) in a hybrid bundle.
  • Run map-reduce for corpora exceeding 2× the context window.
  • Summarize old chat turns; keep last 2–4 turns verbatim.
  • Apply learned pruning only to narrative text, never to code or legal quotes.
  • Build needle-in-haystack and expert Q&A evals; ablate each compression stage.
  • Cache stable compressed prefixes; invalidate on source document change.
  • Expose “view source” in UI so users verify beyond the compressed view.

Key takeaways

  • Context compression curates what enters the window — selection, transformation, and token pruning — without retraining the model.
  • Harbor Legal cut median contract-review input from 412K to 18.4K tokens while improving red-flag recall from 71% to 88%.
  • Hybrid pipelines (RAG + summarization + pruning) beat any single technique on long mixed corpora.
  • Allocate output tokens upfront; measure compression ratio and task quality together, not tokens alone.
  • Keep verbatim source accessible for citation — compression is a view, not a replacement for the original document.

Related reading