LLM contextual compression for RAG explained
Harbor Support's RAG bot retrieved the right runbooks, but each chunk was 600–900 tokens of full procedure text, boilerplate warnings, and unrelated troubleshooting branches. Stuffing the top-12 hits into a 32K context window produced a median of 9,400 context tokens, much of it irrelevant to the question at hand. The generator cited step three from the wrong doc, blended two SAML guides, and scored 62% on attorney-reviewed faithfulness. Engineers added a contextual compression stage after hybrid retrieval: every candidate chunk was filtered and extracted with the user question in view before assembly. Median stuffed context fell 74% (9,400 to 2,450 tokens); faithfulness rose to 84%; p95 latency grew only 180 ms because compression ran on a small model in parallel.
Contextual compression is post-retrieval, pre-generation pruning: unlike indexing-time parent-child chunking or whole-document context compression, it operates on already retrieved passages and uses the live query to decide what survives. This guide covers embedding filters, LLM extractors, rerank-then-compress ordering, citation preservation, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
Where contextual compression sits in RAG
A standard RAG pipeline is retrieve → stuff → generate. Retrieval optimizes recall (get relevant docs into the candidate pool). Reranking optimizes precision (order candidates). Neither guarantees that every token in a chunk helps answer this specific question. Long chunks improve recall at index time but punish generation: models lose signal in the middle, hallucinate when distracted by adjacent procedures, and burn budget that could hold more diverse sources.
Contextual compression adds a stage between reranking and assembly, giving a four-step pipeline:
- Retrieve — hybrid BM25 + dense (optionally HyDE) returns top-k chunks.
- Rerank (optional) — cross-encoder scores query–passage pairs.
- Compress contextually — per chunk, drop or extract only query-relevant spans.
- Assemble and generate — pack compressed chunks within token budget; answer with citations.
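The four steps above can be wired together as plain function composition. The following is a minimal sketch, not Harbor Support's implementation: `answer_with_compression` and its callable parameters (`retrieve`, `rerank`, `compress`, `generate`) are hypothetical names, chunks are assumed to be dicts with `id` and `text` keys, and token cost is approximated by whitespace word count.

```python
from typing import Callable


def answer_with_compression(
    query: str,
    retrieve: Callable[[str, int], list[dict]],
    rerank: Callable[[str, list[dict]], list[dict]],
    compress: Callable[[str, str], str],
    generate: Callable[[str, list[dict]], object],
    top_k: int = 20,
    budget_tokens: int = 3000,
):
    """Retrieve -> rerank -> compress per chunk -> assemble -> generate."""
    candidates = retrieve(query, top_k)
    ranked = rerank(query, candidates)

    # Compress each chunk with the live query in view; drop empty results.
    compressed = []
    for chunk in ranked:
        text = compress(query, chunk["text"])
        if text:
            compressed.append({"id": chunk["id"], "text": text})

    # Assemble in rank order until the budget would be exceeded.
    context, used = [], 0
    for chunk in compressed:
        cost = len(chunk["text"].split())  # crude token estimate (assumption)
        if used + cost > budget_tokens:
            break
        context.append(chunk)
        used += cost

    return generate(query, context)
```

Because every stage is injected, the compressor can be swapped (embedding filter, LLM extractor, or both) without touching retrieval or generation.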
The key word is contextual: the same chunk about “VPN troubleshooting” might compress to three sentences for a question about “split-tunnel routes” and to eight sentences for a question about “certificate renewal errors.”
Compression techniques
Embedding similarity filter
The cheapest filter embeds the user query and each sentence (or fixed sub-span) inside a retrieved chunk, keeps spans above a cosine threshold, and concatenates survivors. No extra LLM call; latency is one embedding batch per chunk. Works when chunks are structurally similar but only one paragraph matches. Weak when relevance requires cross-sentence reasoning (“unless clause B applies, ignore step 4”).
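A minimal sketch of the sentence-level filter follows. To stay self-contained it uses a toy bag-of-words vector in place of a real embedding model, so `toy_embed`, the regex sentence splitter, and the 0.2 threshold are all illustrative assumptions; in practice the embedding call and the threshold would come from your own stack and tuning.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: lowercase word counts (assumption)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def split_sentences(chunk: str) -> list[str]:
    """Naive splitter on ., !, ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]


def embedding_filter(query: str, chunk: str, threshold: float = 0.2) -> str:
    """Keep sentences whose similarity to the query meets the threshold."""
    q = toy_embed(query)
    kept = [s for s in split_sentences(chunk) if cosine(q, toy_embed(s)) >= threshold]
    return " ".join(kept)  # survivors stay in original order


chunk = (
    "Renew the VPN certificate before it expires. "
    "The cafeteria opens at nine. "
    "Certificate renewal errors usually mean the clock is skewed."
)
print(embedding_filter("How do I fix certificate renewal errors?", chunk))
```

Each sentence is scored independently, which is exactly why this filter misses cross-sentence qualifiers: a sentence such as “unless clause B applies” shares few terms with the query and falls below the threshold.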
LLM chain extractor
A small model receives the question plus the full chunk and returns only excerpts that help answer it — often with explicit “extract verbatim, do not paraphrase” instructions to preserve citations. Harbor Support used a 3B extractor on chunks already reranked by a cross-encoder; extractors ran concurrently with a 4-wide worker pool. Extraction prompts included the chunk ID and source URL so downstream citation assembly stayed stable.
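One way to enforce the “extract verbatim” instruction is to check each returned excerpt against the original passage and discard anything that is not a literal substring. The sketch below assumes a hypothetical `llm` callable that takes a prompt string and returns the model's text; the prompt wording and the `NONE` sentinel are illustrative, not the prompts Harbor Support used.

```python
from typing import Callable

EXTRACT_PROMPT = """You extract evidence for a question.
Question: {question}
Chunk ID: {chunk_id}
Source: {source_url}
Passage:
{passage}

Copy the sentences that help answer the question verbatim, one per line.
Do not paraphrase. If nothing is relevant, reply NONE."""


def extract_verbatim(
    question: str,
    chunk_id: str,
    source_url: str,
    passage: str,
    llm: Callable[[str], str],
) -> list[str]:
    """Ask the model for excerpts, then keep only exact substrings of the passage."""
    prompt = EXTRACT_PROMPT.format(
        question=question, chunk_id=chunk_id, source_url=source_url, passage=passage
    )
    reply = llm(prompt).strip()
    if reply == "NONE":
        return []
    excerpts = [line.strip() for line in reply.splitlines() if line.strip()]
    # Paraphrased or invented lines fail the substring check and are dropped.
    return [e for e in excerpts if e in passage]
```

The substring check is deliberately strict: it rejects excerpts the model has reworded even slightly, which trades some recall for the guarantee that every surviving span can be located in the source chunk.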
LLM relevance filter (drop whole chunks)
Before per-chunk extraction, a binary classifier pass can discard entire chunks: “Does this passage contain information needed to answer the question? Yes/No.” Cheaper than extract-then-stuff when many reranked hits are marginally related. Harbor dropped 38% of post-rerank chunks this way, saving extractor calls.
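The yes/no pass can be sketched as a thin wrapper around any text classifier. Here `classify` is a hypothetical callable (prompt in, reply text out), and chunks are assumed to be dicts with `id` and `text` keys. Keeping a chunk unless the reply clearly starts with “no” is a design assumption made here to protect recall when the model returns something malformed.

```python
from typing import Callable

FILTER_PROMPT = (
    "Question: {question}\n"
    "Passage:\n{passage}\n\n"
    "Does this passage contain information needed to answer the question? "
    "Answer Yes or No."
)


def relevance_filter(
    question: str,
    chunks: list[dict],
    classify: Callable[[str], str],
) -> tuple[list[dict], list[dict]]:
    """Split chunks into (kept, dropped) using a binary relevance judgment."""
    kept, dropped = [], []
    for chunk in chunks:
        reply = classify(FILTER_PROMPT.format(question=question, passage=chunk["text"]))
        # Drop only on a clear "no"; ambiguous replies keep the chunk.
        if reply.strip().lower().startswith("no"):
            dropped.append(chunk)
        else:
            kept.append(chunk)
    return kept, dropped
```

Returning the dropped list alongside the survivors makes it easy to log dropped chunk IDs for offline evaluation.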
Learned prompt compressors
Tools like LLMLingua compress token sequences while conditioning on the query via a smaller language model that scores token salience. Higher compression ratios than sentence filtering; riskier for numeric tables and legal clauses where dropping a single token changes meaning. Best paired with structured sidecars for tables (see RAG table extraction).
Token-budget packing
After compression, a packer greedily adds chunks until a hard budget (e.g. 3,000 tokens) is reached, preferring higher rerank scores. Reorder for attention: place the strongest chunks at the start and end of the bundle, leaving weaker ones in the middle, to mitigate lost-in-the-middle effects.
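Packing and reordering can be sketched in a few lines. The `Compressed` dataclass, the whitespace-based `count_tokens`, and the alternating front/back placement are assumptions chosen for illustration; a real packer would use the generator model's tokenizer. This variant also skips a chunk that does not fit and keeps trying smaller ones, which is one of several reasonable greedy policies.

```python
from dataclasses import dataclass


@dataclass
class Compressed:
    chunk_id: str
    text: str
    score: float  # rerank score; higher is better


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace word count (assumption)."""
    return len(text.split())


def pack(chunks: list[Compressed], budget: int) -> list[Compressed]:
    """Greedy packing by descending score; chunks that do not fit are skipped."""
    picked, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c.score, reverse=True):
        cost = count_tokens(chunk.text)
        if used + cost <= budget:
            picked.append(chunk)
            used += cost
    return picked


def reorder_for_attention(ranked: list[Compressed]) -> list[Compressed]:
    """Alternate best-first chunks between front and back so the weakest land mid-context."""
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]


chunks = [
    Compressed("a", "one two three", 0.9),
    Compressed("b", "one two three four", 0.8),
    Compressed("c", "one two", 0.7),
    Compressed("d", "one", 0.6),
]
bundle = reorder_for_attention(pack(chunks, budget=6))
print([c.chunk_id for c in bundle])
```

With five chunks ranked 1 to 5, the reordering yields 1, 3, 5, 4, 2: the top-ranked chunk opens the context, the second-ranked closes it, and the lowest-ranked sits in the middle.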
Harbor Support refactor
Before compression, the stack was: BM25 + dense fusion → cross-encoder rerank top-20 → stuff top-12 full chunks → GPT-4o answer. Problems were faithfulness (62%), cost (high input tokens), and occasional context overflow on multi-hop tickets.
After:
- Rerank top-20 → LLM relevance filter → 12–14 survivors.
- Parallel LLM extractors with 250-token cap per chunk output.
- Embedding filter as fallback when extractor times out (2 s).
- Citation map: compressed text carries [chunk_id:span] markers tied to original offsets for audit.
- Generator prompt: “Answer only from COMPRESSED_CONTEXT; cite chunk_id.”
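The parallel-extractor-with-fallback step in the list above might look like the following sketch, built on the standard library's `concurrent.futures`. The names `compress_all`, `extract`, and `fallback` are hypothetical, and the code applies one shared deadline to the whole batch rather than a per-call timer, which is a simplifying assumption. It requires Python 3.9 or later for `cancel_futures`.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout
from typing import Callable


def compress_all(
    question: str,
    chunks: list[dict],
    extract: Callable[[str, dict], str],
    fallback: Callable[[str, dict], str],
    timeout_s: float = 2.0,
    workers: int = 4,
) -> dict[str, str]:
    """Run extractors concurrently; use the fallback for any that miss the deadline."""
    results: dict[str, str] = {}
    pool = ThreadPoolExecutor(max_workers=workers)
    try:
        futures = {c["id"]: pool.submit(extract, question, c) for c in chunks}
        deadline = time.monotonic() + timeout_s  # shared by the whole batch
        for chunk in chunks:
            remaining = max(0.0, deadline - time.monotonic())
            try:
                results[chunk["id"]] = futures[chunk["id"]].result(timeout=remaining)
            except FutureTimeout:
                # Extractor too slow: degrade to the cheaper filter.
                results[chunk["id"]] = fallback(question, chunk)
    finally:
        # Do not block on stragglers; cancel work that has not started yet.
        pool.shutdown(wait=False, cancel_futures=True)
    return results
```

Threads cannot be interrupted once running, so a timed-out extractor call still finishes in the background and its result is discarded. If the extractor is a remote model call, setting a client-side request timeout as well avoids paying for work that will never be used.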
Results: faithfulness rose from 62% to 84%; median input tokens fell from 9,400 to 2,450; extractor timeouts ran at 3.1% with clean fallback. Tickets showing wrong-doc blending fell from 27% to 8%.
Technique decision table
| Technique | Strengths | Weaknesses | Use when |
|---|---|---|---|
| Contextual compression (this guide) | Query-specific noise removal; keeps diverse sources; works on existing index | Extra latency/cost; extraction can drop qualifiers | Large chunks; high recall retrieval; faithfulness issues from noise |
| Cross-encoder rerank only | Strong ordering; single model call | Full chunks still stuffed; middle noise remains | Small chunks (<200 tokens); tight latency budget |
| Parent-child / sentence-window chunking | Precision at index time; less post-processing | Reindex cost; parent context may still be long | Greenfield corpus; stable chunking policy |
| Bigger context window only | Simplest pipeline | Lost-in-the-middle; cost scales linearly | Prototypes; <8K tokens of total retrieved text |
| Map-reduce over retrieved set | Handles very large total retrieved mass | Multi-pass latency; synthesis drift | Analytical questions spanning many docs |
| Embedding filter only | Fast; no LLM | Misses logical qualifiers; weak on tables | Latency-critical; chunks with clear sentence boundaries |
Common pitfalls
- Compressing before reranking — extracting from irrelevant chunks wastes compute; rerank or filter first.
- Paraphrasing in extractors — changes numbers, negation, and legal wording; instruct verbatim extraction with ellipses.
- Dropping negation and exceptions — “except on legacy SSO v1” is often one sentence; salience models skip it.
- No citation anchor — compressed text without chunk_id/offset cannot be audited; users lose trust.
- Same budget for all queries — simple FAQs need less context than multi-step incident triage; route by query class.
- Serial per-chunk LLM calls — p95 latency explodes; batch or parallelize with a small model.
- Confusing with tool-result compression — agent observations are a different layer; see tool result compression for ReAct loops.
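To make the “no citation anchor” pitfall concrete, the sketch below attaches a marker with character offsets to each verbatim excerpt and shows how an auditor could resolve it back to the source. The `[chunk_id:start-end]` layout is an assumed encoding of the span; the guide does not specify the exact marker format, and `anchor_excerpt` and `resolve_anchor` are hypothetical helper names.

```python
import re
from typing import Optional

MARKER = re.compile(r"\[(?P<id>[^\]]+):(?P<start>\d+)-(?P<end>\d+)\]")


def anchor_excerpt(chunk_id: str, passage: str, excerpt: str) -> Optional[str]:
    """Prefix a verbatim excerpt with its chunk ID and character offsets."""
    start = passage.find(excerpt)
    if start == -1:
        return None  # not verbatim, so it cannot be anchored
    end = start + len(excerpt)
    return f"[{chunk_id}:{start}-{end}] {excerpt}"


def resolve_anchor(anchored: str, passages: dict[str, str]) -> Optional[str]:
    """Look up the original span that a marker points to, for audit."""
    match = MARKER.match(anchored)
    if match is None or match["id"] not in passages:
        return None
    return passages[match["id"]][int(match["start"]):int(match["end"])]


passage = "Restart the agent. Certificates expire after 90 days."
anchored = anchor_excerpt("kb-12", passage, "Certificates expire after 90 days.")
print(anchored)
print(resolve_anchor(anchored, {"kb-12": passage}))
```

Offsets only stay valid while the indexed chunk text is unchanged, so storing a content hash or version next to each chunk ID is worth considering if the corpus is re-ingested.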
Production checklist
- Define hard token budget for COMPRESSED_CONTEXT separate from system/history.
- Rerank or relevance-filter before per-chunk extraction.
- Use verbatim-extract prompts with max output tokens per chunk.
- Preserve chunk_id and source URL through compression for citations.
- Parallelize extractors; set timeout with embedding-filter fallback.
- Reorder packed chunks: highest-scoring chunks at start and end of context.
- Log pre/post token counts and dropped chunk IDs for offline eval.
- A/B faithfulness and citation accuracy vs stuff-full-chunks baseline.
- Route table-heavy queries to structured retrieval, not prose extractors.
- Monitor extractor hallucination rate on held-out QA with numeric answers.
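The logging item in the checklist can be covered by one small record per query. The field names below and the whitespace-based token count are assumptions for illustration; any schema works as long as it captures pre/post token counts and the IDs of dropped chunks.

```python
import json


def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: whitespace word count (assumption)."""
    return len(text.split())


def compression_record(query_id: str, before: dict[str, str], after: dict[str, str]) -> dict:
    """Summarize one query's compression for offline evaluation.

    `before` maps chunk_id -> full chunk text; `after` maps chunk_id ->
    compressed text for the chunks that survived.
    """
    pre = sum(count_tokens(t) for t in before.values())
    post = sum(count_tokens(t) for t in after.values())
    return {
        "query_id": query_id,
        "pre_tokens": pre,
        "post_tokens": post,
        "reduction": round(1 - post / pre, 3) if pre else 0.0,
        "dropped_chunk_ids": sorted(set(before) - set(after)),
    }


record = compression_record(
    "q-001",
    before={"a": "one two three four", "b": "five six"},
    after={"a": "one two"},
)
print(json.dumps(record))
```

Applied to the Harbor Support medians, the same formula gives 1 − 2,450 / 9,400 ≈ 0.74, the 74% reduction quoted above.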
Key takeaways
- Contextual compression removes query-irrelevant tokens from retrieved chunks before generation — after recall, before stuff.
- Combine reranking, relevance filters, and LLM extractors; parallelize on a small model to control latency.
- Harbor Support cut stuffed context 74% and raised faithfulness from 62% to 84% with verbatim extractors and citation anchors.
- Do not paraphrase during extraction — preserve numbers, negation, and chunk IDs for audit.
- Pair with parent-child chunking at index time for greenfield corpora; use contextual compression when reindexing is expensive.
Related reading
- RAG retrieval-augmented generation explained — end-to-end retrieve-then-generate architecture
- LLM reranking explained — cross-encoder precision before compression
- LLM lost in the middle explained — why noisy long context hurts answers
- LLM context compression explained — whole-document pipelines vs per-chunk RAG compression