Guide

LLM contextual compression for RAG explained

Harbor Support's RAG bot retrieved the right runbooks, but each chunk was 600–900 tokens of full procedure text, boilerplate warnings, and unrelated troubleshooting branches. Stuffing the top-12 hits into a 32K context window produced a median stuffed context of 9,400 tokens, most of it noise. The generator cited step three from the wrong doc, blended two SAML guides, and scored 62% on attorney-reviewed faithfulness. Engineers added a contextual compression stage after hybrid retrieval: every candidate chunk was filtered and extracted with the user question in view before assembly. Median stuffed context fell 74% (9,400 to 2,450 tokens); faithfulness rose to 84%; p95 latency grew only 180 ms because compression ran on a small model in parallel.

Contextual compression is post-retrieval, pre-generation pruning: unlike indexing-time parent-child chunking or whole-document context compression, it operates on already retrieved passages and uses the live query to decide what survives. This guide covers embedding filters, LLM extractors, rerank-then-compress ordering, citation preservation, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Where contextual compression sits in RAG

A standard RAG pipeline is retrieve → stuff → generate. Retrieval optimizes recall (get relevant docs into the candidate pool). Reranking optimizes precision (order candidates). Neither guarantees that every token in a chunk helps answer this specific question. Long chunks improve recall at index time but punish generation: models lose signal in the middle, hallucinate when distracted by adjacent procedures, and burn budget that could hold more diverse sources.

Contextual compression adds a stage between reranking and assembly, giving four steps:

Retrieve — hybrid BM25 + dense (optionally HyDE) returns top-k chunks.
Rerank (optional) — cross-encoder scores query–passage pairs.
Compress contextually — per chunk, drop or extract only query-relevant spans.
Assemble and generate — pack compressed chunks within token budget; answer with citations.

The key word is contextual: the same chunk about “VPN troubleshooting” might compress to three sentences for a question about split-tunnel routes and to eight sentences for a question about certificate renewal errors.
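The four stages can be sketched as one function that takes each stage as a callable. This is a minimal illustration, not Harbor Support's code: `Chunk`, `run_pipeline`, and `keyword_compress` are hypothetical names, whitespace splitting stands in for a real tokenizer, and the keyword-overlap compressor is a toy stand-in for the techniques described in the next section.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Chunk:
    chunk_id: str
    text: str
    score: float = 0.0


def keyword_compress(query: str, chunk: Chunk) -> str:
    """Toy compressor (assumption): keep sentences that share a word with the query."""
    terms = {w.strip("?.,").lower() for w in query.split()}
    kept = [
        s for s in chunk.text.split(". ")
        if terms & {w.strip("?.,").lower() for w in s.split()}
    ]
    return ". ".join(kept)


def run_pipeline(
    query: str,
    retrieve: Callable[[str], List[Chunk]],
    rerank: Callable[[str, List[Chunk]], List[Chunk]],
    compress: Callable[[str, Chunk], str],
    budget_tokens: int,
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
) -> List[Chunk]:
    """Retrieve, rerank, compress per chunk with the query in view, then pack."""
    candidates = rerank(query, retrieve(query))
    packed, used = [], 0
    for chunk in candidates:
        kept = compress(query, chunk)
        if not kept:
            continue  # nothing query-relevant survived in this chunk
        cost = count_tokens(kept)
        if used + cost > budget_tokens:
            continue  # would exceed the hard budget
        packed.append(Chunk(chunk.chunk_id, kept, chunk.score))
        used += cost
    return packed
```

Calling `keyword_compress` on the same chunk with two different queries returns different sentences, which is the property that separates contextual compression from index-time chunking.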

Compression techniques

Embedding similarity filter

The cheapest filter embeds the user query and each sentence (or fixed sub-span) inside a retrieved chunk, keeps spans above a cosine threshold, and concatenates survivors. No extra LLM call; latency is one embedding batch per chunk. Works when chunks are structurally similar but only one paragraph matches. Weak when relevance requires cross-sentence reasoning (“unless clause B applies, ignore step 4”).
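A minimal sketch of the sentence-level filter follows. To stay self-contained it uses a bag-of-words `toy_embed` function as a stand-in for a real sentence encoder; that function, the regex sentence splitter, and the 0.2 threshold are all assumptions for illustration, and a production threshold would need tuning against the actual embedding model.

```python
import math
import re
from collections import Counter


def toy_embed(text):
    """Stand-in embedding (assumption): word counts. Swap in a real encoder."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def split_sentences(chunk):
    """Naive splitter on sentence-ending punctuation followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", chunk) if s.strip()]


def embedding_filter(query, chunk, threshold=0.2, embed=toy_embed):
    """Keep sentences whose similarity to the query meets the threshold."""
    q = embed(query)
    survivors = [s for s in split_sentences(chunk) if cosine(q, embed(s)) >= threshold]
    return " ".join(survivors)
```

Because each sentence is scored in isolation, a qualifier that only makes sense next to its neighbor will be scored low and dropped, which is the cross-sentence weakness noted above.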

LLM chain extractor

A small model receives the question plus the full chunk and returns only excerpts that help answer it — often with explicit “extract verbatim, do not paraphrase” instructions to preserve citations. Harbor Support used a 3B extractor on chunks already reranked by a cross-encoder; extractors ran concurrently with a 4-wide worker pool. Extraction prompts included the chunk ID and source URL so downstream citation assembly stayed stable.
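The extractor stage might look roughly like the sketch below. The prompt wording, the chunk dictionary keys (`chunk_id`, `url`, `text`), and the `call_model` callable are hypothetical; `call_model` stands in for whatever client reaches the small model, and the per-chunk output cap would be set on that client rather than here.

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical prompt; the chunk ID and source URL travel with the passage.
EXTRACT_PROMPT = (
    "Question: {question}\n"
    "Chunk ID: {chunk_id}\n"
    "Source: {url}\n"
    "Passage:\n{text}\n\n"
    "Copy verbatim the sentences that help answer the question. "
    "Do not paraphrase. Reply NONE if nothing is relevant."
)


def extract_one(question, chunk, call_model):
    """Run one extraction; return None when the model finds nothing relevant."""
    prompt = EXTRACT_PROMPT.format(
        question=question,
        chunk_id=chunk["chunk_id"],
        url=chunk["url"],
        text=chunk["text"],
    )
    reply = call_model(prompt).strip()
    if reply == "NONE":
        return None
    # Carry the ID and URL through so citation assembly stays stable.
    return {"chunk_id": chunk["chunk_id"], "url": chunk["url"], "text": reply}


def extract_all(question, chunks, call_model, workers=4):
    """Extract from all chunks concurrently; pool.map preserves input order."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda c: extract_one(question, c, call_model), chunks))
    return [r for r in results if r is not None]
```

Threads fit here because the work is I/O-bound waiting on model responses; the default of four workers mirrors the 4-wide pool described above.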

LLM relevance filter (drop whole chunks)

Before per-chunk extraction, a binary classifier pass can discard entire chunks: “Does this passage contain information needed to answer the question? Yes/No.” Cheaper than extract-then-stuff when many reranked hits are marginally related. Harbor dropped 38% of post-rerank chunks this way, saving extractor calls.
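A whole-chunk filter can be sketched as below. The `ask_model` callable and the prompt layout are assumptions. One design choice is made explicit: only a clear "No" drops a chunk, so an unparseable answer keeps the chunk and leaves the decision to the extractor. Whether Harbor's filter failed open this way is not stated in the source.

```python
FILTER_PROMPT = (
    "Question: {question}\n"
    "Passage:\n{text}\n\n"
    "Does this passage contain information needed to answer the question? "
    "Answer Yes or No."
)


def relevance_filter(question, chunks, ask_model):
    """Split chunks into (kept, dropped) using a binary model verdict."""
    kept, dropped = [], []
    for chunk in chunks:
        answer = ask_model(FILTER_PROMPT.format(question=question, text=chunk["text"]))
        words = answer.lower().split()
        verdict = words[0].strip(".,!:") if words else ""
        # Fail open: anything other than an explicit "no" is kept.
        (dropped if verdict == "no" else kept).append(chunk)
    return kept, dropped
```

Returning the dropped list alongside the survivors makes it easy to log dropped chunk IDs for offline evaluation.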

Learned prompt compressors

Tools like LLMLingua compress token sequences while conditioning on the query via a smaller language model that scores token salience. Higher compression ratios than sentence filtering; riskier for numeric tables and legal clauses where dropping a single token changes meaning. Best paired with structured sidecars for tables (see RAG table extraction).

Token-budget packing

After compression, a packer greedily adds chunks until a hard budget (e.g. 3,000 tokens) is reached, preferring higher rerank scores. Reorder for attention: place the strongest chunks at the start and end of the bundle, with weaker ones in the middle, to mitigate lost-in-the-middle effects.
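The packing and reordering steps can be illustrated with two small functions. This is a sketch under stated assumptions: chunks are dictionaries with hypothetical `score` and `text` keys, whitespace splitting approximates token counting, and the packer skips a chunk that does not fit and keeps trying smaller ones rather than stopping at the first overflow.

```python
def pack(chunks, budget, count_tokens=lambda t: len(t.split())):
    """Greedily select chunks by descending score without exceeding the budget."""
    chosen, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        cost = count_tokens(chunk["text"])
        if used + cost <= budget:
            chosen.append(chunk)
            used += cost
    return chosen  # still ordered best-first


def reorder_for_attention(ranked):
    """Alternate best-first chunks between the front and the back.

    The best chunk lands first, the second best lands last, and the
    weakest chunks end up in the middle of the bundle.
    """
    front, back = [], []
    for i, chunk in enumerate(ranked):
        (front if i % 2 == 0 else back).append(chunk)
    return front + back[::-1]
```

For five chunks ranked 1 to 5, `reorder_for_attention` yields the order 1, 3, 5, 4, 2.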

Harbor Support refactor

Before compression, the stack was: BM25 + dense fusion → cross-encoder rerank top-20 → stuff top-12 full chunks → GPT-4o answer. Problems were faithfulness (62%), cost (high input tokens), and occasional context overflow on multi-hop tickets.

After:

Rerank top-20 → LLM relevance filter → 12–14 survivors.
Parallel LLM extractors with 250-token cap per chunk output.
Embedding filter as fallback when extractor times out (2 s).
Citation map: compressed text carries [chunk_id:span] markers tied to original offsets for audit.
Generator prompt: “Answer only from COMPRESSED_CONTEXT; cite chunk_id.”
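One way to build the citation map is to locate each extracted excerpt in the original chunk and reject anything that is not a verbatim substring. The marker layout `[chunk_id:start-end]` and the function name `anchor_spans` are assumptions; the source only says that markers are tied to original offsets.

```python
def anchor_spans(chunk_id, original, excerpts):
    """Attach character offsets to verbatim excerpts; collect non-verbatim ones.

    Returns (anchored, rejected). An excerpt that cannot be found in the
    original text was paraphrased or invented and should not be cited.
    """
    anchored, rejected = [], []
    for excerpt in excerpts:
        start = original.find(excerpt)
        if start == -1:
            rejected.append(excerpt)
        else:
            end = start + len(excerpt)
            anchored.append(f"[{chunk_id}:{start}-{end}] {excerpt}")
    return anchored, rejected
```

The rejected list doubles as a cheap paraphrase detector: a rising rejection rate suggests the extractor is drifting away from verbatim copying.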

Faithfulness rose from 62% to 84%; median input tokens fell from 9,400 to 2,450; extractor timeouts ran at 3.1% with clean fallback. Wrong-doc blending tickets fell from 27% to 8%.
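The timeout-with-fallback behavior could be wired roughly as follows. This sketch assumes a single shared deadline for the whole batch, which may differ from a per-call timeout; `extract` and `fallback` are hypothetical callables taking `(question, chunk)`, and `cancel_futures` requires Python 3.9 or later.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def compress_with_fallback(question, chunks, extract, fallback, timeout_s=2.0, workers=4):
    """Run extractors in parallel; use the fallback for any chunk that times out.

    Returns a list of (method, result) pairs in the same order as `chunks`.
    """
    pool = ThreadPoolExecutor(max_workers=workers)
    futures = [pool.submit(extract, question, c) for c in chunks]
    deadline = time.monotonic() + timeout_s  # shared across the batch (assumption)
    results = []
    for chunk, future in zip(chunks, futures):
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results.append(("extractor", future.result(timeout=remaining)))
        except FutureTimeout:
            results.append(("fallback", fallback(question, chunk)))
    # Do not block on stragglers; drop queued work that never started.
    pool.shutdown(wait=False, cancel_futures=True)
    return results
```

Tagging each result with the method that produced it makes the timeout rate directly measurable from logs. A timed-out extractor thread is not killed and keeps running in the background until its call returns, so the model client should carry its own request timeout as well.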

Technique decision table

Technique	Strengths	Weaknesses	Use when
Contextual compression (this guide)	Query-specific noise removal; keeps diverse sources; works on existing index	Extra latency/cost; extraction can drop qualifiers	Large chunks; high recall retrieval; faithfulness issues from noise
Cross-encoder rerank only	Strong ordering; single model call	Full chunks still stuffed; middle noise remains	Small chunks (<200 tokens); tight latency budget
Parent-child / sentence-window chunking	Precision at index time; less post-processing	Reindex cost; parent context may still be long	Greenfield corpus; stable chunking policy
Bigger context window only	Simplest pipeline	Lost-in-the-middle; cost scales linearly with input tokens	Prototypes; under 8K tokens of total retrieved text
Map-reduce over retrieved set	Handles very large total retrieved mass	Multi-pass latency; synthesis drift	Analytical questions spanning many docs
Embedding filter only	Fast; no LLM	Misses logical qualifiers; weak on tables	Latency-critical; chunks with clear sentence boundaries

Common pitfalls

Compressing before reranking — extracting from irrelevant chunks wastes compute; rerank or filter first.
Paraphrasing in extractors — changes numbers, negation, and legal wording; instruct verbatim extraction with ellipses.
Dropping negation and exceptions — “except on legacy SSO v1” is often one sentence; salience models skip it.
No citation anchor — compressed text without chunk_id/offset cannot be audited; users lose trust.
Same budget for all queries — simple FAQs need less context than multi-step incident triage; route by query class.
Serial per-chunk LLM calls — p95 latency explodes; batch or parallelize with a small model.
Confusing with tool-result compression — agent observations are a different layer; see tool result compression for ReAct loops.

Production checklist

Define hard token budget for COMPRESSED_CONTEXT separate from system/history.
Rerank or relevance-filter before per-chunk extraction.
Use verbatim-extract prompts with max output tokens per chunk.
Preserve chunk_id and source URL through compression for citations.
Parallelize extractors; set timeout with embedding-filter fallback.
Reorder packed chunks: highest-scoring chunks at the start and end of the context, weakest in the middle.
Log pre/post token counts and dropped chunk IDs for offline eval.
A/B faithfulness and citation accuracy vs stuff-full-chunks baseline.
Route table-heavy queries to structured retrieval, not prose extractors.
Monitor extractor hallucination rate on held-out QA with numeric answers.
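For the logging item, a per-query record might be assembled as in this sketch. The field names and the `compression_log` function are hypothetical, and whitespace splitting again stands in for the generator's real tokenizer.

```python
def compression_log(query_id, before, after, count_tokens=lambda t: len(t.split())):
    """Build a log record from {chunk_id: text} maps taken before and after compression."""
    pre = sum(count_tokens(t) for t in before.values())
    post = sum(count_tokens(t) for t in after.values())
    return {
        "query_id": query_id,
        "pre_tokens": pre,
        "post_tokens": post,
        # Fraction of tokens removed; 0.0 when there was nothing to compress.
        "reduction": round(1 - post / pre, 3) if pre else 0.0,
        "dropped_chunk_ids": sorted(set(before) - set(after)),
    }
```

Applied to the Harbor medians, the same formula gives 1 - 2,450 / 9,400 ≈ 0.74, matching the 74% reduction reported earlier.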

Key takeaways

Contextual compression removes query-irrelevant tokens from retrieved chunks before generation — after recall, before stuff.
Combine reranking, relevance filters, and LLM extractors; parallelize on a small model to control latency.
Harbor Support cut stuffed context 74% and raised faithfulness from 62% to 84% with verbatim extractors and citation anchors.
Do not paraphrase during extraction — preserve numbers, negation, and chunk IDs for audit.
Pair with parent-child chunking at index time for greenfield corpora; use contextual compression when reindexing is expensive.