Guide
LLM RAG document deduplication explained
Harbor Legal’s contract RAG bot answered “What is the liability cap in the MSA template?” with five citations — all pointing to slightly different chunks of the same indemnity clause copied across twelve client addenda. Retrieval recall looked excellent; context diversity was terrible. The model averaged conflicting effective dates because four near-identical passages crowded out the one schedule that actually listed caps by tier. On a 200-query eval set, 38% of top-5 result lists contained at least two chunks with cosine similarity > 0.94 to each other. After adding document and chunk deduplication at ingest and a lightweight post-retrieval diversity pass, redundant hits dropped to 9% and answer accuracy on cap questions rose from 71% to 86% without changing the embedding model.
Deduplication removes or collapses redundant units in a RAG corpus — exact duplicates (same hash), near-duplicates (boilerplate clauses, press-release syndication, wiki mirrors), and overlapping chunks from aggressive splitting. It is distinct from reranking (which reorders) and from cross-encoder relevance (which scores query–passage fit). Clean indexes waste fewer tokens, reduce contradictory evidence in the context window, and make recall@k metrics more meaningful. This guide covers dedup layers, MinHash and embedding similarity, ingest vs query-time strategies, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist — complementing chunking strategy and incremental index updates.
Why duplicates hurt RAG
Vector search returns the k nearest neighbors in embedding space. When a corpus contains many copies of the same paragraph — template MSAs, changelog entries reposted to docs and Confluence, scraped news wires — those copies cluster tightly. A query about liability caps may retrieve five variants of clause 9.2 and zero rows from the fee schedule.
- Context waste — LLM input tokens spent on redundant evidence; less room for diverse supporting passages.
- False confidence — five citations to the same text make answers look well-sourced when the evidence is thin.
- Contradiction risk — near-duplicates often differ in dates, amounts, or jurisdiction footnotes; the model may blend them.
- Inflated recall — eval sets count a hit if any duplicate matches; dedup exposes whether you truly cover distinct facts.
- Index bloat — more vectors to store, slower ANN builds, higher embedding ingest cost.
Deduplication is hygiene, not a substitute for good embedding model choice or chunk boundaries. It ensures each slot in top-k carries incremental information.
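The redundancy described above can be measured before any dedup work begins. The sketch below computes the share of top-k result lists that contain at least one pair of chunks whose embedding cosine exceeds a threshold, the same style of measurement as the 38% figure in the introduction. The function names and the toy vectors are illustrative assumptions; in practice the vectors would come from your retrieval layer.

```python
import math
from itertools import combinations


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def has_redundant_pair(vectors, tau=0.94):
    """True if any two retrieved chunks are closer than tau to each other."""
    return any(cosine(a, b) > tau for a, b in combinations(vectors, 2))


def redundant_rate(result_lists, tau=0.94):
    """Fraction of result lists containing at least one near-identical pair."""
    if not result_lists:
        return 0.0
    flagged = sum(has_redundant_pair(vectors, tau) for vectors in result_lists)
    return flagged / len(result_lists)


# Toy example: the first list holds two near-identical hits, the second is diverse.
example_lists = [
    [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]],
    [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]],
]
print(redundant_rate(example_lists))
```

Tracking this number per eval run gives a baseline to compare against after each dedup change.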
Dedup layers: exact, near, and overlap
1. Exact duplicate detection
Normalize text (Unicode NFKC, lowercase, collapse whitespace), hash with SHA-256 or xxHash, and keep one canonical copy per hash. Fast, deterministic, catches copy-paste and unchanged template blocks. Store canonical_id and duplicate_of metadata for audit.
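A minimal sketch of this stage, using only the Python standard library. It applies the normalization steps named above, hashes with SHA-256, and annotates each chunk with canonical_id and duplicate_of. The chunk dictionary shape (id, text) and the "first seen wins" rule are assumptions made for illustration.

```python
import hashlib
import re
import unicodedata


def normalize(text):
    """Unicode NFKC, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text):
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def mark_exact_duplicates(chunks):
    """Annotate chunks with canonical_id and duplicate_of.

    chunks: list of dicts with "id" and "text" keys (hypothetical shape).
    The first chunk seen for a given hash becomes the canonical copy.
    """
    first_seen = {}
    annotated = []
    for chunk in chunks:
        digest = content_hash(chunk["text"])
        canonical = first_seen.setdefault(digest, chunk["id"])
        annotated.append({
            **chunk,
            "canonical_id": canonical,
            "duplicate_of": None if canonical == chunk["id"] else canonical,
        })
    return annotated
```

Rows with a non-empty duplicate_of can be skipped at embedding time while remaining in the audit log.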
2. Near-duplicate detection
Catches paraphrases, minor edits, and syndicated content. Common approaches:
| Method | How it works | Best for |
|---|---|---|
| MinHash + LSH | Jaccard similarity on character or word shingles | Large text corpora, web crawl dedup |
| SimHash | Hamming distance on fingerprint bits | Near-duplicate news, legal boilerplate |
| Embedding cosine | Pairwise or ANN search above threshold τ | Semantic paraphrase, multilingual near-dup |
| Edit distance / fuzzy match | Levenshtein on normalized spans | Short strings, SKU tables, log lines |
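To make the first row of the table concrete, here is a small pure-Python sketch of MinHash signatures with LSH banding. It is an illustration under stated assumptions, not a production implementation: word shingles of size 5, 128 hash functions, 32 bands, and the helper names are all choices made for this example, and a maintained library would normally replace the hand-rolled hashing.

```python
import hashlib
import random
from collections import defaultdict
from itertools import combinations

_PRIME = (1 << 61) - 1  # modulus for the (a*h + b) mod p hash family


def word_shingles(text, size=5):
    """Set of overlapping word n-grams; short texts yield a single shingle."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)} if words else set()
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a, b):
    """Exact Jaccard similarity, useful for sanity-checking the estimate."""
    return len(a & b) / len(a | b) if (a or b) else 1.0


def make_hash_params(num_perm=128, seed=7):
    """Deterministic (a, b) pairs, one per simulated permutation."""
    rng = random.Random(seed)
    return [(rng.randrange(1, _PRIME), rng.randrange(_PRIME)) for _ in range(num_perm)]


def minhash_signature(shingle_set, params):
    """Minimum hashed value per permutation over all shingles."""
    base = [
        int.from_bytes(hashlib.blake2b(s.encode("utf-8"), digest_size=8).digest(), "big")
        for s in shingle_set
    ]
    if not base:
        return [_PRIME] * len(params)  # sentinel signature for empty text
    return [min((a * h + b) % _PRIME for h in base) for a, b in params]


def estimate_jaccard(sig_a, sig_b):
    """Fraction of matching positions approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)


def lsh_candidate_pairs(signatures, bands=32):
    """Pairs of ids whose signatures agree on every row of at least one band."""
    sig_len = len(next(iter(signatures.values())))
    rows = sig_len // bands
    buckets = defaultdict(list)
    for chunk_id, sig in signatures.items():
        for band in range(bands):
            key = (band, tuple(sig[band * rows:(band + 1) * rows]))
            buckets[key].append(chunk_id)
    pairs = set()
    for ids in buckets.values():
        pairs.update(tuple(sorted(p)) for p in combinations(ids, 2))
    return pairs
```

Candidate pairs from the LSH step are only suspects; each should still be confirmed against the chosen Jaccard threshold (estimated or exact) before any merge.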
3. Chunk overlap dedup
Sliding-window chunking creates overlapping segments; adjacent chunks often share 80% of tokens. Options: increase stride so overlap is informational not redundant; mark parent document and dedupe at retrieval by doc_id; or use late chunking to embed once and slice with less redundancy.
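The relationship between stride and redundancy is simple arithmetic: with a fixed window, adjacent chunks share roughly 1 − stride/window of their tokens. The sketch below shows a basic sliding-window splitter and that overlap formula. The function names and the window/stride values are assumptions for illustration; late chunking is not shown.

```python
def sliding_chunks(tokens, window, stride):
    """Split a token list into windows of `window` tokens, advancing by `stride`."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    if not tokens:
        return []
    chunks = []
    start = 0
    while True:
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the document
        start += stride
    return chunks


def overlap_fraction(window, stride):
    """Share of tokens that two adjacent full windows have in common."""
    return max(0.0, 1.0 - stride / window)


# A 200-token window advanced by 40 tokens repeats about 80% of each chunk;
# advancing by 150 tokens cuts the repeated share to about 25%.
print(overlap_fraction(200, 40))
print(overlap_fraction(200, 150))
```

Raising the stride reduces both the chunk count and the number of near-identical neighbors competing for top-k slots.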
Ingest-time vs query-time dedup
Ingest-time (index hygiene)
Run dedup before embedding and upsert. Collapse duplicates to a canonical chunk; attach aliases in metadata (also_seen_in: [doc_b, doc_c]). Pros: smaller index, faster search, no runtime cost. Cons: must re-run on corpus updates; risk of over-merging distinct clauses that share boilerplate.
Query-time (post-retrieval diversity)
Retrieve k′ > k (e.g. 20), then filter: greedy maximal marginal relevance (MMR), dedupe by doc_id, or drop any candidate with cosine > τ to an already-selected chunk. Pros: tunable per query, safe for exploratory search. Cons: extra latency, does not shrink index storage.
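Two of the filters mentioned above can be sketched in a few lines. The first drops any candidate whose cosine to an already-selected chunk exceeds τ; the second is a greedy MMR selection. Both assume candidates arrive as (chunk_id, vector) pairs sorted best-first; the function names, the λ value, and the toy vectors are illustrative assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drop_near_duplicates(candidates, k, tau=0.92):
    """Keep up to k candidates, skipping any within tau of one already kept."""
    selected = []
    for chunk_id, vec in candidates:
        if all(cosine(vec, kept) <= tau for _, kept in selected):
            selected.append((chunk_id, vec))
        if len(selected) == k:
            break
    return [chunk_id for chunk_id, _ in selected]


def mmr(query_vec, candidates, k, lam=0.5):
    """Greedy MMR: trade query relevance against similarity to picks so far."""
    remaining = list(candidates)
    selected = []

    def score(item):
        _, vec = item
        relevance = cosine(query_vec, vec)
        redundancy = max((cosine(vec, s) for _, s in selected), default=0.0)
        return lam * relevance - (1.0 - lam) * redundancy

    while remaining and len(selected) < k:
        best = max(remaining, key=score)
        selected.append(best)
        remaining.remove(best)
    return [chunk_id for chunk_id, _ in selected]


# Toy candidates: "a" and "a2" are near-identical, "c" is distinct but relevant.
query = [1.0, 0.0, 0.0]
candidates = [
    ("a", [0.95, 0.312, 0.0]),
    ("a2", [0.94, 0.34, 0.02]),
    ("c", [0.85, -0.527, 0.0]),
]
print(drop_near_duplicates(candidates, k=2))
print(mmr(query, candidates, k=2))
```

The threshold filter is cheaper and easier to reason about; MMR is the better fit when you want a tunable trade-off rather than a hard cutoff.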
Production stacks usually combine both: ingest dedup for obvious duplicates, query-time diversity for semantic neighbors. Harbor Legal uses ingest MinHash (shingle size 5, Jaccard > 0.85) plus post-retrieval embedding dedup at τ = 0.92 on top-8 before reranking.
Harbor Legal corpus refactor
Before refactor, Harbor Legal ingested every client MSA PDF independently. Shared templates produced thousands of chunks differing only in party names and effective dates. Symptoms:
- Top-5 lists with 3–4 chunks from the same template section.
- Reranker scores clustered high on all variants — no diversity signal.
- Answers cited wrong effective dates when duplicates carried stale metadata.
- Index size 2.4M chunks; estimated 31% near-duplicate by sampling.
Refactor pipeline:
- Normalize — strip headers/footers, redact party names to placeholders for template matching.
- Exact hash — drop byte-identical chunks; log counts.
- MinHash LSH — bucket candidates; merge clusters above Jaccard 0.85; keep newest effective_date as canonical.
- Metadata — store cluster_id, alias doc list, and jurisdiction tags on canonical rows only.
- Query pass — after ANN top-20, remove pairs with embedding cosine > 0.92 before cross-encoder rerank to top-5.
- Reindex — incremental upsert per index update policy; nightly cluster reconciliation job.
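The cluster-merge and metadata steps of the pipeline can be sketched as follows. Confirmed near-duplicate pairs are grouped with a union-find, the member with the newest effective_date becomes canonical, and the other source documents are recorded as aliases. This is an assumed reconstruction for illustration, not Harbor Legal's code; the record shape and function names are hypothetical.

```python
from collections import defaultdict
from datetime import date


def cluster_pairs(ids, pairs):
    """Group ids into clusters given confirmed near-duplicate pairs (union-find)."""
    parent = {i: i for i in ids}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    clusters = defaultdict(list)
    for i in ids:
        clusters[find(i)].append(i)
    return list(clusters.values())


def pick_canonicals(chunks, pairs):
    """Choose the newest member of each cluster as canonical and list aliases.

    chunks: dict of chunk_id -> {"doc_id": str, "effective_date": date}.
    """
    rows = []
    for cluster_id, members in enumerate(cluster_pairs(list(chunks), pairs)):
        canonical = max(members, key=lambda m: chunks[m]["effective_date"])
        rows.append({
            "cluster_id": cluster_id,
            "canonical_chunk_id": canonical,
            "also_seen_in": sorted(
                chunks[m]["doc_id"] for m in members if m != canonical
            ),
        })
    return rows


example_chunks = {
    "c1": {"doc_id": "doc_a", "effective_date": date(2022, 1, 1)},
    "c2": {"doc_id": "doc_b", "effective_date": date(2023, 6, 1)},
    "c3": {"doc_id": "doc_c", "effective_date": date(2021, 3, 15)},
    "c4": {"doc_id": "doc_d", "effective_date": date(2020, 9, 30)},
}
print(pick_canonicals(example_chunks, [("c1", "c2"), ("c2", "c3")]))
```

Because union-find merges transitively, a chain of pairwise matches can pull loosely related chunks into one cluster; inspecting large clusters during the nightly reconciliation job is a reasonable safeguard.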
Outcomes: index size 2.4M → 1.7M chunks (−29%); redundant top-5 rate 38% → 9%; cap-question accuracy 71% → 86%; p95 retrieval latency −12% (fewer vectors in ANN). Recall@10 on distinct-fact queries unchanged — dedup removed noise, not unique evidence.
Choosing similarity thresholds
Thresholds are corpus-specific. Calibrate on labeled pairs: (duplicate / not duplicate / unsure). Plot precision-recall vs τ for embedding cosine and Jaccard separately.
- Legal and policy text — conservative ingest merge (Jaccard 0.90+); query-time τ 0.93–0.96 to preserve jurisdiction variants.
- Support KB and FAQs — aggressive ingest dedup (0.80–0.85); users prefer one canonical answer.
- News and web crawl — MinHash at 0.70–0.80; syndication varies more than template MSAs.
- Code repositories — AST-aware or line-hash dedup; embedding cosine alone merges different functions with similar comments.
Always hold out an adversarial set of legitimately similar but distinct clauses (e.g. mutual vs one-way indemnity). Over-merging is harder to detect than under-merging in production.
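Calibration on labeled pairs amounts to sweeping τ and reading off precision and recall at each value. The sketch below assumes each labeled pair has already been scored (cosine or Jaccard) and that "unsure" pairs were excluded beforehand; the function names and the sample scores are made up for illustration.

```python
def precision_recall_at(scored_pairs, tau):
    """Precision and recall of the rule "similarity > tau means duplicate".

    scored_pairs: list of (similarity, is_duplicate) tuples from human labels.
    """
    tp = sum(1 for sim, dup in scored_pairs if sim > tau and dup)
    fp = sum(1 for sim, dup in scored_pairs if sim > tau and not dup)
    fn = sum(1 for sim, dup in scored_pairs if sim <= tau and dup)
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def sweep(scored_pairs, taus):
    """(tau, precision, recall) rows, ready to plot or tabulate."""
    return [(tau, *precision_recall_at(scored_pairs, tau)) for tau in taus]


def lowest_tau_meeting_precision(scored_pairs, taus, min_precision):
    """Most aggressive threshold that still satisfies a precision floor."""
    for tau in sorted(taus):
        precision, _ = precision_recall_at(scored_pairs, tau)
        if precision >= min_precision:
            return tau
    return None


labeled = [(0.98, True), (0.95, True), (0.93, False), (0.90, True), (0.80, False)]
for row in sweep(labeled, [0.85, 0.92, 0.94]):
    print(row)
```

Running the same sweep on the adversarial held-out set shows how many legitimately distinct clauses a candidate τ would merge, which is the failure mode that matters most for ingest-time dedup.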
Technique decision table
| Scenario | Prefer | Avoid |
|---|---|---|
| Template-heavy corpora (legal, compliance) | Ingest MinHash + canonical metadata + query embedding dedup | Reranking alone without diversity pass |
| Real-time news ingestion | SimHash or MinHash on ingest; short TTL on clusters | Permanent merge of stories still updating |
| Multilingual mirrors | Cross-lingual embedding similarity with human spot-check | English-only shingle MinHash |
| Small curated KB (<50k chunks) | Query-time MMR or doc_id dedup only | Heavy offline clustering jobs |
| Overlapping chunk windows | Increase stride or parent-child retrieval | Storing every overlap as independent evidence |
| Versioned documents (wikis, git docs) | Dedup by content hash; keep version lineage in metadata | Deleting old versions with no redirect to canonical |
| Latency-sensitive chat (<300 ms retrieval) | Ingest dedup only; query pass on top-8 max | Pairwise O(k²) embedding compare on k=50 |
Pair dedup with hybrid search when duplicates span lexical variants BM25 would separate — fuse ranks after dedup, not before.
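One way to read "fuse ranks after dedup" is to collapse each ranked list to canonical chunk IDs first and only then combine the lists. The sketch below uses reciprocal rank fusion as the combiner, which is an assumption on our part since the guide does not name a fusion method; the mapping name canonical_of and the constant k = 60 are likewise illustrative.

```python
from collections import defaultdict


def collapse_to_canonical(ranked_ids, canonical_of):
    """Map each hit to its canonical id, keeping only the first occurrence."""
    seen = set()
    collapsed = []
    for chunk_id in ranked_ids:
        canonical = canonical_of.get(chunk_id, chunk_id)
        if canonical not in seen:
            seen.add(canonical)
            collapsed.append(canonical)
    return collapsed


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Sum 1 / (k + rank) per id across lists; higher totals rank first."""
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Ties are broken by id so the output order is deterministic.
    return sorted(scores, key=lambda cid: (-scores[cid], cid))


canonical_of = {"a2": "a", "a3": "a"}       # duplicates point at their canonical
vector_hits = ["a", "a2", "a3", "b"]        # dense retrieval, best first
bm25_hits = ["b", "c", "a2"]                # lexical retrieval, best first

fused = reciprocal_rank_fusion([
    collapse_to_canonical(vector_hits, canonical_of),
    collapse_to_canonical(bm25_hits, canonical_of),
])
print(fused)
```

Fusing before collapsing would let each duplicate accrue its own score and occupy its own slot, which is the crowding problem dedup is meant to remove.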
Common pitfalls
- Dedup after reranking only — reranker may score all duplicates highly; diversity pass must run on a wider candidate pool.
- Ignoring metadata collisions — merging chunks with different effective_date or ACL scopes causes authorization bugs.
- Global τ from another domain — 0.92 cosine worked for Harbor Legal; support FAQs may need 0.88 or 0.95.
- Deleting without alias pointers — audit trails break; store canonical_chunk_id on suppressed rows.
- Chunk-level dedup on structured tables — row-wise similarity merges distinct SKUs; dedup at table or row-key granularity.
- Assuming dedup fixes bad chunking — oversized chunks still pack unrelated facts; fix splits first.
- No monitoring — track redundant-rate in top-k weekly; spikes indicate a new syndicated source or broken ingest.
- Skipping eval on distinct-fact queries — dedup can hide the only chunk that mentions a rare exception; test tail queries.
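Two of the pitfalls above, metadata collisions and deletion without alias pointers, can be guarded in code at the moment of merge. The sketch below refuses to suppress a chunk when guarded metadata fields differ from the canonical row, and otherwise keeps the row with a canonical_chunk_id pointer instead of deleting it. The field names tenant_id and acl_scope are hypothetical, and which fields to guard is a policy decision: a pipeline that deliberately keeps the newest effective_date as canonical would leave that field out of the guard list.

```python
def merge_blockers(duplicate, canonical,
                   guarded_fields=("tenant_id", "acl_scope", "effective_date")):
    """Names of guarded metadata fields on which the two rows disagree."""
    return [f for f in guarded_fields if duplicate.get(f) != canonical.get(f)]


def suppress(duplicate, canonical,
             guarded_fields=("tenant_id", "acl_scope", "effective_date")):
    """Return a suppressed copy of `duplicate` that points at `canonical`.

    Raises ValueError instead of merging when guarded metadata differs.
    """
    blockers = merge_blockers(duplicate, canonical, guarded_fields)
    if blockers:
        raise ValueError(f"refusing to merge, fields differ: {blockers}")
    return {
        **duplicate,
        "suppressed": True,
        "canonical_chunk_id": canonical["chunk_id"],  # lineage for audits
    }
```

Pairs rejected by the guard are good candidates for the labeled adversarial set, since they are textually similar but must stay distinct.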
Production checklist
- Label 200+ chunk pairs (duplicate / distinct / unsure) for threshold tuning.
- Implement exact hash dedup as the first ingest stage (cheap win).
- Choose MinHash, SimHash, or embedding clustering based on corpus type.
- Preserve canonical IDs and alias lists; never hard-delete without lineage.
- Respect ACL and tenant boundaries — never merge across tenants.
- Add query-time diversity when k < 10 and corpora are template-heavy.
- Log redundant-rate and cluster size distribution to observability.
- Re-run cluster reconciliation on incremental ingests nightly or on threshold change.
- Measure recall@k on deduplicated eval sets, not raw index hits.
- Document override process when legal/compliance blocks automatic merge.
Key takeaways
- Duplicate and near-duplicate chunks waste context, inflate citation confidence, and can introduce contradictory dates or amounts.
- Exact hash, MinHash/SimHash, and embedding similarity address different duplicate classes — most pipelines use more than one.
- Ingest dedup shrinks indexes; query-time diversity improves top-k without reindexing — combine both for template-heavy corpora.
- Harbor Legal cut redundant top-5 hits from 38% to 9% and raised cap-question accuracy from 71% to 86% after MinHash ingest plus embedding post-filter.
- Calibrate thresholds on domain-labeled pairs; over-merging distinct clauses is worse than leaving minor redundancy.
Related reading
- RAG chunking strategies explained — overlap, stride, and parent-child patterns that affect duplicate rate
- Vector databases explained — ANN indexes, metadata filters, and storage implications of smaller corpora
- LLM embeddings explained — why semantically similar passages cluster and how cosine thresholds behave
- RAG incremental index updates explained — reindexing and cluster reconciliation when documents change