LLM RAG document deduplication explained

Harbor Legal’s contract RAG bot answered “What is the liability cap in the MSA template?” with five citations — all pointing to slightly different chunks of the same indemnity clause copied across twelve client addenda. Retrieval recall looked excellent; context diversity was terrible. The model averaged conflicting effective dates because four near-identical passages crowded out the one schedule that actually listed caps by tier. On a 200-query eval set, 38% of top-5 result lists contained at least two chunks with cosine similarity > 0.94 to each other. After adding document and chunk deduplication at ingest and a lightweight post-retrieval diversity pass, redundant hits dropped to 9% and answer accuracy on cap questions rose from 71% to 86% without changing the embedding model.

Deduplication removes or collapses redundant units in a RAG corpus — exact duplicates (same hash), near-duplicates (boilerplate clauses, press-release syndication, wiki mirrors), and overlapping chunks from aggressive splitting. It is distinct from reranking (which reorders) and from cross-encoder relevance (which scores query–passage fit). Clean indexes waste fewer tokens, reduce contradictory evidence in the context window, and make recall@k metrics more meaningful. This guide covers dedup layers, MinHash and embedding similarity, ingest vs query-time strategies, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist — complementing chunking strategy and incremental index updates.

Why duplicates hurt RAG

Vector search returns the k nearest neighbors in embedding space. When a corpus contains many copies of the same paragraph — template MSAs, changelog entries reposted to docs and Confluence, scraped news wires — those copies cluster tightly. A query about liability caps may retrieve five variants of clause 9.2 and zero rows from the fee schedule.

Context waste — LLM input tokens spent on redundant evidence; less room for diverse supporting passages.
False confidence — five citations to essentially the same text make answers look well-sourced when the evidence is thin.
Contradiction risk — near-duplicates often differ in dates, amounts, or jurisdiction footnotes; the model may blend them.
Inflated recall — eval sets count a hit if any duplicate matches; dedup exposes whether you truly cover distinct facts.
Index bloat — more vectors to store, slower ANN builds, higher embedding ingest cost.

Deduplication is hygiene, not a substitute for good embedding model choice or chunk boundaries. It ensures each slot in top-k carries incremental information.

Dedup layers: exact, near, and overlap

1. Exact duplicate detection

Normalize text (Unicode NFKC, lowercase, collapse whitespace), hash with SHA-256 or xxHash, and keep one canonical copy per hash. Fast, deterministic, catches copy-paste and unchanged template blocks. Store canonical_id and duplicate_of metadata for audit.
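The normalize-then-hash step can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the chunk keys "id" and "text" are assumptions, and the first chunk seen for a hash is arbitrarily treated as the canonical copy.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """NFKC-normalize, lowercase, and collapse runs of whitespace."""
    text = unicodedata.normalize("NFKC", text).lower()
    return re.sub(r"\s+", " ", text).strip()


def content_hash(text: str) -> str:
    """SHA-256 hex digest of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def dedupe_exact(chunks: list[dict]) -> list[dict]:
    """Annotate each chunk with canonical_id and duplicate_of.

    Assumes each chunk is a dict with hypothetical keys "id" and "text".
    The first chunk seen for a given hash becomes the canonical copy.
    """
    canonical_by_hash: dict[str, str] = {}
    out = []
    for chunk in chunks:
        digest = content_hash(chunk["text"])
        canonical = canonical_by_hash.setdefault(digest, chunk["id"])
        out.append({
            **chunk,
            "canonical_id": canonical,
            "duplicate_of": None if canonical == chunk["id"] else canonical,
        })
    return out
```

Nothing is deleted here; duplicates are only annotated, so a later stage can decide whether to skip embedding them while keeping the audit trail.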

2. Near-duplicate detection

Catches paraphrases, minor edits, and syndicated content. Common approaches:

Method	How it works	Best for
MinHash + LSH	Jaccard similarity on character or word shingles	Large text corpora, web crawl dedup
SimHash	Hamming distance on fingerprint bits	Near-duplicate news, legal boilerplate
Embedding cosine	Pairwise or ANN search above threshold τ	Semantic paraphrase, multilingual near-dup
Edit distance / fuzzy match	Levenshtein on normalized spans	Short strings, SKU tables, log lines
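To make the MinHash row concrete, here is a rough standard-library sketch that compares exact Jaccard similarity on word shingles with a MinHash estimate. The function names, the use of salted BLAKE2b as the hash family, and the 128-slot signature are illustrative assumptions. The LSH banding step that avoids all-pairs comparison is omitted; a production pipeline would more likely rely on a dedicated library.

```python
import hashlib


def shingles(text: str, size: int = 5) -> set[str]:
    """Word shingles of `size` consecutive tokens (whole text if shorter)."""
    words = text.lower().split()
    if len(words) <= size:
        return {" ".join(words)}
    return {" ".join(words[i:i + size]) for i in range(len(words) - size + 1)}


def jaccard(a: set[str], b: set[str]) -> float:
    """Exact Jaccard similarity: |A and B| / |A or B|."""
    return len(a & b) / len(a | b) if a or b else 1.0


def minhash_signature(items: set[str], num_perm: int = 128) -> list[int]:
    """One minimum per salted hash function; salts stand in for permutations."""
    signature = []
    for seed in range(num_perm):
        salt = seed.to_bytes(8, "little")
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(s.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for s in items
        ))
    return signature


def estimate_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(x == y for x, y in zip(sig_a, sig_b)) / len(sig_a)
```

The estimate is noisy: with 128 slots the standard error is a few percentage points, so pairs that land near the merge threshold are worth re-checking with exact Jaccard before merging.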

3. Chunk overlap dedup

Sliding-window chunking creates overlapping segments; with a short stride, adjacent chunks can share most of their tokens (80% when the stride is one fifth of the window). Options: increase the stride so that overlap carries context rather than redundancy; record the parent document and dedupe at retrieval by doc_id; or use late chunking to embed once and slice with less redundancy.
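How much adjacent chunks overlap follows directly from the window and stride. The sketch below is an illustration with hypothetical function names; it splits a token list into windows and computes the share of each full window that reappears in the next one.

```python
def sliding_chunks(tokens: list[str], window: int, stride: int) -> list[list[str]]:
    """Fixed-size windows advanced by `stride` tokens; the tail may be shorter."""
    if window <= 0 or stride <= 0:
        raise ValueError("window and stride must be positive")
    chunks = []
    for start in range(0, len(tokens), stride):
        chunks.append(tokens[start:start + window])
        if start + window >= len(tokens):
            break  # the last window already reaches the end of the text
    return chunks


def overlap_fraction(window: int, stride: int) -> float:
    """Share of a full window that is repeated in the following window."""
    return max(0, window - stride) / window
```

With a 500-token window, a stride of 100 repeats 80% of each chunk in the next one, while a stride of 400 repeats 20%.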

Ingest-time vs query-time dedup

Ingest-time (index hygiene)

Run dedup before embedding and upsert. Collapse duplicates to a canonical chunk; attach aliases in metadata (also_seen_in: [doc_b, doc_c]). Pros: smaller index, faster search, no runtime cost. Cons: must re-run on corpus updates; risk of over-merging distinct clauses that share boilerplate.
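Collapsing duplicates while keeping alias metadata might look like the sketch below. The keys "dedup_key" (a content hash or cluster id), "doc_id", and "text" are assumptions for illustration, and the choice of the first-seen chunk as canonical is arbitrary.

```python
from collections import defaultdict


def collapse_to_canonical(chunks: list[dict]) -> list[dict]:
    """Keep one row per dedup_key and record the other source documents.

    Assumes hypothetical keys "dedup_key", "doc_id", and "text".
    The first chunk seen for a key is kept as the canonical row.
    """
    canonical: dict[str, dict] = {}
    aliases: dict[str, list[str]] = defaultdict(list)
    for chunk in chunks:
        key = chunk["dedup_key"]
        if key not in canonical:
            canonical[key] = dict(chunk)
        elif (chunk["doc_id"] != canonical[key]["doc_id"]
              and chunk["doc_id"] not in aliases[key]):
            aliases[key].append(chunk["doc_id"])
    # Only canonical rows go on to embedding and upsert.
    return [{**row, "also_seen_in": aliases[key]} for key, row in canonical.items()]
```

Because the alias list travels with the canonical row, a citation can still point users to every document in which the passage appears.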

Query-time (post-retrieval diversity)

Retrieve k′ > k (e.g. 20), then filter: greedy maximal marginal relevance (MMR), dedupe by doc_id, or drop any candidate with cosine > τ to an already-selected chunk. Pros: tunable per query, safe for exploratory search. Cons: extra latency, does not shrink index storage.
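The simplest of these filters, dropping any candidate too close to one already selected, can be sketched as follows. It assumes candidates arrive sorted best-first and carry hypothetical "id" and "embedding" keys; the default τ of 0.92 is only a placeholder to be calibrated per corpus.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def diversify(candidates: list[dict], k: int, tau: float = 0.92) -> list[dict]:
    """Greedy filter over candidates already sorted by relevance (best first).

    A candidate is kept only if its cosine similarity to every previously
    kept candidate is at most tau. Stops once k candidates are selected.
    """
    selected: list[dict] = []
    for cand in candidates:
        if len(selected) >= k:
            break
        if all(cosine(cand["embedding"], s["embedding"]) <= tau for s in selected):
            selected.append(cand)
    return selected
```

Each candidate is compared only against the chunks already kept, so the cost is at most k′ × k cosine computations. MMR differs in that it trades relevance against similarity with a weighting parameter instead of a hard cutoff.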

Production stacks usually combine both: ingest dedup for obvious duplicates, query-time diversity for semantic neighbors. Harbor Legal uses ingest MinHash (shingle size 5, Jaccard > 0.85) plus post-retrieval embedding dedup at τ = 0.92 on top-8 before reranking.

Harbor Legal corpus refactor

Before refactor, Harbor Legal ingested every client MSA PDF independently. Shared templates produced thousands of chunks differing only in party names and effective dates. Symptoms:

Top-5 lists with 3–4 chunks from the same template section.
Reranker scores clustered high on all variants — no diversity signal.
Answers cited wrong effective dates when duplicates carried stale metadata.
Index size 2.4M chunks; estimated 31% near-duplicate by sampling.

Refactor pipeline:

Normalize — strip headers/footers, redact party names to placeholders for template matching.
Exact hash — drop byte-identical chunks; log counts.
MinHash LSH — bucket candidates; merge clusters above Jaccard 0.85; keep newest effective_date as canonical.
Metadata — store cluster_id, alias doc list, and jurisdiction tags on canonical rows only.
Query pass — after ANN top-20, remove pairs with embedding cosine > 0.92 before cross-encoder rerank to top-5.
Reindex — incremental upsert per index update policy; nightly cluster reconciliation job.
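The cluster-merge step of a pipeline like this can be approximated with union-find over pairs that already passed the similarity threshold, followed by picking the newest chunk per cluster. This is a sketch under assumptions, not Harbor Legal's actual code: the keys "id" and "effective_date" and the function names are hypothetical.

```python
def cluster_pairs(ids: list[str], similar_pairs: list[tuple[str, str]]) -> dict[str, str]:
    """Union-find over pairs already judged similar; returns id -> cluster root."""
    parent = {i: i for i in ids}

    def find(x: str) -> str:
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving keeps trees shallow
            x = parent[x]
        return x

    for a, b in similar_pairs:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a
    return {i: find(i) for i in ids}


def pick_canonicals(chunks: list[dict], similar_pairs: list[tuple[str, str]]) -> dict[str, str]:
    """Map each chunk id to the id of the newest chunk in its cluster.

    Assumes hypothetical keys "id" and "effective_date", where the date is
    a datetime.date or ISO 8601 string (anything that sorts chronologically).
    """
    roots = cluster_pairs([c["id"] for c in chunks], similar_pairs)
    newest: dict[str, dict] = {}
    for chunk in chunks:
        root = roots[chunk["id"]]
        if root not in newest or chunk["effective_date"] > newest[root]["effective_date"]:
            newest[root] = chunk
    return {c["id"]: newest[roots[c["id"]]]["id"] for c in chunks}
```

One caveat of this design is that merging is transitive: if A matches B and B matches C, all three land in one cluster even when A and C fall below the threshold, which is one way over-merging creeps in.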

Outcomes: index size 2.4M → 1.7M chunks (−29%); redundant top-5 rate 38% → 9%; cap-question accuracy 71% → 86%; p95 retrieval latency −12% (fewer vectors in ANN). Recall@10 on distinct-fact queries unchanged — dedup removed noise, not unique evidence.

Choosing similarity thresholds

Thresholds are corpus-specific. Calibrate on labeled pairs: (duplicate / not duplicate / unsure). Plot precision-recall vs τ for embedding cosine and Jaccard separately.
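A threshold sweep over labeled pairs needs only a few lines. The sketch below assumes each labeled pair has been reduced to a (similarity, is_duplicate) tuple and that pairs labeled "unsure" were set aside beforehand; the function names are illustrative.

```python
def precision_recall_at(labeled: list[tuple[float, bool]], tau: float) -> tuple[float, float]:
    """Precision and recall of the rule "similarity >= tau means duplicate".

    `labeled` holds (similarity, is_duplicate) tuples for pairs labeled
    duplicate or distinct. Returns 1.0 for a ratio with an empty denominator.
    """
    tp = sum(1 for s, dup in labeled if s >= tau and dup)
    fp = sum(1 for s, dup in labeled if s >= tau and not dup)
    fn = sum(1 for s, dup in labeled if s < tau and dup)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 1.0
    return precision, recall


def sweep(labeled: list[tuple[float, bool]], taus: list[float]) -> list[tuple[float, float, float]]:
    """Rows of (tau, precision, recall), ready to plot or tabulate."""
    return [(tau, *precision_recall_at(labeled, tau)) for tau in taus]
```

Running the sweep once with cosine scores and once with Jaccard scores gives the two curves to compare. For ingest-time merging, precision is usually the number to protect, since a false merge removes evidence from the index.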

Legal and policy text — conservative ingest merge (Jaccard 0.90+); query-time τ 0.93–0.96 to preserve jurisdiction variants.
Support KB and FAQs — aggressive ingest dedup (0.80–0.85); users prefer one canonical answer.
News and web crawl — MinHash at 0.70–0.80; syndication varies more than template MSAs.
Code repositories — AST-aware or line-hash dedup; embedding cosine alone merges different functions with similar comments.

Always hold out an adversarial set of legitimately similar but distinct clauses (e.g. mutual vs one-way indemnity). Over-merging is harder to detect than under-merging in production.

Technique decision table

Scenario	Prefer	Avoid
Template-heavy corpora (legal, compliance)	Ingest MinHash + canonical metadata + query embedding dedup	Reranking alone without diversity pass
Real-time news ingestion	SimHash or MinHash on ingest; short TTL on clusters	Permanent merge of stories still updating
Multilingual mirrors	Cross-lingual embedding similarity with human spot-check	English-only shingle MinHash
Small curated KB (<50k chunks)	Query-time MMR or doc_id dedup only	Heavy offline clustering jobs
Overlapping chunk windows	Increase stride or parent-child retrieval	Storing every overlap as independent evidence
Versioned documents (wikis, git docs)	Dedup by content hash; keep version lineage in metadata	Deleting old versions with no redirect to canonical
Latency-sensitive chat (<300 ms retrieval)	Ingest dedup only; query pass on top-8 max	Pairwise O(k²) embedding compare on k=50

Pair dedup with hybrid search when duplicates span lexical variants that BM25 would rank separately. Fuse ranks after dedup, not before, so that copies of one passage do not occupy several slots in the fused list.
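One way to apply dedup before fusion is to map every hit to its canonical id in each ranked list and only then combine the lists. The sketch below uses reciprocal rank fusion (RRF) as the fusion rule, which is an assumption since the guide does not name one; the `canonical_of` mapping and the constant k = 60 are likewise illustrative.

```python
def dedupe_ranked(ids: list[str], canonical_of: dict[str, str]) -> list[str]:
    """Replace ids by their canonical id, keeping first occurrences in order."""
    seen: set[str] = set()
    out = []
    for chunk_id in ids:
        canon = canonical_of.get(chunk_id, chunk_id)
        if canon not in seen:
            seen.add(canon)
            out.append(canon)
    return out


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; break ties by id for a stable order.
    return sorted(scores, key=lambda d: (-scores[d], d))


def fuse_after_dedup(dense: list[str], lexical: list[str],
                     canonical_of: dict[str, str]) -> list[str]:
    """Dedupe each retriever's list first, then fuse the deduplicated lists."""
    return rrf([dedupe_ranked(dense, canonical_of),
                dedupe_ranked(lexical, canonical_of)])
```

Deduplicating first means each distinct passage enters the fusion once per retriever, at the rank of its best-placed copy.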

Common pitfalls

Dedup after reranking only — reranker may score all duplicates highly; diversity pass must run on a wider candidate pool.
Ignoring metadata collisions — merging chunks with different effective_date or ACL scopes causes authorization bugs.
Global τ from another domain — 0.92 cosine worked for Harbor Legal; support FAQs may need 0.88 or 0.95.
Deleting without alias pointers — audit trails break; store canonical_chunk_id on suppressed rows.
Chunk-level dedup on structured tables — row-wise similarity merges distinct SKUs; dedup at table or row-key granularity.
Assuming dedup fixes bad chunking — oversized chunks still pack unrelated facts; fix splits first.
No monitoring — track redundant-rate in top-k weekly; spikes indicate a new syndicated source or broken ingest.
Skipping eval on distinct-fact queries — dedup can hide the only chunk that mentions a rare exception; test tail queries.
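The redundant-rate metric mentioned under monitoring can be computed from logged result lists. The sketch below assumes each result list is available as a list of embedding vectors and counts a list as redundant when any two of its chunks exceed a cosine threshold; 0.94 mirrors the figure used in the Harbor Legal eval, and the function names are hypothetical.

```python
import math
from itertools import combinations


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def has_redundant_pair(embeddings: list[list[float]], tau: float = 0.94) -> bool:
    """True if any two chunks in one result list are closer than tau."""
    return any(cosine(u, v) > tau for u, v in combinations(embeddings, 2))


def redundant_rate(result_lists: list[list[list[float]]], tau: float = 0.94) -> float:
    """Fraction of result lists that contain at least one redundant pair."""
    if not result_lists:
        return 0.0
    flagged = sum(has_redundant_pair(r, tau) for r in result_lists)
    return flagged / len(result_lists)
```

Tracked weekly on a fixed query sample, a jump in this number is a cheap early signal that a new duplicated source has entered the corpus.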

Production checklist

Label 200+ chunk pairs (duplicate / distinct / unsure) for threshold tuning.
Implement exact hash dedup as the first ingest stage (cheap win).
Choose MinHash, SimHash, or embedding clustering based on corpus type.
Preserve canonical IDs and alias lists; never hard-delete without lineage.
Respect ACL and tenant boundaries — never merge across tenants.
Add query-time diversity when k < 10 and corpora are template-heavy.
Log redundant-rate and cluster size distribution to observability.
Re-run cluster reconciliation on incremental ingests nightly or on threshold change.
Measure recall@k on deduplicated eval sets, not raw index hits.
Document override process when legal/compliance blocks automatic merge.
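The tenant and ACL item in the checklist can be enforced with a small guard that runs before any merge. This is a sketch under assumptions: the field names "tenant_id" and "acl_scope" are hypothetical, and missing metadata is deliberately treated as a reason not to merge.

```python
DEFAULT_PROTECTED = ("tenant_id", "acl_scope")  # hypothetical field names


def merge_blockers(a: dict, b: dict,
                   protected: tuple[str, ...] = DEFAULT_PROTECTED) -> list[str]:
    """Names of protected fields that prevent merging two chunks.

    A field blocks the merge if it is missing on either side or if the
    two values differ. This errs on the side of keeping chunks separate.
    """
    blockers = []
    for field in protected:
        left, right = a.get(field), b.get(field)
        if left is None or right is None or left != right:
            blockers.append(field)
    return blockers


def safe_to_merge(a: dict, b: dict,
                  protected: tuple[str, ...] = DEFAULT_PROTECTED) -> bool:
    """True only when every protected field is present and identical."""
    return not merge_blockers(a, b, protected)
```

Returning the list of blocking fields rather than a bare boolean makes it easier to log why a near-duplicate pair was left unmerged. Fields such as jurisdiction could be added to the protected tuple where legal review requires it.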

Key takeaways

Duplicate and near-duplicate chunks waste context, inflate citation confidence, and can introduce contradictory dates or amounts.
Exact hash, MinHash/SimHash, and embedding similarity address different duplicate classes — most pipelines use more than one.
Ingest dedup shrinks indexes; query-time diversity improves top-k without reindexing — combine both for template-heavy corpora.
Harbor Legal cut redundant top-5 hits from 38% to 9% and raised cap-question accuracy from 71% to 86% after MinHash ingest plus embedding post-filter.
Calibrate thresholds on domain-labeled pairs; over-merging distinct clauses is worse than leaving minor redundancy.