RAG incremental index updates explained
Harbor Support's policy assistant indexed 38,000 internal articles and PDFs. Product shipped a deductible change on a Monday; legal retired the old clause the same afternoon. The bot kept citing the superseded $500 figure for six weeks. Root cause was not retrieval tuning — it was stale vectors. Their pipeline ran a nightly full re-embed of the entire corpus. Jobs routinely timed out at 40% completion, leaving the index stuck on a snapshot from early March while the docstore held April revisions. Agents escalated to humans who already had the correct answer in Confluence.
The refactor replaced brute-force rebuilds with incremental index updates: content-hash change detection at ingest, document-level upserts and tombstones, chunk-level diffing when only sections moved, and a versioned dual-index cutover when the embedding model changed. Median freshness lag dropped from 19 days to under four hours. This guide covers update taxonomies, change detection strategies, re-embedding scopes, migration patterns, the Harbor Support refactor, a technique decision table versus full nightly rebuilds, pitfalls, and a production checklist — alongside our guides on document ingestion, chunking, and vector databases.
What incremental index updates are
A RAG index is a mapping from chunk IDs to embedding vectors plus metadata filters. Incremental updates modify only the rows affected by source changes instead of re-embedding and re-upserting the entire corpus on every sync.
Incremental maintenance is distinct from:
- Full rebuild — drop and recreate the index from scratch; simple but expensive and risky under load.
- Append-only ingestion — new docs land without updating or deleting old versions; duplicates accumulate silently.
- Query-time freshness — fetching live HTML at answer time without updating vectors; useful as a supplement, not a substitute for index hygiene.
Production systems need three steps coordinated: detect what changed, update vectors for affected chunks, and tombstone or filter rows that no longer exist in the authoritative source.
Update taxonomy
Document-level upsert
When a file's content hash changes, re-chunk the document, delete prior chunk
IDs for that doc_id, embed new chunks, and upsert. Best when chunk
boundaries are stable or cheap to recompute. Works well for wiki pages and policy
PDFs that change infrequently but completely.
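A document-level upsert could look roughly like the sketch below. It is a minimal illustration, not a production implementation: the in-memory `index` and `registry` dicts, the `fake_embed` placeholder, and the fixed-width `chunk` helper are all assumptions standing in for a real vector database, docstore, embedding model, and chunker.

```python
import hashlib

# In-memory stand-ins (assumptions): a real system would use a vector DB and a docstore.
index = {}      # chunk_id -> {"doc_id", "text", "vector"}
registry = {}   # doc_id -> {"content_hash", "chunk_ids"}

def fake_embed(text):
    """Placeholder embedding derived from a hash; swap in a real model call."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    return [b / 255 for b in digest[:4]]

def chunk(text, size=40):
    """Naive fixed-width chunker, for illustration only."""
    return [text[i:i + size] for i in range(0, len(text), size)]

def upsert_document(doc_id, text):
    """Re-chunk, drop prior chunk IDs, embed, and upsert when the hash changed."""
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    prior = registry.get(doc_id)
    if prior and prior["content_hash"] == content_hash:
        return "skipped"
    # Delete every chunk previously written for this doc_id.
    for chunk_id in (prior["chunk_ids"] if prior else []):
        index.pop(chunk_id, None)
    chunk_ids = []
    for n, piece in enumerate(chunk(text)):
        chunk_id = f"{doc_id}#{n}"
        index[chunk_id] = {"doc_id": doc_id, "text": piece, "vector": fake_embed(piece)}
        chunk_ids.append(chunk_id)
    registry[doc_id] = {"content_hash": content_hash, "chunk_ids": chunk_ids}
    return "upserted"
```

Calling `upsert_document` twice with identical text returns `"skipped"` the second time, so no embedding work is repeated for unchanged documents.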
Chunk-level diff
For long documents, compare section hashes or paragraph anchors. Re-embed only changed spans; retain unchanged chunk IDs and vectors. Reduces embedding API cost when 95% of a 200-page manual is static. Requires deterministic chunking — see our chunking strategies guide on parent-child and structure-aware splits.
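As a rough sketch of the comparison step, the snippet below hashes each section and classifies section keys as changed, removed, or unchanged. The section keys and sample text are hypothetical, and it assumes the chunker already produces stable keys for each section.

```python
import hashlib

def section_hashes(sections):
    """Map each stable section key to the SHA-256 of its text."""
    return {key: hashlib.sha256(text.encode("utf-8")).hexdigest()
            for key, text in sections.items()}

def diff_chunks(old_hashes, new_sections):
    """Return (to_embed, to_delete, unchanged) lists of section keys."""
    new_hashes = section_hashes(new_sections)
    to_embed = sorted(k for k, h in new_hashes.items() if old_hashes.get(k) != h)
    to_delete = sorted(k for k in old_hashes if k not in new_hashes)
    unchanged = sorted(k for k, h in new_hashes.items() if old_hashes.get(k) == h)
    return to_embed, to_delete, unchanged

# Hypothetical document: one section edited, one removed, one untouched.
old = section_hashes({
    "1-intro": "Welcome.",
    "2-deductible": "Old deductible clause.",
    "3-appendix": "Legacy terms.",
})
new = {"1-intro": "Welcome.", "2-deductible": "New deductible clause."}
print(diff_chunks(old, new))
# prints (['2-deductible'], ['3-appendix'], ['1-intro'])
```

Only the keys in the first list need an embedding call; the keys in the last list keep their existing chunk IDs and vectors.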
Tombstones and soft deletes
Never leave orphan vectors when a source document is removed. Write a tombstone
record (or set deleted_at metadata) and filter tombstoned rows at
query time. Hard deletes are fine in dev; production needs audit trails and
rollback.
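One way to picture soft deletes is the sketch below, which stamps `deleted_at` on rows and filters them at query time. The `rows` dict, the precomputed `score` field, and the `restore` helper are illustrative assumptions; a real vector database would apply the filter through its own metadata filtering.

```python
from datetime import datetime, timezone

# Hypothetical rows; "score" stands in for a similarity score computed per query.
rows = {
    "faq#0": {"text": "Current refund policy", "score": 0.82, "deleted_at": None},
    "old#0": {"text": "Retired refund policy", "score": 0.91, "deleted_at": None},
}

def tombstone(chunk_ids, now=None):
    """Soft-delete rows by stamping deleted_at instead of removing them."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    for chunk_id in chunk_ids:
        if chunk_id in rows:
            rows[chunk_id]["deleted_at"] = stamp

def restore(chunk_ids):
    """Rollback path: clear the tombstone so the rows become searchable again."""
    for chunk_id in chunk_ids:
        if chunk_id in rows:
            rows[chunk_id]["deleted_at"] = None

def search(top_k=5):
    """Return live chunk IDs by descending score; tombstoned rows are filtered out."""
    live = [(cid, r) for cid, r in rows.items() if r["deleted_at"] is None]
    live.sort(key=lambda item: item[1]["score"], reverse=True)
    return [cid for cid, _ in live[:top_k]]

tombstone(["old#0"])
print(search())  # prints ['faq#0']
```

Because the row is kept, the tombstone timestamp doubles as an audit record and `restore` gives a cheap rollback.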
CDC and event-driven sync
Connect to change streams: S3 object notifications, Confluence webhooks, database
binlog, or Git commits. Each event carries doc_id, revision, and
operation (create / update / delete). A queue worker performs embed-and-upsert
within seconds instead of waiting for cron.
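A queue worker for such events might be sketched as follows. The `ChangeEvent` shape mirrors the three fields named above; the `state` dict and the rule of ignoring stale revisions are assumptions added for illustration, and the actual embed-and-upsert call is left as a comment.

```python
import queue
from dataclasses import dataclass

@dataclass
class ChangeEvent:
    doc_id: str
    revision: int
    operation: str  # "create", "update", or "delete"

# doc_id -> {"revision", "deleted"}; a stand-in for registry plus vector index.
state = {}

def handle(event):
    """Apply one event; stale or replayed revisions are ignored."""
    if event.operation not in ("create", "update", "delete"):
        raise ValueError(f"unknown operation: {event.operation}")
    current = state.get(event.doc_id)
    if current and event.revision <= current["revision"]:
        return "ignored"
    deleted = event.operation == "delete"
    # A real worker would embed and upsert here, or tombstone on delete.
    state[event.doc_id] = {"revision": event.revision, "deleted": deleted}
    return "deleted" if deleted else "upserted"

def drain(events):
    """Process every queued event and return the outcome of each."""
    results = []
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return results
        results.append(handle(event))

events = queue.Queue()
for e in [ChangeEvent("kb-1", 1, "create"), ChangeEvent("kb-1", 2, "update"),
          ChangeEvent("kb-1", 1, "update"), ChangeEvent("kb-1", 3, "delete")]:
    events.put(e)
print(drain(events))  # prints ['upserted', 'upserted', 'ignored', 'deleted']
```

Keeping the last applied revision even after a delete means a late-arriving older update cannot resurrect a removed document.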
Versioned index migration
When the embedding model or chunk strategy changes, build a parallel index
(index_v2), backfill incrementally, run shadow queries comparing
recall@k, then flip a routing flag. Never mutate vectors in place across model
generations — dimensions and geometry differ.
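The shadow comparison and routing flag could be sketched like this. The golden queries, the lookup-table "search" functions, and the `tolerance` threshold are hypothetical; a real setup would query both indexes with the same live or replayed traffic.

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of relevant IDs that appear in the top-k retrieved IDs."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)

def shadow_compare(golden, search_v1, search_v2, k=10):
    """Average recall@k for both indexes over the same golden queries."""
    totals = {"index_v1": 0.0, "index_v2": 0.0}
    for query, relevant in golden.items():
        totals["index_v1"] += recall_at_k(search_v1(query), relevant, k)
        totals["index_v2"] += recall_at_k(search_v2(query), relevant, k)
    return {name: total / len(golden) for name, total in totals.items()}

def choose_active(scores, tolerance=0.01):
    """Route to index_v2 only if it is not worse than index_v1 beyond the tolerance."""
    if scores["index_v2"] >= scores["index_v1"] - tolerance:
        return "index_v2"
    return "index_v1"

# Hypothetical golden set and canned results standing in for real index queries.
golden = {"deductible": ["d1", "d2"], "refund": ["r1"]}
results_v1 = {"deductible": ["d1", "x"], "refund": ["r1"]}
results_v2 = {"deductible": ["d1", "d2"], "refund": ["r1"]}

scores = shadow_compare(golden, results_v1.get, results_v2.get, k=2)
print(scores, choose_active(scores))
```

The routing decision is a single value that can be flipped back, which is what makes the cutover reversible while both indexes still exist.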
Change detection mechanics
Reliable increments start with knowing what changed:
- Content hash: SHA-256 of normalized UTF-8 text after ingestion cleaning. Cheap, deterministic, catches substantive edits. Ignores metadata-only bumps unless you hash metadata separately.
- Source revision IDs: Confluence version, Google Drive modifiedTime, CMS etag. Trust but verify: some APIs bump version on permission changes without text changes.
- Structural diff: for HTML and Markdown, diff heading trees to decide which sections need re-chunking. Avoids re-embedding unchanged appendices when someone edits the summary.
- Lineage registry: a docstore table mapping doc_id to content_hash, chunk_ids[], embed_model, and indexed_at. The registry is the source of truth for sync state; the vector DB is a materialized view.
Harbor Support's registry lived in Postgres. Ingest workers compared incoming hashes to the registry row; unchanged hashes short-circuited before any embedding call. That alone cut nightly embedding spend by 72%.
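A hash gate of this kind might look like the sketch below. The specific normalization steps (Unicode NFC plus whitespace collapsing), the dict-based registry, and the model names are assumptions for illustration, not a description of Harbor Support's actual code.

```python
import hashlib
import re
import unicodedata

def content_hash(text):
    """SHA-256 over normalized UTF-8 text so cosmetic differences do not change the hash."""
    normalized = unicodedata.normalize("NFC", text)
    normalized = re.sub(r"\s+", " ", normalized).strip()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def needs_embedding(registry, doc_id, text, embed_model):
    """True when the doc is new, its text changed, or it was embedded with another model."""
    row = registry.get(doc_id)
    if row is None:
        return True
    return row["content_hash"] != content_hash(text) or row["embed_model"] != embed_model

# Hypothetical registry row; "model-a" is a placeholder model name.
registry = {
    "kb-1": {"content_hash": content_hash("Deductible  policy\n"), "embed_model": "model-a"},
}
print(needs_embedding(registry, "kb-1", "Deductible policy", "model-a"))   # prints False
print(needs_embedding(registry, "kb-1", "Deductible policy!", "model-a"))  # prints True
```

Checking the embed model alongside the hash keeps the gate from skipping documents that still carry vectors from an older model.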
Harbor Support refactor
Before the refactor, the pipeline looked like this: cron at 02:00 UTC listed all objects in S3, re-parsed every PDF, re-chunked everything, batched embeddings, and upserted into Pinecone. Failures at hour three left a partial write with no transaction boundary — some docs updated, others did not, with no manifest.
The new architecture:
- Webhook + hourly sweep: Confluence and SharePoint webhooks enqueue doc_id jobs; an hourly sweep catches missed events.
- Ingest gate: run the ingestion pipeline; compute content_hash; exit early if hash matches registry.
- Chunk diff: structure-aware chunker emits stable chunk_id = doc_id + section_path + ordinal; diff against prior chunk list; delete removed IDs from the vector index.
- Embed batch: only changed chunks enter the embed queue with rate-limit aware workers (P2 priority behind live chat).
- Atomic doc commit: upsert new vectors, delete stale chunk IDs, update registry in one transaction per document.
- Freshness SLO: alert if any doc_id with source revision newer than registry indexed_at exceeds four hours.
Recall@10 on their golden set held at 81% while embedding cost fell 68%. Escalations citing outdated policy language dropped 44% in the first month.
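The freshness SLO check in the architecture above could be approximated as follows. This sketch assumes the source system exposes a last-revised timestamp per doc_id and that the registry stores indexed_at; the timestamps and document names are hypothetical.

```python
from datetime import datetime, timedelta, timezone

SLO = timedelta(hours=4)

def stale_docs(source_revisions, indexed_at_by_doc, now, slo=SLO):
    """doc_ids whose source revision has been ahead of the index for longer than the SLO."""
    stale = []
    for doc_id, revised_at in source_revisions.items():
        indexed_at = indexed_at_by_doc.get(doc_id)
        behind = indexed_at is None or revised_at > indexed_at
        if behind and now - revised_at > slo:
            stale.append(doc_id)
    return sorted(stale)

# Hypothetical timestamps for three documents.
now = datetime(2024, 4, 2, 12, 0, tzinfo=timezone.utc)
source_revisions = {
    "a": now - timedelta(hours=5),   # edited 5h ago, not yet re-indexed
    "b": now - timedelta(hours=1),   # edited 1h ago, still within the SLO
    "c": now - timedelta(hours=6),   # edited 6h ago, re-indexed since
}
indexed_at_by_doc = {
    "a": now - timedelta(days=2),
    "b": now - timedelta(days=2),
    "c": now - timedelta(hours=5),
}
print(stale_docs(source_revisions, indexed_at_by_doc, now))  # prints ['a']
```

A scheduled job could run this check and page when the returned list is non-empty.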
Technique decision table
| Approach | Best when | Freshness | Ops cost |
|---|---|---|---|
| Nightly full rebuild | Corpus under 10k chunks, dev/staging | 24h lag; fragile on timeout | Low code; high compute at scale |
| Content-hash incremental | Wiki/CMS with moderate edit rate | Minutes to hours with webhooks | Medium; needs registry |
| Chunk-level diff | Large manuals, legal corpora | Hours; minimal embed on typo fixes | Higher; deterministic chunking required |
| CDC event stream | High-stakes, fast-moving knowledge | Seconds to minutes | Medium-high; queue infra |
| Dual-index model migration | Embedding model upgrade | Days backfill; zero-downtime cutover | High temporary storage |
| Query-time fetch only | Tiny corpus, always-live URLs | Real-time text; stale vectors remain | Low index ops; latency at query |
Full rebuild cost scales with total corpus size, not with how much actually changed: embedding cost grows with every document added, while edit rate often grows sublinearly. Incremental pipelines amortize work across actual changes.
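To make the scaling argument concrete, here is a small arithmetic sketch. The corpus sizes and nightly change counts are invented purely for illustration and are not measurements from any real system.

```python
def nightly_chunks_embedded(total_chunks, changed_chunks):
    """Chunks sent to the embedding model per night under each approach."""
    return {"full_rebuild": total_chunks, "incremental": changed_chunks}

def fraction_saved(total_chunks, changed_chunks):
    """Share of embedding work avoided by embedding only changed chunks."""
    if total_chunks == 0:
        return 0.0
    return 1 - changed_chunks / total_chunks

# Hypothetical corpus that doubles while nightly edits grow more slowly.
for total, changed in [(100_000, 2_000), (200_000, 3_000)]:
    print(total, nightly_chunks_embedded(total, changed),
          round(fraction_saved(total, changed), 3))
```

Under these assumed numbers the full rebuild doubles its nightly work when the corpus doubles, while the incremental pipeline grows only with the edit volume.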
Common pitfalls
- Re-chunking without re-embedding — new boundaries with old vectors produce nonsense retrieval; always pair chunk changes with embed updates.
- Orphan vectors after delete — users retrieve retired content; tombstones are mandatory.
- Partial batch failure — upserting 60% of a document's chunks leaves a frankenstein index; use per-document transactions.
- Hashing raw PDF bytes — whitespace normalization differs across parsers; hash cleaned text post-ingestion, not file bytes.
- Ignoring permission changes — a doc may hash the same but become restricted; sync ACL metadata filters separately from content hash.
- Model swap in place — mixing ada-002 and v3 vectors in one index destroys ranking; version indexes or re-embed entirely.
- No golden-set regression on sync — a bad ingest deploy poisons thousands of rows before anyone notices; run eval on each indexer release.
- Embedding cache keyed only by text — model version must be part of the cache key or you serve wrong-dimension vectors.
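The last pitfall above can be illustrated with a cache whose key includes the model version. The `EmbeddingCache` class, the model names, and the constant-vector embed functions are hypothetical placeholders for a real cache and real model calls.

```python
import hashlib

def cache_key(text, embed_model):
    """Key on both model version and text so vectors from different models never collide."""
    payload = f"{embed_model}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()

class EmbeddingCache:
    """Tiny in-memory cache; embed_fn_by_model maps a model name to an embed function."""

    def __init__(self, embed_fn_by_model):
        self._embed = embed_fn_by_model
        self._store = {}
        self.misses = 0

    def get(self, text, embed_model):
        key = cache_key(text, embed_model)
        if key not in self._store:
            self.misses += 1
            self._store[key] = self._embed[embed_model](text)
        return self._store[key]

# Placeholder models that return vectors of different dimensions.
cache = EmbeddingCache({
    "model-a": lambda text: [0.0] * 3,
    "model-b": lambda text: [0.0] * 5,
})
print(len(cache.get("hello", "model-a")), len(cache.get("hello", "model-b")))  # prints 3 5
```

With a text-only key, the second lookup would have returned the three-dimensional vector cached for the first model.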
Production checklist
- Maintain a docstore registry with content_hash, chunk IDs, embed model, and indexed_at.
- Short-circuit ingest when content hash is unchanged since last successful index.
- Emit tombstones or hard-delete chunk IDs when source documents are removed.
- Wrap per-document upsert + registry update in a transactional boundary.
- Queue embed jobs with priority below live user traffic.
- Alert on freshness SLO: source revision ahead of index by more than N hours.
- Log doc_id, operation, chunk count, and embed latency per sync job.
- Run recall@k golden-set eval after indexer or chunker deploys.
- Build parallel indexes for embedding model migrations; cut over with shadow traffic.
- Include embed model version in vector metadata and embedding cache keys.
- Sync ACL / visibility metadata independently from content hash.
- Keep a manual “force reindex” path for incident recovery.
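For the transactional-boundary item in the checklist, one possible shape is a per-document commit that restores prior state on failure. This is a sketch against in-memory dicts; most vector databases do not share a transaction with the registry database, so a real implementation would likely need a compensation or retry strategy along these lines. The `commit_document` name and its arguments are assumptions.

```python
def commit_document(index, registry, doc_id, new_chunks, content_hash):
    """Apply one document's chunk changes all-or-nothing against in-memory stand-ins.

    new_chunks maps chunk_id -> vector; a None vector simulates a failed embed.
    """
    prior = registry.get(doc_id)
    old_ids = prior["chunk_ids"] if prior else []
    touched = set(old_ids) | set(new_chunks)
    backup_rows = {cid: index[cid] for cid in touched if cid in index}
    try:
        for cid in old_ids:
            if cid not in new_chunks:
                index.pop(cid, None)          # delete stale chunk IDs
        for cid, vector in new_chunks.items():
            if vector is None:
                raise ValueError(f"missing vector for {cid}")
            index[cid] = vector               # upsert new vectors
        registry[doc_id] = {"content_hash": content_hash, "chunk_ids": list(new_chunks)}
    except Exception:
        # Compensate: restore every touched row so no partial write survives.
        for cid in touched:
            index.pop(cid, None)
        index.update(backup_rows)
        if prior is None:
            registry.pop(doc_id, None)
        else:
            registry[doc_id] = prior
        raise
```

If any chunk fails, the index and registry end up exactly as they were before the call, so a retry starts from a consistent state.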
Key takeaways
- Stale vectors cause confident wrong answers — freshness is a retrieval problem, not just a prompt problem.
- Content-hash gating and chunk-level diffing cut embed cost while shrinking lag from days to hours.
- Harbor Support fixed outdated policy citations with webhooks, per-doc transactions, and freshness SLOs.
- Embedding model changes require versioned indexes — never mix vector generations in one collection.
- Tombstones, registry discipline, and golden-set eval on every indexer release prevent silent index rot.
Related reading
- RAG document ingestion explained — parsing, OCR and clean text before hashing
- RAG chunking strategies explained — stable boundaries for chunk-level diff
- RAG evaluation explained — golden sets that catch freshness regressions
- Vector databases explained — upsert, delete and metadata filtering semantics