
RAG incremental index updates explained

Harbor Support's policy assistant indexed 38,000 internal articles and PDFs. Product shipped a deductible change on a Monday; legal retired the old clause the same afternoon. The bot kept citing the superseded $500 figure for six weeks. Root cause was not retrieval tuning — it was stale vectors. Their pipeline ran a nightly full re-embed of the entire corpus. Jobs routinely timed out at 40% completion, leaving the index stuck on a snapshot from early March while the docstore held April revisions. Agents escalated to humans who already had the correct answer in Confluence.

The refactor replaced brute-force rebuilds with incremental index updates: content-hash change detection at ingest, document-level upserts and tombstones, chunk-level diffing when only sections moved, and a versioned dual-index cutover when the embedding model changed. Median freshness lag dropped from 19 days to under four hours. This guide covers the update taxonomy, change detection strategies, re-embedding scopes, migration patterns, the Harbor Support refactor, a decision table comparing each technique against full nightly rebuilds, common pitfalls, and a production checklist. It complements our guides on document ingestion, chunking, and vector databases.

What incremental index updates are

A RAG index maps chunk IDs to embedding vectors plus the metadata used for filtering. Incremental updates modify only the rows affected by source changes instead of re-embedding and re-upserting the entire corpus on every sync.

Incremental maintenance is distinct from:

  • Full rebuild — drop and recreate the index from scratch; simple but expensive and risky under load.
  • Append-only ingestion — new docs land without updating or deleting old versions; duplicates accumulate silently.
  • Query-time freshness — fetching live HTML at answer time without updating vectors; useful as a supplement, not a substitute for index hygiene.

Production systems need three steps coordinated: detect what changed, update vectors for the affected chunks, and tombstone or filter rows that no longer exist in the authoritative source.

Update taxonomy

Document-level upsert

When a file's content hash changes, re-chunk the document, delete prior chunk IDs for that doc_id, embed new chunks, and upsert. Best when chunk boundaries are stable or cheap to recompute. Works well for wiki pages and policy PDFs that change infrequently but completely.
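A minimal sketch of the document-level upsert described above, using plain dictionaries as hypothetical stand-ins for the vector index and the registry. The toy paragraph chunker, the `doc_id#ordinal` ID format, and `fake_embed` are assumptions made so the example runs without an embedding API.

```python
import hashlib
from typing import Callable

# Hypothetical in-memory stand-ins for a vector index and a lineage registry.
index: dict[str, dict] = {}      # chunk_id -> {"vector", "doc_id", "text"}
registry: dict[str, dict] = {}   # doc_id -> {"content_hash", "chunk_ids"}


def chunk(text: str) -> list[str]:
    """Toy chunker: one chunk per non-empty paragraph."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def upsert_document(doc_id: str, text: str,
                    embed: Callable[[str], list[float]]) -> bool:
    """Re-index one document if its content hash changed.

    Returns True when vectors were rewritten, False when the hash matched.
    """
    content_hash = hashlib.sha256(text.encode("utf-8")).hexdigest()
    prior = registry.get(doc_id)
    if prior and prior["content_hash"] == content_hash:
        return False  # unchanged: no embedding call is made

    # Delete every prior chunk for this doc_id before writing the new set.
    for chunk_id in (prior["chunk_ids"] if prior else []):
        index.pop(chunk_id, None)

    new_ids = []
    for i, piece in enumerate(chunk(text)):
        chunk_id = f"{doc_id}#{i}"
        index[chunk_id] = {"vector": embed(piece), "doc_id": doc_id, "text": piece}
        new_ids.append(chunk_id)

    registry[doc_id] = {"content_hash": content_hash, "chunk_ids": new_ids}
    return True


def fake_embed(text: str) -> list[float]:
    """Placeholder embedding so the sketch runs offline."""
    return [float(len(text))]
```

Calling `upsert_document` twice with identical text does no work the second time; calling it with a shorter revision removes the chunk IDs that no longer exist.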

Chunk-level diff

For long documents, compare section hashes or paragraph anchors. Re-embed only changed spans; retain unchanged chunk IDs and vectors. Reduces embedding API cost when 95% of a 200-page manual is static. Requires deterministic chunking — see our chunking strategies guide on parent-child and structure-aware splits.
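One way to picture the section-hash comparison is the sketch below. It assumes each section already has a stable anchor (for example a heading path) and that the function names are illustrative, not part of any library.

```python
import hashlib


def section_hashes(sections: dict[str, str]) -> dict[str, str]:
    """Map each section anchor to a SHA-256 digest of its text."""
    return {
        anchor: hashlib.sha256(text.encode("utf-8")).hexdigest()
        for anchor, text in sections.items()
    }


def diff_sections(old: dict[str, str], new: dict[str, str]) -> dict[str, list[str]]:
    """Compare two anchor -> hash maps and decide what to do with each section."""
    return {
        # New or edited sections need fresh embeddings.
        "reembed": sorted(a for a in new if old.get(a) != new[a]),
        # Sections that disappeared must be removed from the index.
        "delete": sorted(a for a in old if a not in new),
        # Unchanged sections keep their existing chunk IDs and vectors.
        "keep": sorted(a for a in new if old.get(a) == new[a]),
    }
```

Only the anchors under `reembed` reach the embedding API, which is where the savings on a mostly static manual come from.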

Tombstones and soft deletes

Never leave orphan vectors when a source document is removed. Write a tombstone record (or set deleted_at metadata) and filter tombstoned rows at query time. Hard deletes are fine in dev; production needs audit trails and rollback.
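The soft-delete pattern can be sketched as follows, assuming each index row carries `doc_id` and `deleted_at` metadata. The row layout and function names are hypothetical; a real vector database would express the same filter in its own query syntax.

```python
from datetime import datetime, timezone
from typing import Optional


def tombstone_document(rows: dict[str, dict], doc_id: str,
                       now: Optional[datetime] = None) -> int:
    """Mark every live row of doc_id as deleted; return how many were marked."""
    stamp = (now or datetime.now(timezone.utc)).isoformat()
    marked = 0
    for row in rows.values():
        if row["doc_id"] == doc_id and row.get("deleted_at") is None:
            row["deleted_at"] = stamp
            marked += 1
    return marked


def live_rows(rows: dict[str, dict]) -> dict[str, dict]:
    """Query-time filter: only rows without a tombstone are retrievable."""
    return {cid: row for cid, row in rows.items() if row.get("deleted_at") is None}


def restore_document(rows: dict[str, dict], doc_id: str) -> int:
    """Rollback path: clear tombstones for doc_id; return how many were restored."""
    restored = 0
    for row in rows.values():
        if row["doc_id"] == doc_id and row.get("deleted_at") is not None:
            row["deleted_at"] = None
            restored += 1
    return restored
```

Because the vectors are still present, a mistaken delete can be undone with `restore_document` without any re-embedding.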

CDC and event-driven sync

Connect to change streams: S3 object notifications, Confluence webhooks, database binlog, or Git commits. Each event carries doc_id, revision, and operation (create / update / delete). A queue worker performs embed-and-upsert within seconds instead of waiting for cron.
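A rough sketch of the queue worker follows. The `ChangeEvent` shape mirrors the three fields named above; the integer revision and the rule that drops stale or duplicate deliveries are assumptions added for illustration, and the log strings stand in for the real embed-and-upsert and tombstone calls.

```python
from dataclasses import dataclass
from queue import Queue


@dataclass(frozen=True)
class ChangeEvent:
    doc_id: str
    revision: int    # assumed to increase monotonically per document
    operation: str   # "create", "update", or "delete"


def drain(queue: Queue, applied: dict[str, int], log: list[str]) -> None:
    """Process queued events, skipping any revision already applied."""
    while not queue.empty():
        event = queue.get()
        # Drop stale or duplicate deliveries.
        if event.revision <= applied.get(event.doc_id, -1):
            continue
        if event.operation in ("create", "update"):
            log.append(f"embed-and-upsert {event.doc_id}@{event.revision}")
        elif event.operation == "delete":
            log.append(f"tombstone {event.doc_id}@{event.revision}")
        else:
            raise ValueError(f"unknown operation: {event.operation}")
        applied[event.doc_id] = event.revision
```

Tracking the last applied revision per document keeps an out-of-order delivery from overwriting newer vectors with older text.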

Versioned index migration

When the embedding model or chunk strategy changes, build a parallel index (index_v2), backfill incrementally, run shadow queries comparing recall@k, then flip a routing flag. Never mutate vectors in place across model generations — dimensions and geometry differ.
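The shadow comparison and routing flip might look like the sketch below. It assumes a golden set mapping each query to its relevant chunk IDs and two search callables, one per index; the `tolerance` threshold and the `index_v1` name are illustrative choices, not values from the Harbor Support rollout.

```python
from typing import Callable

Search = Callable[[str], list[str]]  # query -> ranked chunk IDs


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunk IDs that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def mean_recall(search: Search, golden: dict[str, set[str]], k: int) -> float:
    """Average recall@k over every query in the golden set."""
    return sum(recall_at_k(search(q), rel, k) for q, rel in golden.items()) / len(golden)


def choose_index(search_v1: Search, search_v2: Search,
                 golden: dict[str, set[str]], k: int = 10,
                 tolerance: float = 0.01) -> str:
    """Route to index_v2 only if its recall is within tolerance of index_v1."""
    r1 = mean_recall(search_v1, golden, k)
    r2 = mean_recall(search_v2, golden, k)
    return "index_v2" if r2 >= r1 - tolerance else "index_v1"
```

The return value plays the role of the routing flag: traffic stays on the old index until the new one proves itself on the golden set.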

Change detection mechanics

Reliable increments start with knowing what changed:

  • Content hash — SHA-256 of normalized UTF-8 text after ingestion cleaning. Cheap, deterministic, catches substantive edits. Ignores metadata-only bumps unless you hash metadata separately.
  • Source revision IDs — Confluence version, Google Drive modifiedTime, CMS etag. Trust but verify: some APIs bump version on permission changes without text changes.
  • Structural diff — for HTML and Markdown, diff heading trees to decide which sections need re-chunking. Avoids re-embedding unchanged appendices when someone edits the summary.
  • Lineage registry — a docstore table mapping doc_id to content_hash, chunk_ids[], embed_model, and indexed_at. The registry is the source of truth for sync state; the vector DB is a materialized view.
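As an illustration of the content-hash bullet, here is a small sketch of hashing normalized text. The specific normalization rules (NFC, unified newlines, collapsed spaces and blank lines) are assumptions; the right rules depend on what your ingestion cleaning already guarantees.

```python
import hashlib
import re
import unicodedata


def normalize(text: str) -> str:
    """Reduce formatting noise so cosmetic differences do not change the hash."""
    text = unicodedata.normalize("NFC", text)
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    lines = [re.sub(r"[ \t]+", " ", line).strip() for line in text.split("\n")]
    collapsed: list[str] = []
    for line in lines:
        # Keep text lines, and at most one blank line between them.
        if line or (collapsed and collapsed[-1]):
            collapsed.append(line)
    return "\n".join(collapsed).strip()


def content_hash(text: str) -> str:
    """SHA-256 of the normalized UTF-8 text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
```

Two extractions of the same document that differ only in line endings or spacing hash identically, while any change to the wording produces a new digest.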

Harbor Support's registry lived in Postgres. Ingest workers compared incoming hashes to the registry row; unchanged hashes short-circuited before any embedding call. That alone cut nightly embedding spend by 72%.
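A registry gate of this kind could be sketched as below. Harbor Support used Postgres; this example substitutes Python's built-in `sqlite3` so it runs anywhere, and the table name, column layout, and the choice to also compare `embed_model` are assumptions rather than their actual schema.

```python
import json
import sqlite3
from datetime import datetime, timezone

SCHEMA = """
CREATE TABLE IF NOT EXISTS doc_registry (
    doc_id       TEXT PRIMARY KEY,
    content_hash TEXT NOT NULL,
    chunk_ids    TEXT NOT NULL,   -- JSON array of chunk IDs
    embed_model  TEXT NOT NULL,
    indexed_at   TEXT NOT NULL    -- ISO 8601 UTC timestamp
)
"""


def needs_reindex(conn: sqlite3.Connection, doc_id: str,
                  content_hash: str, embed_model: str) -> bool:
    """True when the doc is new, its text changed, or the model changed."""
    row = conn.execute(
        "SELECT content_hash, embed_model FROM doc_registry WHERE doc_id = ?",
        (doc_id,),
    ).fetchone()
    return row is None or row != (content_hash, embed_model)


def record_indexed(conn: sqlite3.Connection, doc_id: str, content_hash: str,
                   chunk_ids: list[str], embed_model: str) -> None:
    """Write the registry row after a successful index update."""
    with conn:  # commits on success, rolls back on exception
        conn.execute(
            "INSERT OR REPLACE INTO doc_registry VALUES (?, ?, ?, ?, ?)",
            (doc_id, content_hash, json.dumps(chunk_ids), embed_model,
             datetime.now(timezone.utc).isoformat()),
        )
```

An ingest worker would call `needs_reindex` before any embedding request and `record_indexed` only after the vector writes succeed.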

Harbor Support refactor

Before the refactor, the pipeline looked like this: cron at 02:00 UTC listed all objects in S3, re-parsed every PDF, re-chunked everything, batched embeddings, and upserted into Pinecone. Failures at hour three left a partial write with no transaction boundary — some docs updated, others did not, with no manifest.

The new architecture:

  1. Webhook + hourly sweep — Confluence and SharePoint webhooks enqueue doc_id jobs; an hourly sweep catches missed events.
  2. Ingest gate — run the ingestion pipeline; compute content_hash; exit early if hash matches registry.
  3. Chunk diff — structure-aware chunker emits stable chunk_id = doc_id + section_path + ordinal; diff against prior chunk list; delete removed IDs from the vector index.
  4. Embed batch — only changed chunks enter the embed queue with rate-limit aware workers (P2 priority behind live chat).
  5. Atomic doc commit — upsert new vectors, delete stale chunk IDs, update registry in one transaction per document.
  6. Freshness SLO — alert if any doc_id with source revision newer than registry indexed_at exceeds four hours.
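Steps 3 and 5 can be sketched together: stable chunk IDs plus a per-document commit that restores the previous state if anything fails midway. This is an in-memory illustration only. The `:` separator, the `dim` check used to simulate a failure, and the snapshot-and-restore approach are assumptions; whether a real vector store and a relational registry can share one transaction depends on the products involved.

```python
import copy


def make_chunk_id(doc_id: str, section_path: str, ordinal: int) -> str:
    """Stable ID built from doc_id + section_path + ordinal."""
    return f"{doc_id}:{section_path}:{ordinal}"


def commit_document(index: dict[str, list[float]], registry: dict[str, dict],
                    doc_id: str, new_chunks: dict[str, list[float]],
                    content_hash: str, dim: int) -> None:
    """Upsert new vectors, delete stale IDs, and update the registry as one unit."""
    prior_entry = copy.deepcopy(registry.get(doc_id))
    prior_ids = prior_entry["chunk_ids"] if prior_entry else []
    snapshot = {cid: index[cid] for cid in prior_ids if cid in index}
    try:
        for cid, vector in new_chunks.items():
            if len(vector) != dim:
                raise ValueError(f"{cid}: expected {dim} dimensions, got {len(vector)}")
            index[cid] = vector
        for cid in set(prior_ids) - set(new_chunks):
            index.pop(cid, None)
        registry[doc_id] = {"content_hash": content_hash,
                            "chunk_ids": list(new_chunks)}
    except Exception:
        # Compensate: remove partial writes and restore the prior document state.
        for cid in new_chunks:
            index.pop(cid, None)
        index.update(snapshot)
        if prior_entry is None:
            registry.pop(doc_id, None)
        else:
            registry[doc_id] = prior_entry
        raise
```

If one chunk fails validation, the document's earlier vectors and registry row are put back, so a reader never sees a half-updated document.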

Recall@10 on their golden set held at 81% while embedding cost fell 68%. Escalations citing outdated policy language dropped 44% in the first month.

Technique decision table

| Approach | Best when | Freshness | Ops cost |
| --- | --- | --- | --- |
| Nightly full rebuild | Corpus under 10k chunks, dev/staging | 24h lag; fragile on timeout | Low code; high compute at scale |
| Content-hash incremental | Wiki/CMS with moderate edit rate | Minutes to hours with webhooks | Medium; needs registry |
| Chunk-level diff | Large manuals, legal corpora | Hours; minimal embed on typo fixes | Higher; deterministic chunking required |
| CDC event stream | High-stakes, fast-moving knowledge | Seconds to minutes | Medium-high; queue infra |
| Dual-index model migration | Embedding model upgrade | Days backfill; zero-downtime cutover | High temporary storage |
| Query-time fetch only | Tiny corpus, always-live URLs | Real-time text; stale vectors remain | Low index ops; latency at query |

Full rebuilds scale poorly: embedding cost grows with total corpus size no matter how little changed, while the edit rate often grows sublinearly. Incremental pipelines pay only for the changes that actually occurred.
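To make the scaling argument concrete, the sketch below compares nightly embedding volume under each strategy. The corpus sizes and edit rates are hypothetical numbers chosen only to show the shape of the curve; they are not measurements from Harbor Support.

```python
def nightly_embed_calls(corpus_chunks: int, daily_edit_rate: float) -> dict[str, int]:
    """Chunks embedded per night: everything versus only what changed."""
    changed = round(corpus_chunks * daily_edit_rate)
    return {"full_rebuild": corpus_chunks, "incremental": changed}


# Hypothetical corpora whose edit rate falls as the corpus grows.
for chunks, rate in [(10_000, 0.05), (100_000, 0.02), (1_000_000, 0.005)]:
    calls = nightly_embed_calls(chunks, rate)
    share = calls["incremental"] / calls["full_rebuild"]
    print(f"{chunks:>9,} chunks: full={calls['full_rebuild']:,} "
          f"incremental={calls['incremental']:,} ({share:.1%} of full)")
```

Under these assumed rates the full rebuild grows a hundredfold while the incremental workload grows tenfold.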

Common pitfalls

  • Re-chunking without re-embedding — new boundaries with old vectors produce nonsense retrieval; always pair chunk changes with embed updates.
  • Orphan vectors after delete — users retrieve retired content; tombstones are mandatory.
  • Partial batch failure — upserting 60% of a document's chunks leaves a frankenstein index; use per-document transactions.
  • Hashing raw PDF bytes — whitespace normalization differs across parsers; hash cleaned text post-ingestion, not file bytes.
  • Ignoring permission changes — a doc may hash the same but become restricted; sync ACL metadata filters separately from content hash.
  • Model swap in place — mixing ada-002 and v3 vectors in one index destroys ranking; version indexes or re-embed entirely.
  • No golden-set regression on sync — a bad ingest deploy poisons thousands of rows before anyone notices; run eval on each indexer release.
  • Embedding cache keyed only by text — model version must be part of the cache key or you serve wrong-dimension vectors.
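The last pitfall is easy to avoid with a composite cache key. The sketch below is one possible shape, assuming the cache is a simple dictionary and that the model name and dimension count are known at call time; `EmbeddingCache` and `cache_key` are illustrative names.

```python
import hashlib
from typing import Callable


def cache_key(text: str, model: str, dimensions: int) -> str:
    """Key on model and dimensions as well as text, separated by NUL bytes."""
    payload = f"{model}\x00{dimensions}\x00{text}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


class EmbeddingCache:
    """Tiny in-memory cache; a real one would live in Redis or a database."""

    def __init__(self) -> None:
        self._store: dict[str, list[float]] = {}
        self.misses = 0

    def get_or_embed(self, text: str, model: str, dimensions: int,
                     embed: Callable[[str], list[float]]) -> list[float]:
        key = cache_key(text, model, dimensions)
        if key not in self._store:
            self.misses += 1
            self._store[key] = embed(text)
        return self._store[key]
```

The same text requested under two model versions produces two cache entries, so a model upgrade can never be served a vector from the previous generation.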

Production checklist

  • Maintain a docstore registry with content_hash, chunk IDs, embed model, and indexed_at.
  • Short-circuit ingest when content hash is unchanged since last successful index.
  • Emit tombstones or hard-delete chunk IDs when source documents are removed.
  • Wrap per-document upsert + registry update in a transactional boundary.
  • Queue embed jobs with priority below live user traffic.
  • Alert on freshness SLO: source revision ahead of index by more than N hours.
  • Log doc_id, operation, chunk count, and embed latency per sync job.
  • Run recall@k golden-set eval after indexer or chunker deploys.
  • Build parallel indexes for embedding model migrations; cut over with shadow traffic.
  • Include embed model version in vector metadata and embedding cache keys.
  • Sync ACL / visibility metadata independently from content hash.
  • Keep a manual “force reindex” path for incident recovery.
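The freshness SLO item in the checklist can be checked with a few lines. This sketch assumes you can obtain, per `doc_id`, the latest source revision time and the registry's `indexed_at` time as timezone-aware datetimes; the four-hour default mirrors the Harbor Support threshold, and the function name is illustrative.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def stale_docs(source_revised: dict[str, datetime],
               indexed_at: dict[str, datetime],
               now: Optional[datetime] = None,
               slo: timedelta = timedelta(hours=4)) -> list[str]:
    """Return doc_ids whose source is ahead of the index for longer than the SLO."""
    now = now or datetime.now(timezone.utc)
    breaches = []
    for doc_id, revised in source_revised.items():
        indexed = indexed_at.get(doc_id)
        behind = indexed is None or revised > indexed
        if behind and now - revised > slo:
            breaches.append(doc_id)
    return sorted(breaches)
```

A scheduled job could run this against the registry and page the on-call engineer whenever the returned list is non-empty.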

Key takeaways

  • Stale vectors cause confident wrong answers — freshness is a retrieval problem, not just a prompt problem.
  • Content-hash gating and chunk-level diffing cut embed cost while shrinking lag from days to hours.
  • Harbor Support fixed outdated policy citations with webhooks, per-doc transactions, and freshness SLOs.
  • Embedding model changes require versioned indexes — never mix vector generations in one collection.
  • Tombstones, registry discipline, and golden-set eval on every indexer release prevent silent index rot.
