LLM embedding drift monitoring explained

Harbor Support's RAG stack indexed 22,000 runbooks with a domain-fine-tuned bi-encoder that held 89% recall@10 on a 640-query golden set. Ingest pipelines reported green: every document hash matched its vector row, tombstones were current, and nightly incremental updates completed in under two hours. Then a vendor pushed a silent patch to the hosted embedding API — same model name, revised weights. No re-embed ran because nothing in the docstore changed. Within a week, acronym-heavy queries (“SSO SAML timeout 502,” “GDPR DSAR export SLA”) fell to 74% recall@10 while generic FAQs still scored 87%. Support escalations rose 22% before anyone noticed the index and encoder were out of sync.

That failure mode is embedding drift: the vector space your retrieval layer assumes no longer matches the space your encoder produces today. Drift is distinct from stale content (fixed by ingest sync) and from bad chunking (fixed by re-chunking). It is a silent regression in the geometry of similarity search. This guide covers drift types, monitoring signals, golden-set regression design, a remediation playbook, a decision table comparing monitoring techniques with ingest-only freshness checks, the Harbor Support refactor, common pitfalls, and a production checklist. It complements our guides on embedding fine-tuning, RAG evaluation, and embeddings fundamentals.

Three kinds of embedding drift

Production teams often conflate “stale index” with “drift.” They are related but remedied differently.

Drift type	What moves	Typical trigger	Fix
Model drift	Encoder weights or API version	Fine-tune redeploy, vendor patch, quantization change	Full corpus re-embed; dual-index cutover
Query drift	Distribution of user questions	New product launch, seasonality, language mix shift	Expand golden set; re-fine-tune; add sparse leg
Corpus drift	Document topics without hash changes	Metadata-only edits, synonym rewrites, template swaps	Re-chunk or re-embed affected namespaces

Incremental index updates solve corpus freshness when source bytes change. They do not detect model drift when the docstore is unchanged but the encoder moved. You need explicit embedding-space monitors on top of ingest health.
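One way to make that gap visible is to store an encoder revision next to each vector's content hash and check both. The sketch below is illustrative only: `VectorRow`, its `encoder_revision` field, and `needs_reembed` are hypothetical names, and it assumes your deployment exposes a revision string for the live encoder.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VectorRow:
    doc_id: str
    content_hash: str       # hash of the source bytes at embed time
    encoder_revision: str   # hypothetical field: encoder build that produced the vector


def needs_reembed(row: VectorRow, current_hash: str, live_revision: str) -> str:
    """Return why a row is stale, or 'fresh' if both checks pass."""
    if row.content_hash != current_hash:
        return "content_changed"   # the case ingest hash sync already catches
    if row.encoder_revision != live_revision:
        return "encoder_changed"   # model drift: invisible to hash sync alone
    return "fresh"


row = VectorRow("runbook-17", "abc123", "enc-rev-1")
print(needs_reembed(row, "abc123", "enc-rev-1"))  # fresh
print(needs_reembed(row, "abc123", "enc-rev-2"))  # encoder_changed
```

A revision check like this only helps when the revision string actually changes. A silent patch shipped under the same name still needs the geometric and golden-set monitors described in the rest of this guide.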

Why drift breaks RAG silently

Vector search has no obvious error surface. A wrong chunk still comes back with a high cosine score; it is simply not the right neighbor. Downstream symptoms are subtle:

Faithful-sounding wrong answers. The LLM cites retrieved text accurately but from an outdated or tangential article.
Rising human escalations while automated answer rate stays flat.
Segment skew. New product queries fail; legacy FAQs still work.
Reranker masking. A strong cross-encoder hides first-stage regression until latency budget forces you to drop it.
Hybrid imbalance. BM25 or SPLADE legs compensate temporarily, then query drift exposes the dense leg.

Without scheduled regression on a fixed golden query set, teams discover drift from user complaints — weeks after the encoder changed.

Monitoring signals that actually fire

Pick two or three signals with clear thresholds. More dashboards rarely help if nobody owns the on-call rotation.

Golden-set recall@k (primary)

Maintain a held-out set of 200–2,000 query–document relevance judgments, stratified by query type (acronyms, SKUs, procedural how-tos, troubleshooting). Run it nightly against the production index with the current encoder. Alert when recall@10 or MRR drops more than 2–3 absolute points versus a seven-day rolling baseline. Store the encoder version, index build ID, and chunk schema version alongside each run so you can bisect regressions.
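The nightly check can be sketched in a few lines of plain Python. This is a minimal illustration, not Harbor's implementation: the function names are made up, recall@k is taken to mean the fraction of labeled relevant documents found in the top k, and metrics passed to `should_alert` are assumed to be in percent.

```python
from statistics import mean


def recall_at_k(retrieved: list[str], relevant: set[str], k: int = 10) -> float:
    """Fraction of the labeled relevant documents that appear in the top-k results."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & relevant) / len(relevant)


def reciprocal_rank(retrieved: list[str], relevant: set[str]) -> float:
    """1 / rank of the first relevant result, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def should_alert(tonight: float, last_seven: list[float], max_drop_pts: float = 2.0) -> bool:
    """True when tonight's metric (percent) is more than max_drop_pts
    absolute points below the mean of the previous seven runs."""
    return (mean(last_seven) - tonight) > max_drop_pts


# Toy data: (retrieved doc IDs in rank order, labeled relevant IDs) per query.
runs = [
    (["d3", "d1", "d7"], {"d1"}),
    (["d4", "d5", "d6"], {"d9"}),
]
recall = mean(recall_at_k(r, rel) for r, rel in runs)
mrr = mean(reciprocal_rank(r, rel) for r, rel in runs)
print(recall, mrr)                        # 0.5 0.25
print(should_alert(74.0, [89.0] * 7))     # True
```

In practice each run would also be written out with the encoder version, index build ID, and chunk schema version so that a breach can be traced to a specific change.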

Centroid and score-distribution shift

Sample 1–5% of indexed vectors weekly. Re-encode the same chunk texts with the live encoder and measure mean L2 distance between stored and fresh vectors. A sudden jump across namespaces indicates model drift even before golden-set labels fail. Track retrieval score histograms (top-1 cosine) per query cluster; a flattening distribution often precedes recall drops.
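The stored-versus-fresh comparison could look roughly like the following. It is a sketch under stated assumptions: vectors are plain lists of floats keyed by chunk ID, `mean_l2_drift` is a hypothetical helper, and the fetch and re-encode steps that would feed it are left out.

```python
import math


def mean_l2_drift(stored: dict[str, list[float]], fresh: dict[str, list[float]]) -> float:
    """Mean L2 distance between stored vectors and fresh re-encodings,
    computed over the chunk IDs present in both samples."""
    shared = stored.keys() & fresh.keys()
    if not shared:
        raise ValueError("no overlapping chunk IDs to compare")
    return sum(math.dist(stored[c], fresh[c]) for c in shared) / len(shared)


# Toy 2-d vectors; real embeddings have hundreds of dimensions.
stored = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
unchanged = {"c1": [1.0, 0.0], "c2": [0.0, 1.0]}
moved = {"c1": [0.6, 0.8], "c2": [0.0, 1.0]}

print(mean_l2_drift(stored, unchanged))  # 0.0
print(mean_l2_drift(stored, moved))      # roughly 0.447
```

The alert threshold is deployment-specific. The 0.08 figure quoted in the Harbor refactor later in this guide is one example, and distances are only comparable from week to week if normalization stays the same.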

Neighbor stability

For a fixed probe set of chunk IDs, re-encode each chunk's text with the live encoder, query the production index, and compare the top-20 neighbor lists week over week. Jaccard similarity below 0.85 on a stable corpus signals that the space has moved. The check is cheap to run and catches vendor patches that do not bump version strings.
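A minimal sketch of the neighbor-stability comparison follows. The input shape (probe ID mapped to a list of neighbor IDs) and the function names are assumptions for illustration, and the 0.85 default simply mirrors the threshold mentioned above.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Size of the intersection divided by size of the union; 1.0 for two empty sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def unstable_probes(
    last_week: dict[str, list[str]],
    this_week: dict[str, list[str]],
    threshold: float = 0.85,
) -> dict[str, float]:
    """Return probes whose neighbor-list Jaccard similarity fell below threshold."""
    flagged = {}
    for probe_id in last_week.keys() & this_week.keys():
        score = jaccard(set(last_week[probe_id]), set(this_week[probe_id]))
        if score < threshold:
            flagged[probe_id] = score
    return flagged


# Toy neighbor lists (top-4 and top-2 instead of top-20 for brevity).
last_week = {"p1": ["a", "b", "c", "d"], "p2": ["a", "b"]}
this_week = {"p1": ["a", "b", "c", "d"], "p2": ["a", "x"]}
print(unstable_probes(last_week, this_week))  # only p2 is flagged
```

Jaccard ignores rank order within the list, which keeps the check cheap. A rank-aware measure would be stricter, at the cost of more noise on corpora that change legitimately.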

Online proxy metrics

Pair offline monitors with production signals: thumbs-down rate on RAG answers, “no relevant doc found” fallback rate, and LLM-as-judge faithfulness on a sampled slice. Online metrics lag and are noisy; use them to confirm offline alerts, not as the sole detector.

Golden-set design for drift detection

A golden set tuned only to launch-week queries goes stale as fast as your product. Build it like a regression suite, not a one-time benchmark.

Stratify labels by query intent, language, and difficulty (exact match vs paraphrase vs multi-hop).
Include negative controls — queries that should retrieve nothing — to catch false-positive inflation.
Refresh quarterly from production logs with human review; retire labels tied to deleted documents.
Version the set (golden-v2026-q2) so model comparisons stay apples-to-apples.
Pin encoder config in CI: batch size, max sequence length, normalization, and API endpoint must match production.
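Several of these rules can be enforced mechanically before each run. The sketch below is an assumption-laden illustration: `GoldenQuery`, `validate_golden_set`, and the `min_per_stratum` default of 20 are invented for the example, and an empty `relevant_ids` set is used to mark a negative control.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class GoldenQuery:
    query: str
    stratum: str  # e.g. "acronym", "sku", "procedural"
    relevant_ids: set[str] = field(default_factory=set)  # empty set = negative control


def validate_golden_set(
    queries: list[GoldenQuery],
    live_doc_ids: set[str],
    min_per_stratum: int = 20,  # assumed floor, not a value from this guide
) -> list[str]:
    """Return human-readable problems; an empty list means the set passes."""
    problems = []
    counts = Counter(q.stratum for q in queries)
    for stratum, n in sorted(counts.items()):
        if n < min_per_stratum:
            problems.append(f"stratum '{stratum}' has only {n} queries")
    if all(q.relevant_ids for q in queries):
        problems.append("no negative-control queries")
    for q in queries:
        missing = q.relevant_ids - live_doc_ids
        if missing:
            problems.append(f"query '{q.query}' references deleted docs: {sorted(missing)}")
    return problems


golden = [
    GoldenQuery("SSO SAML timeout 502", "acronym", {"d1"}),
    GoldenQuery("reset a password", "procedural", {"d9"}),
]
for problem in validate_golden_set(golden, live_doc_ids={"d1"}, min_per_stratum=1):
    print(problem)
```

Running a validation like this in CI, alongside the pinned encoder config, keeps a quarterly refresh from quietly shrinking a stratum or leaving labels that point at deleted documents.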

After embedding fine-tuning, re-baseline golden metrics before cutover. The new space is intentionally different, so compare later runs against the new baseline, not against pre-fine-tune recall.

Remediation playbook

When an alert fires, triage in this order:

Confirm encoder version. Did hosting vendor, container image, or adapter weights change? Roll back if unintended.
Check index build ID. Partial re-embed jobs leave mixed vector generations in one collection — worse than uniform drift.
Segment the golden set. If only acronym queries fail, query drift or fine-tune overfit may dominate; expand labels or add a sparse leg.
Re-embed or dual-write. Model drift requires re-embedding the full corpus into a shadow index, evaluating that index on the golden set, and then swapping the alias atomically (the same pattern as incremental migration in our index update guide).
Re-fine-tune if query drift persists after re-embed — the base encoder may no longer align with your domain vocabulary.
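Two of the triage steps above lend themselves to small helpers: spotting mixed vector generations and segmenting the recall drop by stratum. The following is a sketch with hypothetical function names; the per-stratum numbers are illustrative values, not measurements from the Harbor incident.

```python
def has_mixed_generations(row_revisions: list[str]) -> bool:
    """True when one collection holds vectors from more than one encoder revision."""
    return len(set(row_revisions)) > 1


def regressed_strata(
    baseline: dict[str, float],
    current: dict[str, float],
    max_drop_pts: float = 2.0,
) -> dict[str, float]:
    """Map each stratum whose recall (percent) fell more than max_drop_pts to its drop."""
    drops = {}
    for stratum, base in baseline.items():
        if stratum in current:
            drop = base - current[stratum]
            if drop > max_drop_pts:
                drops[stratum] = round(drop, 2)
    return drops


# Illustrative recall@10 values per stratum, in percent.
baseline = {"acronym": 89.0, "faq": 89.0}
current = {"acronym": 74.0, "faq": 87.0}

print(has_mixed_generations(["rev-1", "rev-1", "rev-2"]))  # True
print(regressed_strata(baseline, current))                 # {'acronym': 15.0}
```

A drop concentrated in one stratum points toward query drift or fine-tune overfit, while a uniform drop across strata is more consistent with an encoder change.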

Document every incident with encoder hash, index version, and recall delta. That audit trail prevents repeating the same silent API upgrade mistake.

Technique decision table

Approach	Best when	Weak when
Ingest hash sync only	Static encoder, stable query mix, small corpus	Hosted APIs, fine-tuned encoders, fast-moving products
Golden-set nightly recall	Any production RAG with labeled queries	No labeled data; need bootstrap from click logs first
Centroid / neighbor stability	Catch vendor patches before labels fail	High churn corpora where neighbors should change
Online thumbs-down alerts	High-traffic assistant with feedback UI	Low volume; noisy user sentiment
Scheduled full re-embed	Quarterly hygiene regardless of alerts	Large corpora without shadow-index budget
Freeze encoder version (self-host)	Compliance-sensitive, reproducibility required	You want upstream model improvements without ops burden

Harbor Support refactor

After the silent API patch incident, Harbor added an embedding observability layer:

Nightly golden-set job (640 queries, 8 strata) with Slack alert on >2pt recall@10 drop.
Weekly 2% centroid re-encode sample; alert if mean L2 > 0.08 vs prior week.
Encoder version pinned in deployment manifest; API calls include model_revision header logged to ClickHouse.
Shadow index pipeline: new encoder builds parallel collection before alias swap.
Quarterly golden-set refresh from escalated tickets with dual annotator review.

Mean time to detect encoder regression fell from ~18 days (escalation-driven) to under 6 hours. One subsequent vendor update was caught by centroid shift the night before golden recall breached threshold. Post-refactor acronym recall@10 stabilized at 88–90% across three encoder minor versions.
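The shadow-index step in the list above depends on an alias that can be swapped and reverted in one move. The sketch below models that with an in-memory stand-in. `AliasRegistry` and `promote_if_healthy` are hypothetical names rather than any vector store's API, and the 2-point tolerance reuses the alert threshold as an assumption.

```python
class AliasRegistry:
    """In-memory stand-in for a vector store's alias table (hypothetical)."""

    def __init__(self, alias: str, target: str) -> None:
        self.alias = alias
        self.target = target
        self.previous = None  # remembered so a rollback is a single step

    def swap(self, new_target: str) -> None:
        self.previous, self.target = self.target, new_target

    def rollback(self) -> None:
        if self.previous is None:
            raise RuntimeError("no previous target to roll back to")
        self.target, self.previous = self.previous, None


def promote_if_healthy(
    registry: AliasRegistry,
    shadow: str,
    shadow_recall: float,
    baseline_recall: float,
    tolerance_pts: float = 2.0,
) -> bool:
    """Swap the alias to the shadow collection only if its golden-set recall
    (percent) is within tolerance_pts of the baseline."""
    if baseline_recall - shadow_recall > tolerance_pts:
        return False
    registry.swap(shadow)
    return True


registry = AliasRegistry("runbooks", "runbooks_v1")
print(promote_if_healthy(registry, "runbooks_v2", 75.0, 89.0), registry.target)
print(promote_if_healthy(registry, "runbooks_v2", 88.5, 89.0), registry.target)
registry.rollback()
print(registry.target)  # runbooks_v1
```

Keeping the previous target around is what makes the rollback a single operation. In a real store the old collection also has to be retained until the new one has passed a few nightly runs.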

Common pitfalls

Confusing ingest green with retrieval healthy. Hash sync does not catch model drift.
Golden set too small or homogeneous. Single-segment benchmarks miss acronym and SKU regressions.
Comparing across encoder versions without re-embed. Mixed vector generations poison both metrics and user results.
Alert fatigue. Too many thresholds dilute attention; pick recall@k plus one geometric signal.
Ignoring query drift. Re-embed fixes model drift but not a product pivot toward new vocabulary.
Reranker as crutch. Masks first-stage decay until cost or latency forces removal.
No encoder rollback plan. Shadow index work is wasted if prod cannot revert alias in one step.
Skipping normalization checks. A switch between L2 and cosine scoring, or a different max sequence length, silently changes the score scale and looks like drift.
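The normalization pitfall can be screened for with a quick norm report on a vector sample. This is a minimal sketch: `norm_report` and its tolerance are assumptions for illustration, and it only tells you whether vectors are unit length, not which distance metric the index is configured with.

```python
import math


def norm_report(vectors: list[list[float]], tol: float = 1e-3) -> dict[str, float]:
    """Summarize vector norms: min, max, and the fraction within tol of unit length."""
    if not vectors:
        raise ValueError("need at least one vector")
    norms = [math.sqrt(sum(x * x for x in v)) for v in vectors]
    unit = sum(1 for n in norms if abs(n - 1.0) <= tol)
    return {"min": min(norms), "max": max(norms), "unit_fraction": unit / len(norms)}


# Two unit-length vectors and one that was never normalized.
sample = [[1.0, 0.0], [0.6, 0.8], [3.0, 4.0]]
report = norm_report(sample)
print(report["max"], round(report["unit_fraction"], 2))  # 5.0 0.67
```

If `unit_fraction` changes between the stored sample and fresh re-encodings, the shift in distance scores may come from normalization rather than from a change in the encoder's geometry.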

Production checklist

Build stratified golden query set with human relevance labels (200+ minimum).
Schedule nightly recall@k and MRR against production index + live encoder.
Log encoder version, index build ID, and chunk schema on every eval run.
Alert on >2–3pt recall drop vs seven-day baseline.
Weekly centroid re-encode sample on 1–5% of chunks.
Track neighbor-list Jaccard stability on fixed probe chunks.
Pin hosted API model revision; block deploy if header changes unreviewed.
Maintain shadow-index pipeline for full re-embed before alias cutover.
Refresh golden set quarterly from production escalations.
Segment dashboards by query stratum (acronym, SKU, procedural, etc.).
Pair offline alerts with sampled online faithfulness checks.
Document incident playbooks: rollback encoder vs re-embed vs re-fine-tune.

Key takeaways

Embedding drift is vector-space misalignment — not the same as stale documents.
Golden-set recall@k nightly is the highest-signal monitor for production RAG.
Centroid shift and neighbor stability catch silent API patches before labels fail.
Model drift requires full re-embed; query drift may need new labels or fine-tuning.
Harbor Support cut drift detection from ~18 days to under 6 hours with stratified regression alerts.