
LLM RAG document freshness decay explained

Harbor HR’s employee policy bot indexed every handbook revision since 2019. When someone asked “How many PTO days do new hires get?” dense retrieval returned three near-identical chunks: the 2026 policy (15 days), the 2024 policy (12 days), and a 2021 archive (10 days). Cosine similarity scores clustered at 0.84, 0.83, and 0.82 — the model synthesized 12 days because that chunk had slightly richer wording. On 120 freshness-sensitive probes, outdated-policy answers hit 34% even though the correct 2026 document was always in the index.

Incremental sync was healthy and embeddings were current. The failure was in ranking: semantic similarity alone treats a retired policy and its replacement as interchangeable. Engineers added document freshness decay at query time: a monotonic recency weight derived from effective_date metadata, plus hard supersession rules wherever supersedes_doc_id links exist. Outdated-policy answers fell from 34% to 8%, and recall@5 on time-sensitive probes rose from 61% to 89%. This guide covers decay functions, version graphs, query-time versus index-time boosting, interaction with incremental index updates, the Harbor HR refactor, a decision table comparing decay with alternatives such as hard date filters and clarification gates, pitfalls, and a production checklist. It complements metadata filtering and RAG evaluation for teams where wrong-era answers carry real compliance cost.

Why semantic similarity ignores document age

Embedding models optimize for paraphrase similarity, not temporal authority. A superseded travel-expense policy and its replacement share vocabulary, structure, and section headings — often producing vectors closer to each other than to unrelated current docs. Without an explicit freshness signal, top-k retrieval becomes a lottery among versions.

This is separate from stale vectors (old text still embedded after a source edit). Freshness decay addresses stale authority: multiple valid versions coexist in the index because archives are intentionally retained for audit, legal hold, or employee grandfathering. You need ranking logic that prefers the authoritative edition without deleting history.

Freshness metadata every chunk should carry

Decay functions are only as good as timestamps and version links. Minimum viable payload fields:

  • effective_date — when the policy or doc became authoritative (ISO 8601 date, not file mtime).
  • expires_at (optional) — hard sunset for time-bounded notices; null means open-ended until superseded.
  • doc_version — monotonic integer or semver string for human debugging.
  • supersedes_doc_id / superseded_by — explicit version graph edges; stronger than date alone when backdated corrections ship.
  • status: one of active, deprecated, or archived; tombstone archived rows from default retrieval via pre-filters.
  • audience_scope (optional) — region, role, or cohort when grandfathered rules apply to subsets only.

Ingest pipelines should reject chunks missing effective_date for policy-class sources. File modification time from SharePoint or S3 is a poor proxy — editors touch formatting without changing policy substance.
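A minimal sketch of such an ingest gate, assuming a hypothetical ChunkMetadata record whose field names mirror the list above. The set of policy-class source types and the doc_id and source_type fields are illustrative assumptions, not part of any particular vector store's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

# Hypothetical source types treated as policy-class; adjust to your corpus.
POLICY_SOURCE_TYPES = {"hr_policy", "legal_notice"}
VALID_STATUSES = {"active", "deprecated", "archived"}


@dataclass(frozen=True)
class ChunkMetadata:
    doc_id: str
    source_type: str
    status: str
    effective_date: Optional[date] = None
    expires_at: Optional[date] = None
    doc_version: Optional[str] = None
    supersedes_doc_id: Optional[str] = None
    superseded_by: Optional[str] = None
    audience_scope: Optional[str] = None


def validate_for_ingest(meta: ChunkMetadata) -> list[str]:
    """Return a list of problems; an empty list means the chunk may be indexed."""
    problems = []
    if meta.status not in VALID_STATUSES:
        problems.append(f"unknown status: {meta.status!r}")
    if meta.source_type in POLICY_SOURCE_TYPES and meta.effective_date is None:
        problems.append("policy-class chunk is missing effective_date")
    if meta.expires_at and meta.effective_date and meta.expires_at < meta.effective_date:
        problems.append("expires_at precedes effective_date")
    return problems
```

Returning a list of problems rather than raising lets the pipeline route unparseable documents to a manual tagging queue instead of failing the whole batch.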

Decay functions: how to score recency

Let age_days = max(0, today - effective_date) and s be the base similarity score from dense or hybrid retrieval. Apply a multiplicative freshness factor f(age) in [0, 1]:

Exponential decay (default for policies)

f(age) = exp(-λ · age_days) with half-life t½ = ln(2) / λ. Harbor HR used λ = 0.01 (half-life ~69 days): a 2024 chunk at age 730 days gets factor 0.0007; a 2026 chunk at age 30 days gets 0.74. Final score: s' = s × f(age). Tune half-life per corpus — HR policies change yearly; security bulletins need t½ < 14 days.
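The arithmetic above can be checked with a short sketch. The function names and the fixed "today" are assumptions chosen so the two example ages come out to exactly 30 and 730 days.

```python
import math
from datetime import date


def exponential_freshness(effective_date: date, today: date, lam: float = 0.01) -> float:
    """f(age) = exp(-lam * age_days), with age clamped at zero for future-dated docs."""
    age_days = max(0, (today - effective_date).days)
    return math.exp(-lam * age_days)


def half_life_days(lam: float) -> float:
    """Half-life implied by a decay rate: ln(2) / lam."""
    return math.log(2) / lam


today = date(2026, 6, 1)  # illustrative "now"
print(round(half_life_days(0.01), 1))                            # 69.3
print(round(exponential_freshness(date(2026, 5, 2), today), 2))  # age 30 days  -> 0.74
print(round(exponential_freshness(date(2024, 6, 1), today), 4))  # age 730 days -> 0.0007
```

Passing today explicitly, rather than calling date.today() inside the function, keeps the factor reproducible in tests and in replayed query logs.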

Step / plateau decay (regulatory corpora)

Full weight while status = active; zero weight (or exclude) when deprecated. Use when legal mandates exactly one current edition and archives must never surface in default answers.

Linear ramp (news and release notes)

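f(age) = max(0, 1 - age_days / H) for horizon H (e.g. 180 days). Simpler to explain to stakeholders. Note that the tail behaves differently from exponential decay: the factor is exactly zero beyond H, whereas an exponential tail shrinks but never reaches zero.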
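The step and linear families are small enough to sketch side by side. The function names are illustrative, and the step variant assumes the status values listed earlier in this guide.

```python
def step_freshness(status: str) -> float:
    """Full weight for active editions, zero for deprecated or archived ones."""
    return 1.0 if status == "active" else 0.0


def linear_freshness(age_days: int, horizon_days: int = 180) -> float:
    """f(age) = max(0, 1 - age_days / H); reaches exactly zero at the horizon."""
    return max(0.0, 1.0 - max(0, age_days) / horizon_days)


print(linear_freshness(90))           # 0.5
print(linear_freshness(400))          # 0.0
print(step_freshness("deprecated"))   # 0.0
```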

Supersession override

When chunk A has superseded_by = B, set f = 0 for A regardless of age if B is active in the same audience scope. This fixes backdated effective_date errors and same-day publish races better than decay alone.
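One way to express the override, as a sketch: the helper below assumes a hypothetical active_docs lookup (doc id to audience scope, for active documents only) and treats "same audience scope" as simple equality, which a real deployment may need to relax.

```python
from typing import Optional


def apply_supersession(
    factor: float,
    superseded_by: Optional[str],
    audience_scope: Optional[str],
    active_docs: dict[str, Optional[str]],
) -> float:
    """Zero the freshness factor when an active successor covers the same scope.

    active_docs maps doc_id -> audience_scope for documents whose status is active.
    """
    if superseded_by is None or superseded_by not in active_docs:
        return factor  # no successor, or the successor is not active yet
    if active_docs[superseded_by] != audience_scope:
        return factor  # successor targets a different cohort; keep the old rule
    return 0.0


active = {"pto-2026": "US"}
print(apply_supersession(0.9, "pto-2026", "US", active))  # 0.0
print(apply_supersession(0.9, "pto-2026", "EU", active))  # 0.9
```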

Query-time vs index-time boosting

Query-time decay (Harbor HR approach): retrieve top-N by base similarity, re-rank with s' = s × f(age) plus supersession rules. Pros: tune λ without re-embedding; A/B test half-lives in production. Cons: extra CPU per query; must retrieve enough candidates that old high-similarity chunks do not crowd out recent moderate matches (Harbor used N=40, cut to 8 after rerank).
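A sketch of the query-time re-rank step under stated assumptions: the Candidate record, the precomputed superseded flag, and the similarity values are all illustrative, and the retrieval call that produces the pool is left out.

```python
import math
from dataclasses import dataclass
from datetime import date


@dataclass
class Candidate:
    chunk_id: str
    similarity: float         # base score s from dense or hybrid retrieval
    effective_date: date
    superseded: bool = False  # True when an active successor exists


def rerank_by_freshness(candidates, today, lam=0.01, final_k=8):
    """Re-rank a retrieved pool by s' = s * f(age), zeroing superseded chunks."""
    scored = []
    for c in candidates:
        age_days = max(0, (today - c.effective_date).days)
        factor = 0.0 if c.superseded else math.exp(-lam * age_days)
        scored.append((c.similarity * factor, c))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for _, c in scored[:final_k]]


# Illustrative pool: similarity values are made up for the example.
pool = [
    Candidate("pto-2021", 0.84, date(2021, 3, 1)),
    Candidate("pto-2024", 0.83, date(2024, 1, 1)),
    Candidate("pto-2026", 0.82, date(2026, 1, 1)),
]
top = rerank_by_freshness(pool, today=date(2026, 2, 1), final_k=2)
print([c.chunk_id for c in top])  # ['pto-2026', 'pto-2024']
```

Because lam is a plain argument, an A/B test can vary it per request without touching the index.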

Index-time boosting: bake decay into stored scores or maintain a separate freshness_vector dimension. Pros: cheaper at query time. Cons: scores go stale daily unless a scheduled job refreshes factors; harder to experiment.

Metadata pre-filter: restrict to effective_date >= cutoff when queries imply “current” (“today’s policy”, “latest”). Pair with decay rather than replacing it — users rarely say “current” explicitly.

For hybrid BM25+dense pipelines, apply decay after reciprocal rank fusion so lexical ties on boilerplate headings do not bypass recency.
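As an illustration of that ordering, the sketch below fuses two ranked lists with reciprocal rank fusion and only then multiplies in the freshness factor. The constant k=60 is a commonly used default rather than a Harbor HR setting, and treating a missing factor as 1.0 is an assumption.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of chunk ids: score(d) = sum over lists of 1 / (k + rank)."""
    fused = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            fused[chunk_id] = fused.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return fused


def fuse_then_decay(bm25_ranking, dense_ranking, freshness, k=60):
    """Apply the freshness factor to the fused score, not to either input list."""
    fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking], k=k)
    rescored = {cid: score * freshness.get(cid, 1.0) for cid, score in fused.items()}
    return sorted(rescored, key=rescored.get, reverse=True)


bm25 = ["pto-2024", "pto-2026", "pto-2021"]
dense = ["pto-2024", "pto-2021", "pto-2026"]
freshness = {"pto-2026": 0.74, "pto-2024": 0.0007, "pto-2021": 0.0}
print(fuse_then_decay(bm25, dense, freshness))  # ['pto-2026', 'pto-2024', 'pto-2021']
```

In this toy input the 2024 chunk wins both lists before decay, which is the situation the paragraph above warns about.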

When users need historical answers

Not every query wants the newest doc. Signals for historical mode:

  • Explicit time phrases (“in 2022”, “before the merger”).
  • Comparative intent (“how did PTO change?”) — route to a timeline synthesis path with multiple versions in context.
  • Audit / legal role metadata on the user session.

Disable or invert decay when historical mode fires; otherwise compliance officers cannot retrieve retired rules they are required to cite. This overlaps with clarification gates when scope is unclear (“Which plan year?”).
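A deliberately simple heuristic for the first three signals might look like the sketch below. The regular expressions and role names are assumptions for illustration; phrases such as "before the merger" would need a richer pattern set or a trained classifier.

```python
import re

# Heuristic patterns only; a production router would likely use a classifier.
YEAR_PATTERN = re.compile(
    r"\b(?:in|during|before|after|since|as of)\s+(?:19|20)\d{2}\b", re.I
)
COMPARATIVE_PATTERN = re.compile(
    r"\b(?:how did|what changed|used to|previously|history of)\b", re.I
)
AUDIT_ROLES = {"audit", "legal"}  # hypothetical session role names


def wants_history(query: str, user_role: str = "") -> bool:
    """Return True when decay should be bypassed for this query."""
    if user_role in AUDIT_ROLES:
        return True
    return bool(YEAR_PATTERN.search(query) or COMPARATIVE_PATTERN.search(query))


print(wants_history("What was the PTO policy in 2022?"))     # True
print(wants_history("How many PTO days do new hires get?"))  # False
```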

Harbor HR refactor (worked example)

Before: 14,200 policy chunks; nightly incremental sync via content-hash diff; pure cosine top-8; no date metadata on 22% of legacy PDFs.

  • Backfill effective_date from legal’s version registry; flag unparseable PDFs for manual tagging.
  • Build supersession graph from “Replaces document ID” fields in the source CMS.
  • Retrieve 40 candidates; apply exponential decay (λ = 0.01, t½ ≈ 69 days) + supersession zeroing; cross-encoder rerank top 12 to final 8.
  • Eval: 120 time-sensitive probes + 200 ahistorical controls; track outdated-policy rate and unnecessary-archive rate separately.

After: outdated-policy rate 34%→8%; p95 latency +11 ms; archive-only answers on ahistorical controls 2% (within tolerance); manual policy tickets −41% over six weeks.
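Tracking the two rates separately could be sketched as follows. The ProbeResult fields and the exact metric definitions are assumptions made for illustration; the source does not spell out how Harbor HR labelled each answer.

```python
from dataclasses import dataclass


@dataclass
class ProbeResult:
    probe_set: str         # "time_sensitive" or "control"
    cited_status: str      # status of the document the answer relied on
    cited_is_latest: bool  # True when that document is the authoritative edition


def outdated_policy_rate(results):
    """Share of time-sensitive probes answered from a non-authoritative edition."""
    probes = [r for r in results if r.probe_set == "time_sensitive"]
    return sum(not r.cited_is_latest for r in probes) / len(probes) if probes else 0.0


def archive_only_rate(results):
    """Share of control probes answered from an archived document."""
    controls = [r for r in results if r.probe_set == "control"]
    return sum(r.cited_status == "archived" for r in controls) / len(controls) if controls else 0.0
```

Keeping the two functions separate means a decay change that helps one probe set cannot hide a regression in the other behind a single blended score.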

Technique decision table

Approach | Best when | Weak when
Similarity-only retrieval | Evergreen technical docs; single authoritative version per topic | Versioned policies, price lists, org charts, security advisories
Query-time freshness decay | Multiple coexisting versions; need tunable half-life; incremental sync already works | Massive candidate pools without rerank budget; missing date metadata
Hard effective_date filter | Queries always imply “current”; single active flag reliable | Grandfathered cohorts; historical comparisons; backdated corrections
Supersession graph only | Clean CMS with explicit replace links; legal needs full archive | Organic wiki sprawl without version discipline
Delete old chunks on publish | No audit requirement; minimize storage | Compliance retention; “what changed?” questions; rollback
Clarification gate (“which year?”) | Ambiguous scope with small candidate year set | Users expect silent current-default; high friction on mobile

Freshness decay pairs with cross-encoder reranking when decay alone leaves semantically noisy top-k; rerankers are relatively blind to calendar metadata unless you inject effective_date into the passage header at index time.

Common pitfalls

  • Using file mtime as effective_date — cosmetic edits reset decay incorrectly; substantive backdates break supersession order.
  • Decay without supersession — two active versions with the same effective_date still tie; graph edges disambiguate.
  • Over-aggressive half-life — valid evergreen pages (ethics principles, API concepts) get suppressed; scope decay by content_class.
  • Retrieving too few candidates — top-8 by similarity may be all legacy; expand pool before decay rerank.
  • Ignoring audience_scope — US policy ranks over EU grandfathered rules for the wrong employee.
  • No eval split — tuning λ on the same set you ship hides regressions on historical-intent queries.
  • Assuming incremental sync fixes ranking — fresh embeddings ≠ fresh authority; both layers matter.
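Scoping decay by content_class, as the over-aggressive half-life pitfall suggests, can be as small as a lookup table. The class names and half-life values below are illustrative defaults, not recommendations; None marks evergreen content that should not decay.

```python
from typing import Optional

# Illustrative defaults; None means "evergreen, do not decay".
HALF_LIFE_DAYS: dict[str, Optional[float]] = {
    "hr_policy": 69.0,
    "security_bulletin": 10.0,
    "ethics_principles": None,
    "api_concepts": None,
}


def freshness_for_class(content_class: str, age_days: int) -> float:
    """Exponential decay with a per-class half-life; unknown classes are not decayed."""
    half_life = HALF_LIFE_DAYS.get(content_class)
    if half_life is None:
        return 1.0
    # 0.5 ** (age / half_life) is the same curve as exp(-ln(2) * age / half_life).
    return 0.5 ** (max(0, age_days) / half_life)


print(freshness_for_class("security_bulletin", 10))    # 0.5
print(freshness_for_class("ethics_principles", 2000))  # 1.0
```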

Production checklist

  • Require effective_date and status on versioned source types.
  • Model supersession links at ingest; validate no cycles in the version graph.
  • Choose decay family and half-life per content class; document defaults.
  • Retrieve expanded candidate set; apply decay + supersession; then rerank.
  • Pre-filter status != archived for default employee-facing bots.
  • Implement historical-mode detection to bypass decay when appropriate.
  • Log base score, decay factor, and final score for debugging stale answers.
  • Build time-sensitive eval probes separate from general RAG QA.
  • Re-tune half-life after major corpus reorganization or CMS migration.
  • Coordinate with incremental sync owners so tombstones and decay rules agree.
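For the cycle-validation item in the checklist, a minimal sketch follows. It assumes each document has at most one direct successor, so the graph can be held as a plain dict; the function name is hypothetical.

```python
def find_supersession_cycle(superseded_by: dict[str, str]):
    """Return one cycle as a list of doc ids, or None if the graph is acyclic.

    superseded_by maps each doc id to its direct successor (at most one per doc).
    """
    cleared = set()  # ids already proven to lead to the end of a chain
    for start in superseded_by:
        path, seen = [], set()
        node = start
        while node in superseded_by and node not in cleared:
            if node in seen:
                return path[path.index(node):]
            seen.add(node)
            path.append(node)
            node = superseded_by[node]
        cleared.update(path)
    return None


print(find_supersession_cycle({"pto-2021": "pto-2024", "pto-2024": "pto-2026"}))  # None
print(find_supersession_cycle({"a": "b", "b": "a"}))                               # ['a', 'b']
```

Running this at ingest, before any supersession zeroing is applied, avoids the case where two documents each zero the other and neither surfaces.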

Key takeaways

  • Semantic similarity treats retired policies and their replacements as near-duplicates — freshness must be explicit in ranking, not assumed from ingest cadence.
  • Exponential decay with supersession graph overrides is the default pattern for versioned policy corpora; tune half-life per content class.
  • Query-time decay lets you A/B tune recency without re-embedding; expand the candidate pool so old high-similarity chunks do not block reranking.
  • Harbor HR cut outdated-policy answers from 34% to 8% with +11 ms p95 latency — ranking fix, not a new embedding model.
  • Pair freshness decay with historical-mode routing and clarification gates when users need past editions or ambiguous plan years.
