LLM RAG evaluation and retrieval metrics explained
Harbor Legal shipped an internal policy Q&A bot over 4,200 pages of employee handbook, vendor contracts, and state privacy addenda. Product managers celebrated a polished demo: answers sounded authoritative and cited paragraph numbers. Compliance ran a blind audit of 120 real questions auditors had asked in the prior year. 41% of bot answers were fluent but unsupported — the cited chunk did not contain the claim, or the correct policy lived in a document the retriever never surfaced. Worse, manual spot checks on “looks good” samples had masked the failure because reviewers remembered the policies and mentally filled gaps the model invented.
RAG quality is not one number. Retrieval metrics ask whether the right evidence arrived in the context window. Generation metrics ask whether the answer faithfully uses that evidence. Conflating them sends teams to rerankers when the embedding index is stale, or to bigger models when retrieval already missed the only relevant clause. This guide covers offline retrieval scores (Precision@k, Recall@k, MRR, nDCG), end-to-end RAG scores (faithfulness, answer relevance, context precision), golden-set design, human vs online evaluation loops, the Harbor Legal refactor, a technique decision table, pitfalls, and a checklist.
Why split retrieval from generation evaluation
A RAG pipeline has two stochastic stages. Stage one maps a query to chunks; stage two conditions generation on those chunks. Failure modes are distinct:
- Retrieval miss — correct document exists but never ranks in top-k.
- Retrieval noise — plausible but wrong chunks crowd out signal.
- Generation drift — retrieved text is correct but the model paraphrases beyond evidence.
- Generation ignore — model answers from parametric memory and treats context as decoration.
Measuring only end-user thumbs-up blends these failures. Harbor Legal learned that its 41% of unsupported answers decomposed into 27 points of retrieval miss, 9 points of noise after a bad chunking migration, and 5 points of pure hallucination on otherwise good context. Fixing retrieval first (re-embed, hybrid BM25, parent-child chunks) lifted faithfulness to 78% before any generator swap. Teams that jump straight to Corrective RAG without per-stage metrics often cannot tell whether the grader or the index was broken.
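The per-stage split can be sketched as a small triage function. This is an illustrative sketch, not Harbor Legal's actual tooling: the `Failure` enum, the `triage` function, and the `noise_threshold` cutoff are all assumptions, and the two generation failure modes (drift and ignore) are lumped together because telling them apart needs more than document IDs.

```python
from enum import Enum


class Failure(Enum):
    RETRIEVAL_MISS = "retrieval_miss"
    RETRIEVAL_NOISE = "retrieval_noise"
    GENERATION = "generation"
    NONE = "none"


def triage(gold_ids, retrieved_ids, answer_supported, noise_threshold=0.5):
    """Attribute one golden-set row to a pipeline stage.

    gold_ids: document IDs labeled relevant for the query.
    retrieved_ids: IDs the retriever returned, in rank order.
    answer_supported: True if the answer was judged grounded in context.
    noise_threshold: hypothetical cutoff; below this share of relevant
        chunks, an unsupported answer is blamed on retrieval noise.
    """
    gold = set(gold_ids)
    hits = [d for d in retrieved_ids if d in gold]
    if not hits:
        # No gold document reached the context window at all.
        return Failure.RETRIEVAL_MISS
    if answer_supported:
        return Failure.NONE
    precision = len(hits) / len(retrieved_ids)
    if precision < noise_threshold:
        # Gold evidence arrived but was crowded out by wrong chunks.
        return Failure.RETRIEVAL_NOISE
    # Context was mostly good, so the generator drifted or ignored it.
    return Failure.GENERATION


print(triage(["a"], ["x", "y"], answer_supported=False))
```

Counting the returned labels over a golden set gives the kind of breakdown described above, provided each row stores gold document IDs.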
Retrieval metrics: Precision@k, Recall@k, MRR, and nDCG
Offline retrieval evaluation needs a golden set: tuples of (query, relevant_doc_ids[]) labeled by domain experts. Given a fixed k (often 5, 10, or 20), compute:
Precision@k and Recall@k
Precision@k is the fraction of the top-k retrieved chunks that are relevant. High precision means little noise in the context window — critical when k is small and every token counts toward context budget. Recall@k is the fraction of all known relevant documents that appear in top-k. Low recall with high precision usually means the index is precise but incomplete: synonyms, acronyms, or tables are missing from embeddings.
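A minimal sketch of both scores for a single query, assuming binary relevance labels and string document IDs (the function names and sample IDs are illustrative). Precision divides by k even when fewer than k chunks come back, which is one common convention; some toolkits divide by the number actually retrieved instead.

```python
def precision_at_k(retrieved, relevant, k):
    """Fraction of the top-k retrieved IDs that are labeled relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    relevant = set(relevant)
    hits = sum(1 for doc_id in retrieved[:k] if doc_id in relevant)
    return hits / k


def recall_at_k(retrieved, relevant, k):
    """Fraction of all labeled-relevant IDs that appear in the top-k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0  # assumption: unlabeled queries score zero
    found = relevant & set(retrieved[:k])
    return len(found) / len(relevant)


retrieved = ["d3", "d7", "d1", "d9", "d4"]
relevant = ["d1", "d2", "d3"]
print(precision_at_k(retrieved, relevant, 5))  # 2 of 5 retrieved are relevant
print(recall_at_k(retrieved, relevant, 5))     # 2 of 3 relevant were found
```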
Mean reciprocal rank (MRR)
MRR averages 1 / rank_of_first_relevant across queries. It rewards getting any correct chunk to the top. It is useful when one gold paragraph is enough to answer, and less useful when answers require synthesizing three dispersed sections (compliance cross-refs, multi-step procedures).
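As a sketch, MRR can be computed as below. It assumes the usual convention that a query with no relevant chunk retrieved contributes 0; the function names and sample IDs are illustrative.

```python
def reciprocal_rank(retrieved, relevant):
    """Return 1 / rank of the first relevant ID, or 0.0 if none is found."""
    relevant = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def mean_reciprocal_rank(runs):
    """Average reciprocal rank over (retrieved, relevant) pairs."""
    if not runs:
        return 0.0
    return sum(reciprocal_rank(r, g) for r, g in runs) / len(runs)


runs = [
    (["a", "b"], ["a"]),        # first relevant at rank 1
    (["x", "y", "z"], ["z"]),   # first relevant at rank 3
    (["q"], ["w"]),             # nothing relevant retrieved
]
print(mean_reciprocal_rank(runs))  # (1 + 1/3 + 0) / 3
```

Because only the first hit counts, a synthesis question that needs three sections scores the same whether one or all three were retrieved.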
Normalized discounted cumulative gain (nDCG@k)
nDCG@k handles graded relevance (marginally related vs authoritative) and penalizes burying the best doc at rank 9. Harbor Legal weighted “binding policy” chunks 3x over “FAQ summary” chunks. After a reranker deploy, MRR rose 8 points but nDCG barely moved — exposing that the cross-encoder promoted readable summaries over controlling statutes. nDCG is the metric to watch when legal, medical, or finance teams care about which source wins, not just whether one acceptable hit exists.
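The gap between MRR and nDCG can be reproduced with a toy example. This sketch assumes linear gains with a log2 rank discount (one of several common DCG variants), unique retrieved IDs, and a hypothetical gain table where a binding-policy chunk is worth 3 and an FAQ summary is worth 1.

```python
import math


def dcg(gains):
    """Discounted cumulative gain: gain at rank i divided by log2(i + 1)."""
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))


def ndcg_at_k(retrieved, gain_by_doc, k):
    """nDCG@k with graded gains; unlabeled documents count as gain 0."""
    gains = [gain_by_doc.get(doc_id, 0.0) for doc_id in retrieved[:k]]
    ideal = sorted(gain_by_doc.values(), reverse=True)[:k]
    ideal_dcg = dcg(ideal)
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0


# Hypothetical labels: the binding policy is weighted 3x the FAQ summary.
gain_by_doc = {"policy": 3.0, "faq": 1.0}

summary_first = ndcg_at_k(["faq", "policy"], gain_by_doc, k=10)
policy_first = ndcg_at_k(["policy", "faq"], gain_by_doc, k=10)
print(round(summary_first, 4), round(policy_first, 4))
```

Both rankings have a reciprocal rank of 1.0 because an acceptable chunk sits at rank 1, yet nDCG drops to roughly 0.80 when the summary outranks the policy.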
Report retrieval metrics per segment (HR vs vendor vs state addenda). Aggregate dashboards hid Harbor’s vendor-contract bucket at 0.31 Recall@10 while HR looked healthy at 0.81.
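Segment-level reporting is mostly a grouping step. The sketch below assumes each golden row is a dict with hypothetical `segment`, `retrieved`, and `gold` keys; the sample rows are made up to show how a healthy average can hide a weak bucket.

```python
from collections import defaultdict


def recall_at_k(retrieved, relevant, k):
    """Fraction of labeled-relevant IDs that appear in the top-k."""
    relevant = set(relevant)
    if not relevant:
        return 0.0
    return len(relevant & set(retrieved[:k])) / len(relevant)


def recall_by_segment(rows, k):
    """Mean Recall@k per segment tag."""
    scores = defaultdict(list)
    for row in rows:
        scores[row["segment"]].append(
            recall_at_k(row["retrieved"], row["gold"], k)
        )
    return {seg: sum(vals) / len(vals) for seg, vals in scores.items()}


rows = [
    {"segment": "hr", "retrieved": ["h1", "h2"], "gold": ["h1"]},
    {"segment": "hr", "retrieved": ["h9", "h3"], "gold": ["h3", "h4"]},
    {"segment": "vendor", "retrieved": ["v5", "v6"], "gold": ["v1"]},
    {"segment": "vendor", "retrieved": ["v2", "v7"], "gold": ["v2", "v3"]},
]
per_segment = recall_by_segment(rows, k=10)
overall = sum(recall_at_k(r["retrieved"], r["gold"], 10) for r in rows) / len(rows)
print(per_segment, overall)
```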
End-to-end RAG metrics: faithfulness, relevance, and context scores
Once chunks are fixed in a test harness, evaluate the full retrieve-then-generate path. Frameworks like RAGAS and TruLens popularized a consistent vocabulary:
- Faithfulness (groundedness) — are answer claims supported by retrieved context? Binary or NLI-based entailment per sentence.
- Answer relevance — does the response address the user question without padding or topic drift?
- Context precision — of retrieved chunks, how many were needed to produce a correct answer?
- Context recall — of all information required to answer, how much appeared in retrieval?
Faithfulness is the compliance metric: Harbor blocked production rollout until faithfulness on the golden set exceeded 85%. Answer relevance catches polite refusals and over-cautious “consult legal” boilerplate that is technically grounded but useless.
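Per-sentence faithfulness can be sketched as the share of answer sentences that at least one retrieved chunk supports. This is not the RAGAS or TruLens implementation: `supports` is a hypothetical callable standing in for an NLI model or LLM judge, the sentence splitter is deliberately naive, and `toy_supports` is a lexical stand-in used only so the example runs. The sample policy text is invented.

```python
import re


def split_sentences(text):
    """Naive splitter on ., !, ? followed by whitespace (assumption)."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def faithfulness(answer, contexts, supports):
    """Share of answer sentences supported by at least one context chunk."""
    claims = split_sentences(answer)
    if not claims:
        return 0.0
    supported = sum(
        1 for claim in claims if any(supports(ctx, claim) for ctx in contexts)
    )
    return supported / len(claims)


def toy_supports(context, claim):
    """Stand-in checker: exact substring match, NOT real entailment."""
    return claim.rstrip(".!?").lower() in context.lower()


contexts = ["Employees accrue 15 days of leave per year. Leave resets in January."]
answer = "Employees accrue 15 days of leave per year. Unused leave is paid out."
score = faithfulness(answer, contexts, toy_supports)
print(score)  # one of two sentences is supported
```

A release gate then reduces to comparing the mean score over the golden set against the chosen threshold.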
Pair end-to-end scores with response grounding checks at serving time for high-risk queries — offline faithfulness on June docs does not guarantee January answers after policy updates.
Building and maintaining golden Q&A sets
A golden set is a living contract between product and domain experts. Harbor Legal’s final set had 340 questions stratified by difficulty:
- Lookup (40%) — single fact, one gold chunk.
- Synthesis (35%) — combine two policies (e.g., remote work + state tax nexus).
- Adversarial (25%) — outdated FAQ vs current handbook, near-duplicate vendor clauses, acronym collisions.
Each row stores: question text, gold document IDs (and optionally a gold answer span), expected answer type (yes/no, list, procedure), and a valid_until date for when the underlying policy expires. Without expiry, golden sets produce false negatives after reorgs: the bot improves but metrics fall because labels are stale.
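One way to represent such a row is a small dataclass plus a filter that drops expired rows before scoring. Apart from `valid_until`, the field names and the sample questions are assumptions for illustration, not Harbor Legal's schema.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional, Tuple


@dataclass(frozen=True)
class GoldenRow:
    question: str
    gold_doc_ids: Tuple[str, ...]
    answer_type: str                      # e.g. "yes/no", "list", "procedure"
    valid_until: Optional[date] = None    # None means no known expiry
    gold_answer_span: Optional[str] = None


def active_rows(rows, today):
    """Keep rows whose labels are still valid on the evaluation date."""
    return [
        row for row in rows
        if row.valid_until is None or row.valid_until >= today
    ]


rows = [
    GoldenRow("Is remote work allowed abroad?", ("hb-12",), "yes/no",
              valid_until=date(2024, 6, 30)),
    GoldenRow("List approved vendors.", ("vc-03", "vc-07"), "list"),
]
current = active_rows(rows, today=date(2025, 1, 15))
print([row.question for row in current])
```

Expired rows are better archived than deleted, so a later audit can still see which labels applied to which index snapshot.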
Version golden sets like code: golden-v3.2.json tied to index snapshot IDs. When Harbor re-chunked vendor PDFs, they re-ran retrieval-only eval before touching the generator. Regression CI failed twice on Recall@10 before the change shipped, sparing the team another audit embarrassment.
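A retrieval-only regression gate can be as small as a comparison against stored baseline scores. In this sketch the manifest keys, the snapshot ID, the metric names, the scores, and the 0.01 tolerance are all hypothetical; a real pipeline would load them from the versioned golden-set file and the eval run.

```python
def regressions(baseline, current, tolerance=0.01):
    """Return metric names that dropped more than `tolerance` or are missing."""
    failed = []
    for name, base in baseline.items():
        now = current.get(name)
        if now is None or now < base - tolerance:
            failed.append(name)
    return failed


def gate(baseline, current, tolerance=0.01):
    """Exit code for CI: 0 to allow the deploy, 1 to block it."""
    return 1 if regressions(baseline, current, tolerance) else 0


# Hypothetical manifest tying a golden-set version to an index snapshot.
manifest = {
    "golden_set": "golden-v3.2.json",
    "index_snapshot": "snapshot-example-001",
    "baseline": {"recall@10": 0.71, "ndcg@10": 0.62},
}
current_scores = {"recall@10": 0.66, "ndcg@10": 0.63}

failed = regressions(manifest["baseline"], current_scores)
print(failed, gate(manifest["baseline"], current_scores))
```

Treating a missing metric as a failure keeps a broken eval job from passing silently.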
Human labeling, LLM-as-judge, and online feedback
When humans still win
Expert labelers remain the ground truth for initial golden sets and for disputing LLM judges. Harbor used two human passes on adversarial rows; a Cohen’s kappa below 0.7 was treated as a sign that the rubric was ambiguous, not as a verdict on the model.
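Cohen's kappa corrects raw agreement for the agreement two raters would reach by chance. A minimal sketch for two labelers over the same rows follows; the "S" (supported) and "U" (unsupported) labels are invented sample data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each rater's marginal rate per label.
    expected = sum(
        count_a[label] * count_b[label]
        for label in set(labels_a) | set(labels_b)
    ) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)


rater_1 = ["S", "S", "S", "U", "U", "S", "U", "S", "S", "U"]
rater_2 = ["S", "S", "U", "U", "U", "S", "S", "S", "S", "U"]
kappa = cohens_kappa(rater_1, rater_2)
print(round(kappa, 4))
```

Here the raters agree on 8 of 10 rows, but kappa is about 0.58, which would fall below a 0.7 bar and send the rubric back for revision.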
LLM-as-judge trade-offs
A strong model scoring faithfulness scales to thousands of rows cheaply. Risks: position bias (prefers first chunk), leniency toward fluent hallucinations, and rubric drift when the judge model updates. Mitigate with chain-of-thought scoring templates, calibration on 50 human-labeled rows per month, and never letting the judge model be identical to the generator without a held-out variant.
Online signals
Thumbs-down, “wrong citation” clicks, and escalation to human agents are lagging indicators but catch distribution shift. Harbor pipes disputed tickets back into the adversarial bucket monthly. See online evaluation for A/B harness design that does not conflate retrieval experiments with prompt tweaks.
Harbor Legal refactor: 59% to 91% faithfulness in six weeks
- Week 1–2: built golden-v1 (120 rows) and measured the baseline: Recall@10 0.54, faithfulness 0.59.
- Week 3: hybrid dense+BM25, fixed table extraction on vendor PDFs, Recall@10 0.71.
- Week 4: parent-child chunking for cross-referenced handbook sections.
- Week 5: cross-encoder reranker on top-50, nDCG@10 +0.19.
- Week 6: citation-forced prompt (“quote supporting sentence before conclusion”) plus Self-RAG-style IsSUP check on 12% of traffic.
Final golden-v2 (340 rows): faithfulness 0.91, answer relevance 0.87, unsupported audit rate 6% (down from 41%).
The key cultural shift: no deploy without a retrieval-only dashboard. Product could no longer argue “GPT-5 will fix it” when Recall@10 on vendor contracts was 0.31.
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Manual spot checks | Early demos, qualitative tone | Any compliance or revenue-critical bot |
| Retrieval-only golden eval | Index, chunking, embedding experiments | Measuring hallucination on good context |
| End-to-end RAGAS-style suite | Pre-release gates, regression CI | Labeling cost on niche domains without experts |
| LLM-as-judge at scale | Nightly regression on 1k+ rows | High-stakes without human calibration |
| Online thumbs + escalation mining | Detecting drift after policy changes | Low traffic; noisy without volume |
Common pitfalls
- Single aggregate score — masks segment collapse (Harbor vendor bucket).
- Gold labels without doc IDs — cannot debug retrieval vs generation.
- Evaluating on training questions — memorized chunk boundaries inflate Recall.
- Ignoring k sensitivity — Precision@3 can look fine while Recall@20 is catastrophic for synthesis questions.
- Judge-model same as generator — shared blind spots inflate faithfulness.
- Stale golden sets — punishes correct answers after doc updates.
- Skipping retrieval-only CI — reranker and CRAG layers hide rotten indexes.
Designer and engineer checklist
- Define retrieval and generation metrics separately; publish both on dashboards.
- Build stratified golden set with doc IDs, difficulty tags, and expiry dates.
- Compute Precision@k, Recall@k, MRR, and nDCG@k per segment weekly.
- Run retrieval-only regression in CI before generator or prompt changes.
- Measure faithfulness, answer relevance, context precision on full RAG path.
- Calibrate LLM judges against human labels; rotate judge model version.
- Log retrieved chunk IDs on every production answer for dispute replay.
- Mine escalations and thumbs-down into adversarial golden rows monthly.
- Version golden sets with index snapshot IDs; block deploy on regression.
- Pair offline scores with online eval for policy and catalog drift.
Key takeaways
- RAG quality is two problems — retrieval and generation need separate metrics.
- nDCG and Recall@k catch failures that MRR hides, such as readable summaries outranking authoritative sources.
- Faithfulness is the compliance gate — fluent wrong citations are worse than “I don’t know.”
- Golden sets are products — versioned, stratified, expiry-dated, and fed by production disputes.
- Harbor cut unsupported audit answers 41% → 6% by fixing retrieval before upgrading the generator.
Related reading
- Reranking — two-stage retrieval and cross-encoder scoring
- Corrective RAG — grade-then-branch when first-pass retrieval fails
- Self-RAG — reflection tokens for retrieval and support critique
- Agent evaluation — tool trajectories beyond single-turn RAG