
LLM Corrective RAG (CRAG) explained

Harbor Analytics’ internal metrics assistant indexed 14,000 Confluence pages and 2,200 dbt model descriptions. Product managers asked questions like “What is churn rate for Harbor Pro in Q1 excluding trials?” Dense retrieval routinely returned onboarding docs that mentioned churn but defined a different metric, and the generator synthesized a plausible number from the wrong definition. On 110 metric-intent probes, 31% of answers used the wrong metric even though the correct dbt doc existed in the index.

Engineers added Corrective RAG (CRAG): after the first vector search, a lightweight retrieval evaluator labels each hit as correct, incorrect, or ambiguous. Batches graded incorrect trigger a fallback path (BM25 keyword search on SKU and metric names). Batches graded ambiguous go through knowledge refinement, which strips irrelevant sentences from chunks before synthesis. Wrong-metric answers fell from 31% to 8%, and context precision on the eval set rose from 52% to 79%. This guide covers the CRAG control loop, evaluator design, refinement and fallback routing, the Harbor Analytics refactor, a technique decision table versus reranking-only and full agentic RAG, pitfalls, and a production checklist.

The CRAG control loop

Standard RAG assumes retrieval quality is good enough to generate. CRAG inserts an explicit grade-then-branch step between retrieve and generate:

  1. Retrieve top-k chunks from the primary index (usually dense or hybrid).
  2. Evaluate each chunk (or the batch) for relevance to the user query.
  3. Branch on the evaluator label:
    • Correct — pass chunks to the generator unchanged (optionally after reranking).
    • Incorrect — discard vector hits; run fallback retrieval (keyword search, secondary index, or approved web search).
    • Ambiguous — keep partial signal but refine chunks or decompose the query before a second retrieve pass.
  4. Generate from the corrected context set.
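
The four steps above can be sketched as one function. This is a minimal sketch under stated assumptions, not a reference implementation: `retrieve`, `grade`, `fallback`, `refine`, and `generate` are hypothetical callables you would supply, and the batch rule used here (any correct chunk wins, otherwise refine ambiguous chunks, otherwise fall back) is only one reasonable reading of the branch step.

```python
from typing import Callable, List, Optional

def crag_answer(
    query: str,
    retrieve: Callable[[str], List[str]],
    grade: Callable[[str, str], str],  # "correct" | "ambiguous" | "incorrect"
    fallback: Callable[[str], List[str]],
    refine: Callable[[str, List[str]], List[str]],
    generate: Callable[[str, List[str]], str],
) -> dict:
    """Grade-then-branch loop; every callable is a hypothetical stand-in."""
    chunks = retrieve(query)                      # step 1: retrieve
    labels = [grade(query, c) for c in chunks]    # step 2: evaluate

    # Step 3: branch on the evaluator labels.
    correct = [c for c, lab in zip(chunks, labels) if lab == "correct"]
    ambiguous = [c for c, lab in zip(chunks, labels) if lab == "ambiguous"]
    if correct:
        branch, context = "correct", correct
    elif ambiguous:
        branch, context = "ambiguous", refine(query, ambiguous)
    else:
        # Nothing usable (or nothing retrieved): discard vector hits.
        branch, context = "incorrect", fallback(query)

    answer: Optional[str] = None
    if context:                                   # step 4: generate
        answer = generate(query, context)
    # answer stays None when every branch came back empty (abstain).
    return {"branch": branch, "answer": answer}
```

Returning the branch name alongside the answer keeps the routing decision visible to whatever logging sits around the call.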

CRAG is narrower than agentic RAG: it does not require the main LLM to plan multi-step tool loops. The evaluator is typically a fast cross-encoder, a small classifier, or a structured LLM call with a fixed rubric and a closed label set. The added latency is a single grading pass, often 30–80 ms, and a reranker you already run can usually serve as the evaluator.

Retrieval evaluator design

The evaluator is the product decision in CRAG. Three common implementations:

Cross-encoder relevance scores

Score each (query, chunk) pair with a cross-encoder (e.g. BGE-reranker, Cohere Rerank). Map scores to tiers: above 0.72 = correct, 0.45–0.72 = ambiguous, below 0.45 = incorrect. Thresholds must be tuned on your golden QA set, not vendor defaults.
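
The score-to-tier mapping can be written down directly. The sketch below uses the example thresholds from the paragraph above and assumes scores are already normalized to the 0–1 range (some cross-encoders emit raw logits, which would need a sigmoid first). The function names are illustrative, and the choice to treat a score of exactly 0.72 as ambiguous is an assumption.

```python
from typing import List, Tuple

# Example thresholds from the text; tune them on your own golden QA set.
CORRECT_ABOVE = 0.72
AMBIGUOUS_FROM = 0.45

def tier(score: float) -> str:
    """Map a normalized relevance score to a CRAG tier."""
    if score > CORRECT_ABOVE:
        return "correct"
    if score >= AMBIGUOUS_FROM:
        return "ambiguous"
    return "incorrect"

def tier_batch(scored: List[Tuple[str, float]]) -> List[Tuple[str, str]]:
    """Attach a tier to each (chunk, score) pair, preserving order."""
    return [(chunk, tier(score)) for chunk, score in scored]
```

Keeping the thresholds as named constants makes it easy to sweep them during tuning instead of editing the branching logic.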

LLM rubric grading

Prompt a small model: “Does this passage contain information sufficient to answer the question? Reply Correct, Ambiguous, or Incorrect.” This is cheaper than full generation but slower than a cross-encoder at high k. It works best when chunks are long and scalar scores mis-rank borderline policy text.
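
One way to wire this up is a fixed prompt template plus a strict parser for the reply. In the sketch below, `call_model` is a hypothetical callable that sends a prompt to whatever small model you use and returns its text. Mapping unparseable replies to "ambiguous" is an assumption, not something the rubric prescribes.

```python
from typing import Callable

LABELS = ("correct", "ambiguous", "incorrect")

PROMPT = (
    "Question: {question}\n"
    "Passage: {passage}\n"
    "Does this passage contain information sufficient to answer the question? "
    "Reply Correct, Ambiguous, or Incorrect."
)

def parse_label(reply: str) -> str:
    """Read the label from the first word of the reply.

    Matching the first word (not a substring) matters because
    "incorrect" contains "correct".
    """
    words = reply.strip().lower().split()
    first = words[0].strip(".,:;!\"'") if words else ""
    # Assumption: anything outside the closed label set is treated as ambiguous.
    return first if first in LABELS else "ambiguous"

def grade_with_llm(question: str, passage: str,
                   call_model: Callable[[str], str]) -> str:
    """Grade one (question, passage) pair; call_model is hypothetical."""
    prompt = PROMPT.format(question=question, passage=passage)
    return parse_label(call_model(prompt))
```

Passing the model call in as a parameter keeps the grader testable with a stub and independent of any particular client library.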

Batch-level vs per-chunk grading

Per-chunk labels enable surgical refinement (keep doc A, drop doc B). Batch-level labels (“is any hit useful?”) are faster and sufficient when k is small (3–5). Harbor Analytics uses per-chunk cross-encoder scores with batch fallback: if all scores fall below 0.40, trigger keyword search regardless of individual ranks.
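
A per-chunk grader with a batch-level override, in the spirit of the Harbor Analytics rule, might look like the sketch below. The 0.40 batch threshold comes from the text. The per-chunk keep cut-off of 0.45 and the function name `route` are assumptions made for illustration.

```python
from typing import List, Tuple

BATCH_FALLBACK_BELOW = 0.40  # batch rule quoted in the text
KEEP_AT_OR_ABOVE = 0.45      # hypothetical per-chunk cut-off; tune on your data

def route(scored: List[Tuple[str, float]]) -> Tuple[str, List[str]]:
    """Return ("fallback", []) or ("generate", kept_chunks).

    scored is a list of (chunk, cross_encoder_score) pairs.
    """
    # Batch rule: if every score is below the floor, go to keyword search.
    if not scored or all(score < BATCH_FALLBACK_BELOW for _, score in scored):
        return "fallback", []

    # Per-chunk rule: keep the good hits, best first, and drop the rest.
    ranked = sorted(scored, key=lambda pair: pair[1], reverse=True)
    kept = [chunk for chunk, score in ranked if score >= KEEP_AT_OR_ABOVE]

    # Scores stuck between the two thresholds leave nothing worth keeping.
    if not kept:
        return "fallback", []
    return "generate", kept
```

With this shape, one strong chunk among several weak ones survives, while a uniformly weak batch is discarded as a whole.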

Knowledge refinement: salvage ambiguous retrieval

When retrieval is ambiguous — chunks mention the topic but bury the answer in noise — CRAG can run knowledge refinement before generation. Two patterns:

  • Strip irrelevant sentences: an LLM or extractive model marks the sentences that support answering the query, and only those marked sentences enter the prompt. This reduces the contradictory context that causes hedged wrong answers.
  • Decompose-then-retrieve: split the question into atomic sub-queries, retrieve per sub-query, and merge the results. This overlaps with query decomposition in agentic pipelines but stays retrieval-focused, without open-ended tool planning.
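
The strip pattern can be illustrated with a toy extractive filter. The sketch below is a stand-in for the LLM or extractive model: it keeps a sentence only when it shares at least `min_overlap` non-stopword terms with the query. The stopword list, the regex sentence splitter, and the overlap rule are all simplifying assumptions.

```python
import re
from typing import Set

# Tiny illustrative stopword list; a real system would use a fuller one.
STOPWORDS = {"the", "a", "an", "is", "are", "for", "in", "of",
             "what", "and", "to", "on", "from"}

def terms(text: str) -> Set[str]:
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower())
            if t not in STOPWORDS}

def strip_irrelevant(query: str, chunk: str, min_overlap: int = 2) -> str:
    """Keep only sentences that share enough terms with the query."""
    query_terms = terms(query)
    # Naive split on sentence-ending punctuation followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", chunk.strip())
    kept = [s for s in sentences if len(terms(s) & query_terms) >= min_overlap]
    return " ".join(kept)
```

Lexical overlap misses paraphrases, which is why the text suggests a model for this step. The sketch only shows where the filter sits and what it returns.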

Refinement costs an extra LLM call per ambiguous batch. Gate it: only run when at least one chunk scores in the ambiguous band and no chunk scores correct. If one chunk is clearly correct, skip refinement and pass the top hit through.
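
The gate described above fits in a few lines. In this sketch, each graded hit is a `(chunk, label, score)` tuple. The tuple layout and the action names are assumptions chosen for illustration.

```python
from typing import List, Tuple

Graded = Tuple[str, str, float]  # (chunk, label, score); hypothetical layout

def plan(graded: List[Graded]) -> Tuple[str, List[str]]:
    """Decide whether to pass through, refine, or fall back."""
    correct = [g for g in graded if g[1] == "correct"]
    if correct:
        # A clearly correct chunk exists: skip refinement, pass the top hit.
        top = max(correct, key=lambda g: g[2])
        return "pass_through", [top[0]]

    ambiguous = [g[0] for g in graded if g[1] == "ambiguous"]
    if ambiguous:
        # Ambiguous signal and nothing correct: pay for the refinement call.
        return "refine", ambiguous

    # Nothing usable: this batch belongs to the incorrect branch.
    return "fallback", []
```

Because the correct check runs first, the extra LLM call is only spent when no cheaper path exists.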

Fallback retrieval paths

The incorrect branch is where CRAG earns its name. Fallback options, in rough order of control:

  • BM25 / keyword search on the same corpus — recovers exact SKU codes, metric names, and legal citations embeddings miss. Pair with hybrid search at index time so the fallback index already exists.
  • Secondary dense index — different embedding model or chunking strategy (e.g. sentence-window index when paragraph chunks fail).
  • Metadata-filtered re-query — broaden filters (product line, date range) when the first pass over-constrained results.
  • Approved web search — only for public docs outside the corpus; requires human-in-the-loop or domain allowlists for enterprise deployments.
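
To make the first option concrete, here is a small in-memory Okapi BM25 index in pure Python. It is a teaching sketch rather than a production retriever: the tokenizer, the class name `BM25Index`, and the parameter defaults (`k1=1.5`, `b=0.75`) are assumptions, and a real deployment would more likely use the lexical side of an existing search engine.

```python
import math
import re
from collections import Counter
from typing import List, Tuple

def tokenize(text: str) -> List[str]:
    """Lowercase tokens; underscores are kept so metric names stay whole."""
    return re.findall(r"[a-z0-9_]+", text.lower())

class BM25Index:
    """Minimal Okapi BM25 over an in-memory list of documents."""

    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75):
        self.docs = docs
        self.k1, self.b = k1, b
        self.tfs = [Counter(tokenize(d)) for d in docs]
        self.lengths = [sum(tf.values()) for tf in self.tfs]
        total = sum(self.lengths)
        self.avgdl = total / len(docs) if total else 1.0
        # Document frequency: number of documents containing each term.
        self.df = Counter(term for tf in self.tfs for term in tf)

    def idf(self, term: str) -> float:
        n = self.df.get(term, 0)
        # Non-negative IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)).
        return math.log(1 + (len(self.docs) - n + 0.5) / (n + 0.5))

    def search(self, query: str, k: int = 5) -> List[Tuple[str, float]]:
        """Return up to k (document, score) pairs with a positive score."""
        q_terms = tokenize(query)
        scored = []
        for i, tf in enumerate(self.tfs):
            norm = self.k1 * (1 - self.b + self.b * self.lengths[i] / self.avgdl)
            score = sum(
                self.idf(t) * tf[t] * (self.k1 + 1) / (tf[t] + norm)
                for t in q_terms if t in tf
            )
            if score > 0:
                scored.append((self.docs[i], score))
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]
```

Exact tokens such as metric names and SKU fragments dominate the score here, which is the property the fallback relies on when embeddings blur them together.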

Log which branch fired per query. If incorrect triggers more than 25% of traffic, the primary retriever or chunking strategy needs fixing — CRAG is a safety net, not a substitute for index quality.
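
Branch telemetry can start as a counter with a single derived signal. The sketch below uses the 25% figure from the paragraph above as its default alert threshold. The class and method names are illustrative.

```python
from collections import Counter

class BranchStats:
    """Counts which CRAG branch fired and flags a high incorrect rate."""

    def __init__(self, incorrect_alert_rate: float = 0.25):
        self.counts: Counter = Counter()
        self.incorrect_alert_rate = incorrect_alert_rate

    def record(self, branch: str) -> None:
        """Call once per query with "correct", "ambiguous", or "incorrect"."""
        self.counts[branch] += 1

    def rate(self, branch: str) -> float:
        total = sum(self.counts.values())
        return self.counts[branch] / total if total else 0.0

    def needs_index_review(self) -> bool:
        # True when the incorrect branch fires on more than the allowed share.
        return self.rate("incorrect") > self.incorrect_alert_rate
```

In production these counts would normally go to your metrics system. The in-memory version only shows which number to watch.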

Harbor Analytics refactor (worked example)

Before CRAG: single dense index, 768-token chunks, top-5 retrieval, no reranker. Metric-intent context precision 52%; wrong-metric answers 31%; p95 retrieval 94 ms.

After CRAG (cross-encoder grade on top-8, refine ambiguous, BM25 fallback on all-incorrect):

  • Context precision on metric probes: 52% → 79%
  • Wrong-metric answers: 31% → 8%
  • Fallback (BM25) triggered: 18% of metric queries (mostly SKU-heavy)
  • Refinement triggered: 11% of queries (ambiguous band only)
  • p95 end-to-end retrieval path: 94 ms → 142 ms (+48 ms, from grading plus occasional fallback)
  • Generator input tokens: 3.8k → 2.4k average after strip refinement
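
For comparison across metrics, the before/after figures above can be restated as relative changes. The derived percentages below are not reported in the source. They follow from simple arithmetic on the listed numbers.

```python
def relative_change(before: float, after: float) -> float:
    """Signed fractional change from before to after."""
    return (after - before) / before

wrong_metric = relative_change(31, 8)    # wrong-metric answer rate, in %
input_tokens = relative_change(3.8, 2.4) # generator input tokens, in thousands
p95_latency = relative_change(94, 142)   # p95 retrieval path, in ms
precision_gain = 79 - 52                 # context precision, percentage points

print(f"wrong-metric answers: {wrong_metric:+.1%}")    # -74.2%
print(f"generator input tokens: {input_tokens:+.1%}")  # -36.8%
print(f"p95 retrieval latency: {p95_latency:+.1%}")    # +51.1%
print(f"context precision: +{precision_gain} points")  # +27 points
```

Read together, the retrieval path got roughly half again slower at p95 while wrong-metric answers dropped by about three quarters.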

The win was catching confidently wrong context before synthesis. Reranking alone had been tried earlier: it reordered bad chunks but rarely discarded an entire irrelevant batch. CRAG’s incorrect branch was the missing piece for metric-definition questions.

Technique decision table

| Approach | Best when | Weak when |
| --- | --- | --- |
| Reranking only | Relevant docs exist in top-20; ordering is the main problem | Top hits are uniformly off-topic; no salvageable signal |
| Corrective RAG (CRAG) | Noisy retrieval with clear fallback; need discard + re-search without full agents | Corpus is tiny; evaluator adds latency with no branch diversity |
| Hybrid search at retrieve | Lexical exact-match matters (SKUs, statutes, error codes) | Fallback redundancy if hybrid already runs; CRAG may duplicate BM25 path |
| Self-RAG | Need generation-time faithfulness grading and iterative critique | Budget-sensitive; multiple LLM passes per answer |
| Full agentic RAG (ReAct) | Multi-hop reasoning, SQL, graph tools, dynamic planning | Simple FAQ; cost and latency unpredictable on edge cases |
| Query expansion only | Vocabulary mismatch; paraphrases improve recall | Retrieved chunks are wrong topic, not wrong wording |

CRAG stacks cleanly: hybrid first-stage retrieve, cross-encoder grade, CRAG branch, then optional NLI faithfulness check on the final answer. You do not need a ReAct agent to get most of the garbage-in protection.

Common pitfalls

  • Evaluator thresholds copied from papers — CRAG paper defaults do not match your chunk length or domain; tune on labeled failures.
  • Grading after generation — defeats the purpose; grade retrieval before the expensive synthesis call.
  • No fallback index — incorrect branch re-runs the same failed vector search; wire a distinct lexical or secondary path.
  • Refinement on every query — adds cost; gate on ambiguous band only.
  • Web search without allowlists — enterprise CRAG must not leak internal query text to the open web.
  • Ignoring branch telemetry — if incorrect fires constantly, fix chunking or embeddings instead of masking with fallback.
  • Batch grading hides one good hit — one correct chunk among five wrong ones needs per-chunk labels, not batch discard.

Production checklist

  • Define evaluator tiers (correct / ambiguous / incorrect) with score rubric or model prompt.
  • Tune thresholds on a golden set with known retrieval failures, not only answer correctness.
  • Implement at least one fallback path distinct from primary dense search.
  • Gate knowledge refinement on ambiguous-only batches.
  • Log branch label, fallback trigger rate, and refinement rate per query.
  • Alert when incorrect-branch rate exceeds baseline (index drift or embedding regression).
  • Cap fallback latency (time out BM25/web search at 200 ms; abstain if the timeout expires).
  • Run offline eval comparing CRAG vs rerank-only on the same probe set.
  • Document abstain behavior when all branches fail (do not hallucinate).
  • Review CRAG traces weekly on wrong-answer tickets from support.
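
The latency cap and abstain items in the checklist can be combined in one wrapper. The sketch below assumes a hypothetical blocking `search` callable and uses the 200 ms budget from the checklist as its default. It returns `None` on timeout so the caller can abstain instead of generating from nothing.

```python
import concurrent.futures
from typing import Callable, List, Optional

def fallback_with_timeout(
    search: Callable[[str], List[str]],
    query: str,
    timeout_s: float = 0.2,
) -> Optional[List[str]]:
    """Run a fallback search with a hard wait limit.

    Returns the hits, or None if the search did not finish in time.
    Exceptions raised by search itself propagate to the caller.
    """
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    future = pool.submit(search, query)
    try:
        return future.result(timeout=timeout_s)
    except concurrent.futures.TimeoutError:
        return None  # caller should abstain rather than guess
    finally:
        # Do not block on a slow search; the worker thread finishes on its own.
        pool.shutdown(wait=False)
```

The wrapper limits how long the caller waits but does not cancel the underlying search, so a backend with its own request timeout remains the safer place to enforce the budget.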

Key takeaways

  • CRAG grades first-pass retrieval and branches: keep, refine, or discard and re-search.
  • Cross-encoder tiers are the default evaluator; tune thresholds on your failure set, not paper defaults.
  • Knowledge refinement salvages ambiguous batches; incorrect batches need a real fallback index.
  • Harbor Analytics cut wrong-metric answers from 31% to 8% with +48 ms p95 — cheaper than full agent loops.
  • Stack CRAG under agentic RAG for retrieval hygiene; use Self-RAG when generation-time critique is the bottleneck.
