LLM Self-RAG explained
Harbor Legal’s employee policy assistant indexed 3,800 HR handbook pages, state addenda, and benefits summaries. Lawyers asked precise questions: “Can contractors in California accrue PTO during a 30-day notice period?” Dense retrieval returned generic PTO pages that never mentioned contractors. The generator filled gaps from parametric memory and sounded authoritative. On 95 compliance probes, unsupported or contradicted claims appeared in 28% of answers even when the correct addendum existed in the corpus.
The team deployed Self-RAG: the same LLM that answers also emits reflection tokens that grade whether to retrieve, whether passages support the draft, and whether the draft is useful. Unsupported segments trigger re-retrieval or abstention instead of shipping confident hallucinations. Unsupported-claim rate fell to 7%; answer faithfulness on the golden set rose from 64% to 89%. This guide covers reflection-token design, the retrieve–generate–critique loop, inference-time branching, the Harbor Legal refactor, a technique decision table versus Corrective RAG and plain RAG, pitfalls, and a production checklist.
What Self-RAG adds beyond one-shot RAG
Standard RAG retrieves once and generates once. Failures happen in two places: bad retrieval (wrong chunks) and bad generation (the model ignores or extrapolates beyond context). Post-hoc NLI faithfulness checks catch some errors but add a separate model and still waste a full synthesis call on doomed context.
Self-RAG unifies control in the generator through reflection tokens — special outputs the model learns to emit at decision points. At inference time the runtime parses these tokens and branches:
- Retrieve or not — skip retrieval when parametric knowledge is sufficient (saves latency on definitional FAQs).
- Relevance (IsREL) — grade each retrieved passage before synthesis.
- Support (IsSUP) — grade whether each sentence in the draft is grounded in cited passages.
- Usefulness (IsUSE) — grade whether the overall answer addresses the user query (catches evasive or incomplete responses).
The loop can iterate: retrieve, draft, critique support, re-retrieve with a refined query, draft again. Unlike open-ended agentic RAG, Self-RAG keeps branching on a fixed token vocabulary — easier to cap cost and reason about than free-form ReAct planning.
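A fixed token vocabulary is what makes the runtime easy to reason about. The sketch below shows one way a runtime might parse reflection tokens out of model output. The bracketed label strings follow the ones used in this guide; the names `Reflection`, `parse_reflections`, and `strip_reflections` are hypothetical, and a fine-tuned model would more likely emit dedicated special tokens than bracketed text.

```python
import re
from enum import Enum


class Reflection(Enum):
    # Label strings mirror this guide; the exact surface form is an assumption.
    RETRIEVE = "Retrieve"
    NO_RETRIEVE = "No Retrieve"
    RELEVANT = "Relevant"
    IRRELEVANT = "Irrelevant"
    FULLY_SUPPORTED = "Fully supported"
    PARTIALLY_SUPPORTED = "Partially supported"
    NO_SUPPORT = "No support"
    USEFUL = "Useful"
    NOT_USEFUL = "Not useful"


_TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")
_BY_LABEL = {r.value.lower(): r for r in Reflection}


def parse_reflections(text: str) -> list:
    """Return known reflection tokens in order of appearance; ignore other brackets."""
    found = []
    for match in _TOKEN_RE.finditer(text):
        token = _BY_LABEL.get(match.group(1).strip().lower())
        if token is not None:
            found.append(token)
    return found


def strip_reflections(text: str) -> str:
    """Remove known reflection tokens so they never reach the user."""
    def _drop(match):
        known = match.group(1).strip().lower() in _BY_LABEL
        return "" if known else match.group(0)

    return re.sub(r"\s{2,}", " ", _TOKEN_RE.sub(_drop, text)).strip()
```

Unknown bracketed spans, such as citation IDs like `[c12]`, pass through untouched, so the same output string can carry both control tokens and citations.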
Reflection tokens and the control loop
The Self-RAG paper formalizes a retrieval decision plus three critique dimensions (IsREL, IsSUP, IsUSE). Production systems map them to discrete labels the runtime acts on:
Retrieval decision
Before the first retrieval, the model outputs [Retrieve] or [No Retrieve]. Use No Retrieve only when you have verified on a holdout set that parametric answers are safe for that intent class (e.g. “What does PTO stand for?”). For regulated domains, always retrieve by default and treat No Retrieve as an opt-in optimization.
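One way to enforce that policy is to let the runtime, not the model, have the final say. The following is a minimal sketch under assumptions: the intent names and the sets `VERIFIED_SAFE_INTENTS` and `REGULATED_INTENTS` are hypothetical, and intent classification is assumed to happen upstream.

```python
# Hypothetical intent classes; replace with your own taxonomy.
VERIFIED_SAFE_INTENTS = {"glossary"}  # passed a holdout check for parametric answers
REGULATED_INTENTS = {"compliance", "benefits_eligibility"}


def should_retrieve(intent: str, model_token: str) -> bool:
    """Decide whether to retrieve, treating [No Retrieve] as an opt-in optimization."""
    if intent in REGULATED_INTENTS:
        return True  # never skip retrieval on regulated paths
    if intent not in VERIFIED_SAFE_INTENTS:
        return True  # unverified intents default to retrieval
    # Only verified-safe intents may honor the model's request to skip.
    return model_token.strip() != "[No Retrieve]"
```

With this shape, a model that emits [No Retrieve] on a compliance question is simply overruled.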
IsREL: passage relevance
For each chunk in top-k, the model emits [Relevant] or [Irrelevant]. Irrelevant chunks are dropped before generation. This overlaps with cross-encoder grading in CRAG but uses the same LLM that will synthesize, which is helpful when relevance requires nuanced legal or medical reading, not just embedding similarity.
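The filtering step can be sketched as a function that takes the grader as a callable, so the same code works whether the label comes from the generator or from a stand-in. `Chunk`, `filter_relevant`, and `keyword_grader` are hypothetical names; `keyword_grader` is a toy substitute for the LLM call and is not how relevance should be judged in production.

```python
from typing import Callable, NamedTuple


class Chunk(NamedTuple):
    chunk_id: str
    text: str


def filter_relevant(query: str, chunks: list, grade: Callable[[str, str], str]) -> list:
    """Keep only chunks the grader labels [Relevant]; anything else is dropped."""
    kept = []
    for chunk in chunks:
        label = grade(query, chunk.text).strip()
        if label == "[Relevant]":
            kept.append(chunk)
    return kept


def keyword_grader(query: str, passage: str) -> str:
    """Toy stand-in for the LLM call: relevant if the passage mentions 'contractor'."""
    return "[Relevant]" if "contractor" in passage.lower() else "[Irrelevant]"
```

Treating any label other than [Relevant] as a drop means a malformed grader reply removes the chunk rather than letting it through.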
IsSUP: grounded support
After a draft sentence, the model grades [Fully supported], [Partially supported], or [No support] against the retained passages. Partially supported triggers sentence rewrite or citation trim; no support triggers re-retrieve or abstain. This is the core anti-hallucination lever for Harbor Legal.
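The label-to-action mapping described above might look like the sketch below. Two details are assumptions rather than something the source specifies: the choice between re-retrieve and abstain is made by a retrieval budget (`max_iterations`), and unrecognized labels are treated as no support so the system fails closed. `Action` and `next_action` are hypothetical names.

```python
from enum import Enum


class Action(Enum):
    KEEP = "keep"
    REWRITE = "rewrite_or_trim_citation"
    RE_RETRIEVE = "re_retrieve"
    ABSTAIN = "abstain"


def next_action(support_label: str, iterations_used: int, max_iterations: int = 2) -> Action:
    """Map an IsSUP label to a runtime action.

    iterations_used counts retrievals already performed for this query.
    """
    label = support_label.strip()
    if label == "[Fully supported]":
        return Action.KEEP
    if label == "[Partially supported]":
        return Action.REWRITE
    # "[No support]" and any unrecognized label both land here (fail closed).
    if iterations_used < max_iterations:
        return Action.RE_RETRIEVE
    return Action.ABSTAIN
```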
IsUSE: answer quality
A final [Useful] / [Not useful] check runs on the complete answer. Not useful loops back with a paraphrased query or escalates to human review. Pair with RAG evaluation metrics so usefulness thresholds match product SLAs.
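A final gate could combine the usefulness label with the per-sentence support labels, so a fluent but ungrounded answer never ships on usefulness alone. This is a sketch under assumptions: `final_gate` and its return strings are hypothetical, and the rule that an ungrounded answer abstains at this stage assumes re-retrieval was already attempted earlier in the loop.

```python
def final_gate(use_label: str, support_labels: list, retries_left: int) -> str:
    """Decide what happens to a complete draft.

    Returns one of: "ship", "abstain", "retry_with_paraphrase", "escalate_to_human".
    """
    # An answer with no graded sentences is treated as ungrounded.
    grounded = bool(support_labels) and all(
        label == "[Fully supported]" for label in support_labels
    )
    if not grounded:
        return "abstain"
    if use_label.strip() == "[Useful]":
        return "ship"
    # Grounded but not useful: try a paraphrased query, then hand off.
    return "retry_with_paraphrase" if retries_left > 0 else "escalate_to_human"
```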
Training vs inference-only Self-RAG
Full Self-RAG as published includes supervised fine-tuning on synthetic trajectories: a teacher model generates (query, passage, reflection labels, answer) tuples. The student learns to emit reflection tokens without a separate critic model at runtime.
Many teams start with inference-only approximations before fine-tuning:
- Structured critique prompts — same LLM, separate JSON-schema calls for IsREL/IsSUP/IsUSE after each stage. Higher latency, no weight update required.
- Adapter critics — small LoRA heads on a frozen base for each reflection type; cheaper than full SFT.
- Hybrid with NLI — cross-encoder for IsREL, LLM for IsSUP on sentences NLI flags as weak. Cuts token cost on long drafts.
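For the structured-critique variant, the runtime has to build a prompt per reflection type and validate the JSON that comes back. The sketch below uses only the standard library. The schema (a single `label` field), the lowercase label names, and the function names are all assumptions for illustration; the actual LLM call is left out.

```python
import json

# Hypothetical label sets per critique type.
ALLOWED = {
    "is_rel": {"relevant", "irrelevant"},
    "is_sup": {"fully_supported", "partially_supported", "no_support"},
    "is_use": {"useful", "not_useful"},
}


def build_critique_prompt(kind: str, query: str, passage: str, draft: str = "") -> str:
    """Build a critique prompt that asks for a JSON-only reply."""
    labels = sorted(ALLOWED[kind])
    return (
        f"Task: {kind}\nQuery: {query}\nPassage: {passage}\nDraft: {draft}\n"
        f'Reply with JSON only: {{"label": one of {labels}}}'
    )


def parse_critique(kind: str, raw: str) -> str:
    """Validate a JSON critique reply; raise ValueError on anything off-schema."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"critique reply is not JSON: {exc}") from exc
    label = payload.get("label") if isinstance(payload, dict) else None
    if label not in ALLOWED[kind]:
        raise ValueError(f"unexpected {kind} label: {label!r}")
    return label
```

Raising on off-schema replies lets the caller decide whether to retry the critique call or fail closed, instead of silently accepting a label the runtime does not understand.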
Harbor Legal shipped inference-only structured critique first (two weeks), then fine-tuned an 8B adapter on 12,000 labeled trajectories from lawyer-reviewed traces. Faithfulness gained another 6 points; p95 latency dropped 22% because single-pass reflection replaced three serial critique calls.
Harbor Legal refactor (worked example)
Before Self-RAG: dense index, top-5 retrieval, single-shot GPT-4-class synthesis, optional post-hoc NLI on the full answer. Compliance probe results:
- Answer faithfulness (human rubric): 64%
- Unsupported or contradicted claims: 28%
- Abstain rate on unanswerable questions: 4% (too low — model guessed)
- p95 end-to-end latency: 2.1 s
After Self-RAG (always retrieve, IsREL on top-8, draft with citations, IsSUP per sentence, max two retrieve iterations, abstain if no fully supported path):
- Faithfulness: 64% → 89%
- Unsupported claims: 28% → 7%
- Abstain on unanswerable: 4% → 41% (intentional — lawyers preferred “not in handbook” over invention)
- Re-retrieve triggered: 19% of queries (mostly after IsSUP failure)
- p95 latency: 2.1 s → 3.4 s (+1.3 s; acceptable for internal legal tool)
- Generator cost: +68% tokens (critique passes); offset partly by dropping irrelevant chunks before synthesis
CRAG had been tried earlier: it fixed wrong-topic retrieval but left right-topic, wrong-sentence hallucinations when the model blended handbook text with outdated training knowledge. Self-RAG’s IsSUP loop targeted that gap.
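The post-refactor configuration (always retrieve, IsREL on top-8, IsSUP per sentence, at most two retrieve iterations, abstain if no fully supported path) can be sketched as a bounded loop. This is not Harbor Legal's code: every callable passed in (`retrieve`, `is_relevant`, `draft`, `support_label`, `refine`) is a hypothetical stand-in for a retriever or LLM call, and `ABSTAIN_TEXT` is placeholder copy.

```python
from dataclasses import dataclass, field

ABSTAIN_TEXT = "Not found in the handbook."  # hypothetical abstain copy


@dataclass
class Result:
    answer: str
    abstained: bool
    retrievals: int
    trace: list = field(default_factory=list)


def self_rag_answer(query, retrieve, is_relevant, draft, support_label,
                    refine=lambda q, weak: q, top_k=8, max_retrievals=2):
    """Retrieve, filter, draft, and grade support, with a hard cap on iterations."""
    trace = []
    current_query = query
    for attempt in range(1, max_retrievals + 1):
        # IsREL: keep only chunks judged relevant to the original question.
        chunks = [c for c in retrieve(current_query, top_k) if is_relevant(query, c)]
        sentences = draft(query, chunks) if chunks else []
        # IsSUP: one label per draft sentence.
        labels = [support_label(s, chunks) for s in sentences]
        trace.append({"attempt": attempt, "kept_chunks": len(chunks), "is_sup": labels})
        if sentences and all(label == "fully_supported" for label in labels):
            return Result(" ".join(sentences), False, attempt, trace)
        # Use the weak sentences to refine the next retrieval query.
        weak = [s for s, label in zip(sentences, labels) if label != "fully_supported"]
        current_query = refine(query, weak)
    # No fully supported path within budget: abstain instead of guessing.
    return Result(ABSTAIN_TEXT, True, max_retrievals, trace)
```

The loop never returns a draft that contains a sentence graded below fully supported, which is what drives the higher abstain rate reported above.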
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Plain RAG + rerank | Retrieval quality is high; hallucination risk is low | Model extrapolates beyond context; regulated domains |
| Corrective RAG (CRAG) | First-pass retrieval is noisy; need discard and fallback search | Retrieved passages are on-topic but generation ignores them |
| Self-RAG | Need sentence-level grounding and iterative critique; abstain is acceptable | Strict latency budget; cannot afford 2–3 LLM passes per query |
| Post-hoc NLI only | Single-pass latency required; failures can be retried asynchronously | Users see hallucinations before the check fires; no mid-draft correction |
| Full agentic RAG | Multi-hop tools, SQL, graphs; open-ended planning | FAQ with faithfulness as the main failure mode; cost unpredictable |
Stack Self-RAG after CRAG hygiene: CRAG cleans retrieval batches; Self-RAG grades drafts. Running both adds latency — Harbor Legal uses CRAG-style cross-encoder pre-filter only when IsREL scores from the base model disagree with embedding ranks.
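The source does not say how disagreement between IsREL and embedding ranks is measured. One simple possibility, shown purely as an assumption, is to count how many of the top embedding-ranked chunks IsREL rejected and run the cross-encoder only when that count exceeds a threshold. `needs_cross_encoder`, `top_n`, and `max_rejected` are hypothetical.

```python
def needs_cross_encoder(isrel_in_rank_order: list, top_n: int = 3, max_rejected: int = 1) -> bool:
    """Return True when IsREL rejects too many of the top embedding-ranked chunks.

    isrel_in_rank_order: booleans (True = judged relevant), best embedding rank first.
    """
    rejected = sum(1 for relevant in isrel_in_rank_order[:top_n] if not relevant)
    return rejected > max_rejected
```

Under this rule the extra pre-filter cost is paid only on queries where the two signals visibly conflict.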
Common pitfalls
- Reflection without abstain policy — IsSUP failures that loop forever or still ship the last draft; cap iterations and default to “cannot verify.”
- No Retrieve on compliance paths — parametric memory violates policy; disable No Retrieve for regulated intents.
- Critique on the full corpus — IsREL must run on top-k only; grading hundreds of chunks per query is cost-prohibitive.
- Partial support treated as pass — legal and medical products should rewrite or drop partially supported sentences, not footnote them.
- Missing citation anchors — IsSUP needs sentence-to-passage alignment; generate with inline chunk IDs from retrieval metadata.
- SFT on synthetic labels only — reflection tokens drift from production failure modes; mix lawyer- or SME-labeled traces.
- Ignoring usefulness false positives — fluent but wrong answers can score Useful; combine IsUSE with IsSUP, not either alone.
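For the citation-anchor pitfall, a draft that carries inline chunk IDs can be aligned to its sources before IsSUP runs. The sketch below assumes an ID format like `[c12]` placed before the sentence-ending punctuation, and uses a naive sentence splitter that will mis-split abbreviations such as "e.g."; both are illustrative assumptions, as is the name `align_sentences`.

```python
import re

_CITATION_RE = re.compile(r"\[(c\d+)\]")           # assumed chunk-ID format
_SENTENCE_SPLIT_RE = re.compile(r"(?<=[.!?])\s+")  # naive splitter


def align_sentences(draft: str, passages: dict) -> list:
    """Map each draft sentence to the passages it cites.

    passages: chunk ID -> passage text, taken from retrieval metadata.
    """
    aligned = []
    for sentence in _SENTENCE_SPLIT_RE.split(draft.strip()):
        if not sentence:
            continue
        ids = _CITATION_RE.findall(sentence)
        aligned.append({
            "sentence": sentence,
            "cited": [passages[i] for i in ids if i in passages],
            "unknown_ids": [i for i in ids if i not in passages],
            "uncited": not ids,
        })
    return aligned
```

Sentences flagged `uncited` or carrying `unknown_ids` can be routed straight to rewrite or removal without spending an IsSUP call on them.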
Production checklist
- Define reflection vocabulary (Retrieve, IsREL, IsSUP, IsUSE) and runtime parser.
- Set max retrieve iterations (typically 2) and abstain copy for ungrounded paths.
- Require chunk IDs in drafts so IsSUP can map sentences to sources.
- Build a golden set with unsupported-claim labels, not only answer correctness.
- Log each reflection branch, iteration count, and abstain reason per query.
- Alert when re-retrieve rate spikes (index drift or embedding regression).
- Compare p50/p95 latency and token cost vs plain RAG on the same probe set.
- Disable No Retrieve for high-risk intent classes until verified safe.
- Review abstained and partially supported traces weekly with domain experts.
- Document escalation path when all iterations fail (human ticket, not silent guess).
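The alerting item in the checklist can be sketched as a rolling-window monitor. The 19% baseline comes from the Harbor Legal figures above; the window size, the tolerance, and the class name `ReRetrieveMonitor` are assumptions to tune against your own traffic.

```python
from collections import deque


class ReRetrieveMonitor:
    """Track the re-retrieve rate over the most recent queries."""

    def __init__(self, window: int = 500, baseline: float = 0.19, tolerance: float = 0.10):
        self.events = deque(maxlen=window)  # True = query triggered a re-retrieve
        self.baseline = baseline
        self.tolerance = tolerance

    def record(self, re_retrieved: bool) -> None:
        self.events.append(bool(re_retrieved))

    def rate(self) -> float:
        return sum(self.events) / len(self.events) if self.events else 0.0

    def should_alert(self) -> bool:
        # Wait for a full window so a handful of early queries cannot trigger an alert.
        window_full = len(self.events) == self.events.maxlen
        return window_full and self.rate() > self.baseline + self.tolerance
```

A sustained rate well above baseline is worth investigating as index drift or an embedding regression before it shows up as a higher abstain rate.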
Key takeaways
- Self-RAG teaches the generator to critique its own retrieval and drafts via reflection tokens.
- IsREL filters chunks; IsSUP grounds sentences; IsUSE gates shipping — use all three in regulated domains.
- CRAG fixes bad retrieval; Self-RAG fixes bad generation — they address different failure modes.
- Harbor Legal cut unsupported claims from 28% to 7% with +1.3 s p95 and higher abstain on unanswerables.
- Start with structured critique prompts; fine-tune reflection adapters when trace volume justifies it.
Related reading
- Corrective RAG explained — grade-then-branch retrieval before synthesis
- Agentic RAG explained — multi-step retrieval and ReAct orchestration
- NLI faithfulness for RAG — entailment grading and claim verification
- RAG evaluation explained — faithfulness metrics and golden QA sets