LLM Self-RAG explained

Harbor Legal’s employee policy assistant indexed 3,800 HR handbook pages, state addenda, and benefits summaries. Lawyers asked precise questions: “Can contractors in California accrue PTO during a 30-day notice period?” Dense retrieval returned generic PTO pages that never mentioned contractors. The generator filled gaps from parametric memory and sounded authoritative. On 95 compliance probes, unsupported or contradicted claims appeared in 28% of answers even when the correct addendum existed in the corpus.

The team deployed Self-RAG: the same LLM that answers also emits reflection tokens that grade whether to retrieve, whether passages support the draft, and whether the draft is useful. Unsupported segments trigger re-retrieval or abstention instead of shipping confident hallucinations. Unsupported-claim rate fell to 7%; answer faithfulness on the golden set rose from 64% to 89%. This guide covers reflection-token design, the retrieve–generate–critique loop, inference-time branching, the Harbor Legal refactor, a technique decision table versus Corrective RAG and plain RAG, pitfalls, and a production checklist.

What Self-RAG adds beyond one-shot RAG

Standard RAG retrieves once and generates once. Failures happen in two places: bad retrieval (wrong chunks) and bad generation (the model ignores or extrapolates beyond context). Post-hoc NLI faithfulness checks catch some errors but add a separate model and still waste a full synthesis call on doomed context.

Self-RAG unifies control in the generator through reflection tokens — special outputs the model learns to emit at decision points. At inference time the runtime parses these tokens and branches:

Retrieve or not — skip retrieval when parametric knowledge is sufficient (saves latency on definitional FAQs).
Relevance (IsREL) — grade each retrieved passage before synthesis.
Support (IsSUP) — grade whether each sentence in the draft is grounded in cited passages.
Usefulness (IsUSE) — grade whether the overall answer addresses the user query (catches evasive or incomplete responses).

The loop can iterate: retrieve, draft, critique support, re-retrieve with a refined query, draft again. Unlike open-ended agentic RAG, Self-RAG keeps branching on a fixed token vocabulary — easier to cap cost and reason about than free-form ReAct planning.
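As a rough illustration, the bounded loop can be sketched as below. This is a minimal sketch under stated assumptions, not the paper's or Harbor Legal's implementation: `retrieve`, `draft`, `grade_support`, and `refine` are hypothetical callables standing in for the retriever and LLM calls, and the `"fully_supported"` label string and the abstain message are placeholders.

```python
from dataclasses import dataclass
from typing import Callable, List

# Placeholder abstain message (assumption, not from the source system).
ABSTAIN = "Cannot verify from the indexed documents."


@dataclass
class LoopResult:
    answer: str
    iterations: int
    abstained: bool


def self_rag_loop(
    query: str,
    retrieve: Callable[[str], List[str]],
    draft: Callable[[str, List[str]], str],
    grade_support: Callable[[str, List[str]], str],
    refine: Callable[[str, str], str],
    max_iterations: int = 2,
) -> LoopResult:
    """Retrieve, draft, critique; re-retrieve with a refined query or abstain."""
    current_query = query
    for iteration in range(1, max_iterations + 1):
        passages = retrieve(current_query)
        answer = draft(query, passages)
        # Only a fully supported draft is allowed to leave the loop.
        if grade_support(answer, passages) == "fully_supported":
            return LoopResult(answer, iteration, False)
        # Otherwise build a refined retrieval query from the failed draft.
        current_query = refine(query, answer)
    # Iteration cap reached without a grounded draft: abstain, never guess.
    return LoopResult(ABSTAIN, max_iterations, True)
```

Capping `max_iterations` is what keeps cost predictable: the worst case is a fixed number of retrieve and draft calls followed by an abstention.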

Reflection tokens and the control loop

The Self-RAG paper formalizes a retrieval decision plus three critique dimensions (IsREL, IsSUP, IsUSE). Production systems map them to discrete labels the runtime acts on:

Retrieval decision

Before the first retrieve, the model outputs [Retrieve] or [No Retrieve]. Use No Retrieve only when you have verified on a holdout set that parametric answers are safe for that intent class (e.g. “What does PTO stand for?”). For regulated domains, default to always retrieve and treat No Retrieve as an optimization, not the default.
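One way to enforce that default is to give the runtime, not the model, the final say. The sketch below assumes a hypothetical `verified_safe` allowlist of intent classes that passed holdout review; the intent names in the usage lines are illustrative only.

```python
from typing import Set


def should_retrieve(model_token: str, intent: str, verified_safe: Set[str]) -> bool:
    """Honor [No Retrieve] only for intent classes verified safe on a holdout set."""
    if intent not in verified_safe:
        # Regulated or unverified intent: always retrieve, whatever the model says.
        return True
    return model_token != "No Retrieve"


# Hypothetical intent classes for illustration.
safe_intents = {"glossary"}
print(should_retrieve("No Retrieve", "compliance", safe_intents))  # True
print(should_retrieve("No Retrieve", "glossary", safe_intents))    # False
```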

IsREL: passage relevance

For each chunk in top-k, the model emits [Relevant] or [Irrelevant]. Irrelevant chunks are dropped before generation. This overlaps with cross-encoder grading in CRAG but uses the same LLM that will synthesize — helpful when relevance requires nuanced legal or medical reading, not just embedding similarity.
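The filtering step might look like the following sketch. The `Chunk` type and the `grade` callable are assumptions made for illustration; in practice `grade` would wrap the LLM call that emits [Relevant] or [Irrelevant].

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


def filter_relevant(
    query: str,
    chunks: List[Chunk],
    grade: Callable[[str, Chunk], str],
) -> List[Chunk]:
    """Keep only chunks the grader labels exactly "Relevant"."""
    # Fail closed: a malformed or unexpected label drops the chunk.
    return [chunk for chunk in chunks if grade(query, chunk) == "Relevant"]
```

Failing closed on unexpected labels is a design choice: a dropped chunk can be recovered by re-retrieval, while an ungraded chunk that reaches synthesis cannot be recalled.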

IsSUP: grounded support

After a draft sentence, the model grades [Fully supported], [Partially supported], or [No support] against the retained passages. Partially supported triggers sentence rewrite or citation trim; no support triggers re-retrieve or abstain. This is the core anti-hallucination lever for Harbor Legal.
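The branching rule described above can be written as a small pure function. This is a sketch: the enum names, the `retrievals_used` counter, and the choice to abstain on an empty draft are assumptions, not part of the published method.

```python
from enum import Enum
from typing import List


class Support(Enum):
    FULL = "Fully supported"
    PARTIAL = "Partially supported"
    NONE = "No support"


class Action(Enum):
    KEEP = "keep"
    REWRITE = "rewrite"
    RERETRIEVE = "re-retrieve"
    ABSTAIN = "abstain"


def next_action(labels: List[Support], retrievals_used: int, max_retrievals: int = 2) -> Action:
    """Map per-sentence IsSUP labels to the runtime's next step."""
    if not labels:
        # Assumption: an empty draft has nothing grounded to ship.
        return Action.ABSTAIN
    if Support.NONE in labels:
        # Unsupported sentence: re-retrieve while budget remains, else abstain.
        return Action.RERETRIEVE if retrievals_used < max_retrievals else Action.ABSTAIN
    if Support.PARTIAL in labels:
        # Partially supported: rewrite the sentence or trim its citation.
        return Action.REWRITE
    return Action.KEEP
```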

IsUSE: answer quality

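A final [Useful] / [Not useful] check on the complete answer. The paper scores usefulness on a five-point scale; a binary label is a production simplification. A Not useful verdict loops back with a paraphrased query or escalates to human review. Calibrate the threshold against your RAG evaluation metrics so it matches product SLAs.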
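A possible shipping gate that combines this check with the IsSUP result is sketched below. The function name, the returned action strings, and the `paraphrase_retries_left` counter are hypothetical.

```python
def ship_decision(all_sentences_supported: bool, useful: bool, paraphrase_retries_left: int) -> str:
    """Decide what happens to a completed answer."""
    if not all_sentences_supported:
        # Grounding failures are handled by the IsSUP branch, never shipped.
        return "block"
    if useful:
        return "ship"
    # Grounded but unhelpful: paraphrase the query, then hand off to a human.
    return "retry_paraphrase" if paraphrase_retries_left > 0 else "escalate"
```

Checking support before usefulness means a fluent answer that scores Useful still cannot ship if any sentence lacks grounding.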

Training vs inference-only Self-RAG

Full Self-RAG as published includes supervised fine-tuning on synthetic trajectories: a teacher model generates (query, passage, reflection labels, answer) tuples. The student learns to emit reflection tokens without a separate critic model at runtime.
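At runtime, single-pass output has to be split back into answer text and control labels. A minimal sketch of such a parser follows, assuming the bracketed label spellings used in this guide. A fine-tuned model would more likely use dedicated special tokens, so treat the `VOCAB` set and the `[c12]` citation format as illustrative.

```python
import re
from typing import List, Tuple

# Label spellings as used in this guide (assumed, not a published vocabulary).
VOCAB = {
    "Retrieve", "No Retrieve",
    "Relevant", "Irrelevant",
    "Fully supported", "Partially supported", "No support",
    "Useful", "Not useful",
}
TOKEN_RE = re.compile(r"\[([^\[\]]+)\]")


def split_reflection(text: str) -> Tuple[str, List[str]]:
    """Return (answer text without reflection labels, labels in emission order)."""
    tokens: List[str] = []

    def _strip(match: "re.Match[str]") -> str:
        name = match.group(1).strip()
        if name in VOCAB:
            tokens.append(name)
            return ""
        # Not a reflection label (for example a citation anchor): keep it.
        return match.group(0)

    clean = TOKEN_RE.sub(_strip, text)
    # Collapse the whitespace left behind by removed labels.
    clean = re.sub(r"\s{2,}", " ", clean).strip()
    return clean, tokens
```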

Many teams start with inference-only approximations before fine-tuning:

Structured critique prompts — same LLM, separate JSON-schema calls for IsREL/IsSUP/IsUSE after each stage. Higher latency, no weight update required.
Adapter critics — small LoRA heads on a frozen base for each reflection type; cheaper than full SFT.
Hybrid with NLI — cross-encoder for IsREL, LLM for IsSUP on sentences NLI flags as weak. Cuts token cost on long drafts.
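For the structured-critique variant, the runtime should validate each JSON reply before branching on it. The sketch below assumes a hypothetical reply shape of `{"label": "..."}` and snake_case label names; neither comes from the paper.

```python
import json
from typing import Dict, Set

# Assumed label sets for each critique call.
ALLOWED: Dict[str, Set[str]] = {
    "is_rel": {"relevant", "irrelevant"},
    "is_sup": {"fully_supported", "partially_supported", "no_support"},
    "is_use": {"useful", "not_useful"},
}


def parse_critique(raw: str, kind: str) -> str:
    """Return the validated label, or raise ValueError on malformed output."""
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"critique is not valid JSON: {exc}") from exc
    label = payload.get("label") if isinstance(payload, dict) else None
    if not isinstance(label, str) or label not in ALLOWED[kind]:
        raise ValueError(f"unexpected {kind} label: {label!r}")
    return label
```

Raising on anything unexpected lets the caller treat a malformed critique like a failed one, for example by retrying the call or abstaining.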

Harbor Legal shipped inference-only structured critique first (two weeks), then fine-tuned an 8B adapter on 12,000 labeled trajectories from lawyer-reviewed traces. Faithfulness gained another 6 points; p95 latency dropped 22% relative to the inference-only version because single-pass reflection replaced three serial critique calls.

Harbor Legal refactor (worked example)

Before Self-RAG: dense index, top-5 retrieval, single-shot GPT-4-class synthesis, optional post-hoc NLI on the full answer. Compliance probe results:

Answer faithfulness (human rubric): 64%
Unsupported or contradicted claims: 28%
Abstain rate on unanswerable questions: 4% (too low — model guessed)
p95 end-to-end latency: 2.1 s

After Self-RAG (always retrieve, IsREL on top-8, draft with citations, IsSUP per sentence, max two retrieve iterations, abstain if no fully supported path):

Faithfulness: 64% → 89%
Unsupported claims: 28% → 7%
Abstain on unanswerable: 4% → 41% (intentional — lawyers preferred “not in handbook” over invention)
Re-retrieve triggered: 19% of queries (mostly after IsSUP failure)
p95 latency: 2.1 s → 3.4 s (+1.3 s; acceptable for internal legal tool)
Generator cost: +68% tokens (critique passes); offset partly by dropping irrelevant chunks before synthesis

CRAG had been tried earlier: it fixed wrong-topic retrieval but left right-topic, wrong-sentence hallucinations when the model blended handbook text with outdated training knowledge. Self-RAG’s IsSUP loop targeted that gap.

Technique decision table

Approach	Best when	Weak when
Plain RAG + rerank	Retrieval quality is high; hallucination risk is low	Model extrapolates beyond context; regulated domains
Corrective RAG (CRAG)	First-pass retrieval is noisy; need discard and fallback search	Retrieved passages are on-topic but generation ignores them
Self-RAG	Need sentence-level grounding and iterative critique; abstain is acceptable	Strict latency budget; cannot afford 2–3 LLM passes per query
Post-hoc NLI only	Single-pass latency required; failures can be retried asynchronously	Users see hallucinations before the check fires; no mid-draft correction
Full agentic RAG	Multi-hop tools, SQL, graphs; open-ended planning	FAQ with faithfulness as the main failure mode; cost unpredictable

If you stack both, run CRAG first: CRAG cleans the retrieval batch and Self-RAG grades the draft. Running both adds latency, so Harbor Legal invokes the CRAG-style cross-encoder pre-filter only when the base model's IsREL labels disagree with the embedding ranks.
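The source does not say how disagreement is measured. One plausible trigger, offered purely as an assumption, is to count how many of the top embedding-ranked chunks IsREL rejected; `top_n` and `max_rejections` are hypothetical tuning knobs.

```python
from typing import List


def needs_cross_encoder(isrel_in_rank_order: List[bool], top_n: int = 3, max_rejections: int = 1) -> bool:
    """True when IsREL rejects too many of the retriever's highest-ranked chunks.

    isrel_in_rank_order[i] is True if the chunk at embedding rank i was
    labeled relevant by the IsREL check.
    """
    head = isrel_in_rank_order[:top_n]
    rejections = sum(1 for is_relevant in head if not is_relevant)
    return rejections > max_rejections
```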

Common pitfalls

Reflection without an abstain policy: IsSUP failures either loop forever or ship the last draft anyway; cap iterations and default to “cannot verify.”
No Retrieve on compliance paths: answering from parametric memory alone violates policy; disable No Retrieve for regulated intents.
Critique on the full corpus: IsREL must run on top-k only; grading hundreds of chunks per query is cost-prohibitive.
Partial support treated as a pass: legal and medical products should rewrite or drop partially supported sentences, not footnote them.
Missing citation anchors: IsSUP needs sentence-to-passage alignment; generate with inline chunk IDs from retrieval metadata.
SFT on synthetic labels only: reflection tokens drift from production failure modes; mix in lawyer- or SME-labeled traces.
Ignoring usefulness false positives: fluent but wrong answers can score Useful; combine IsUSE with IsSUP rather than relying on either alone.
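To make the citation-anchor pitfall concrete, the sketch below aligns draft sentences with inline chunk IDs so that uncited sentences can be flagged before IsSUP runs. The `[c12]` anchor format and the naive sentence splitter are assumptions; a production system would use a proper sentence segmenter.

```python
import re
from typing import List, Tuple

# Assumed inline anchor format: [c<number>], e.g. [c12].
CITE_RE = re.compile(r"\[(c\d+)\]")


def align_sentences(draft: str) -> List[Tuple[str, List[str]]]:
    """Pair each sentence with the chunk IDs cited inside it."""
    # Naive split on sentence-ending punctuation; breaks on abbreviations like "e.g.".
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", draft) if s.strip()]
    return [(sentence, CITE_RE.findall(sentence)) for sentence in sentences]


def uncited(draft: str) -> List[str]:
    """Sentences with no anchor; IsSUP has nothing to check them against."""
    return [sentence for sentence, ids in align_sentences(draft) if not ids]
```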

Production checklist

Define reflection vocabulary (Retrieve, IsREL, IsSUP, IsUSE) and runtime parser.
Set max retrieve iterations (typically 2) and abstain copy for ungrounded paths.
Require chunk IDs in drafts so IsSUP can map sentences to sources.
Build a golden set with unsupported-claim labels, not only answer correctness.
Log each reflection branch, iteration count, and abstain reason per query.
Alert when re-retrieve rate spikes (index drift or embedding regression).
Compare p50/p95 latency and token cost vs plain RAG on the same probe set.
Disable No Retrieve for high-risk intent classes until verified safe.
Review abstained and partially supported traces weekly with domain experts.
Document escalation path when all iterations fail (human ticket, not silent guess).
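The logging and alerting items can be sketched together. The `ReflectionTrace` fields and the `tolerance` value are assumptions; the 0.19 baseline is the re-retrieve rate reported for Harbor Legal above, and a different deployment would measure its own.

```python
import json
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass
class ReflectionTrace:
    query_id: str
    branches: List[str]            # reflection labels in the order they fired
    iterations: int                # retrieve passes used for this query
    abstain_reason: Optional[str] = None


def to_log_line(trace: ReflectionTrace) -> str:
    """Serialize one trace as a JSON line for the query log."""
    return json.dumps(asdict(trace))


def re_retrieve_rate(traces: List[ReflectionTrace]) -> float:
    """Fraction of queries that needed more than one retrieve pass."""
    if not traces:
        return 0.0
    return sum(1 for t in traces if t.iterations > 1) / len(traces)


def should_alert(traces: List[ReflectionTrace], baseline: float = 0.19, tolerance: float = 0.10) -> bool:
    """Flag a spike in re-retrieval relative to the expected baseline."""
    return re_retrieve_rate(traces) > baseline + tolerance
```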

Key takeaways

Self-RAG teaches the generator to critique its own retrieval and drafts via reflection tokens.
IsREL filters chunks; IsSUP grounds sentences; IsUSE gates shipping — use all three in regulated domains.
CRAG fixes bad retrieval; Self-RAG fixes bad generation — they address different failure modes.
Harbor Legal cut unsupported claims from 28% to 7% with +1.3 s p95 and higher abstain on unanswerables.
Start with structured critique prompts; fine-tune reflection adapters when trace volume justifies it.