
LLM RAFT explained

Harbor Compliance's internal policy assistant indexed 3,800 SOC 2 controls, GDPR articles, and vendor security questionnaires. Retrieval worked — the correct policy paragraph usually appeared in the top five chunks. The base model still answered from parametric memory: when a 2025 data-retention update shortened log retention from 90 to 30 days, the bot quoted the old window in 34% of retention-intent probes even though the new paragraph was retrieved first. Engineers had already tried embedding fine-tuning and reranking; recall was not the bottleneck — context utilization was.

They adopted RAFT (Retrieval Augmented Fine-Tuning): supervised fine-tuning on bundles of retrieved documents that mix the oracle (gold) passage with realistic distractors, paired with chain-of-thought answers that cite the correct source. Policy-misread answers fell from 34% to 9%; faithfulness on a 200-question holdout rose from 61% to 88%. This guide covers the RAFT training recipe, oracle/distractor curriculum design, data construction, the Harbor Compliance refactor, a decision table that places RAFT against plain RAG and Corrective RAG (CRAG), pitfalls, and a production checklist.

What RAFT changes in the stack

Standard RAG prepends retrieved chunks to the prompt and hopes the base model attends to them. In practice, strong pretrained models often:

  • Answer from memorized training data when chunks conflict with parametric knowledge.
  • Blend facts across distractor passages into a plausible but wrong synthesis.
  • Ignore retrieved text when the question matches a familiar pattern from pretraining.

RAFT does not replace retrieval. It teaches the generator to read a realistic retrieved bundle — including noise — and produce answers grounded in the oracle document. Training mirrors inference: at serving time you still run your vector or hybrid index; the fine-tuned model is better at using whatever comes back.

Contrast with generic SFT on question-answer pairs without retrieved context, which can improve tone but does not train document-selection behavior. RAFT sits in the hybrid zone described in the fine-tuning vs RAG guide: retrieval supplies fresh facts; RAFT teaches the model how to consume them.

The RAFT training recipe

For each training example you need: a question, a document bundle (oracle + distractors), and a supervised answer that reasons over the bundle and cites the oracle.

  1. Build a golden QA set with question, gold answer, and source document ID (from human labels or verified traces).
  2. Simulate retrieval for each question: include the oracle document plus k distractors sampled from (a) top dense hits that are wrong, (b) same-topic near-misses, or (c) random in-corpus noise. Shuffle the bundle order so the model cannot learn a position bias.
  3. Format the prompt like production RAG: system instructions, enumerated documents with titles or IDs, then the user question.
  4. Supervise chain-of-thought + answer: the target output quotes or paraphrases from the oracle, explicitly discards distractors when relevant, and ends with the final user-facing answer. CoT can be stripped at inference if latency-sensitive.
  5. Fine-tune with parameter-efficient methods (LoRA/QLoRA) on 1–10k examples; full fine-tune is rarely needed for domain adapters.
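Steps 2 to 4 above can be sketched as a small bundle builder. This is a minimal illustration, not Harbor's pipeline: the `Doc` fields, the `render_target` callback, and the prompt layout are all assumptions made for the example. The target is rendered after shuffling because the reasoning text has to name the oracle by its final position in the bundle.

```python
import random
from dataclasses import dataclass


@dataclass
class Doc:
    doc_id: str
    title: str
    text: str


def build_bundle(oracle, distractor_pool, k, rng):
    """Return k distractors plus the oracle, in shuffled order."""
    candidates = [d for d in distractor_pool if d.doc_id != oracle.doc_id]
    bundle = rng.sample(candidates, k) + [oracle]  # raises if pool has < k docs
    rng.shuffle(bundle)  # the oracle lands at a random rank
    return bundle


def format_prompt(system, bundle, question):
    """Enumerate documents the way a production RAG prompt would."""
    parts = [system, ""]
    for i, doc in enumerate(bundle, start=1):
        parts.append(f"Doc [{i}] {doc.title}\n{doc.text}\n")
    parts.append(f"Question: {question}")
    return "\n".join(parts)


def make_example(system, question, render_target, oracle, pool, k=4, seed=0):
    """Build one training record; render_target(oracle_rank) returns the
    supervised reasoning + answer text for that oracle position."""
    rng = random.Random(seed)
    bundle = build_bundle(oracle, pool, k, rng)
    oracle_rank = 1 + [d.doc_id for d in bundle].index(oracle.doc_id)
    return {
        "prompt": format_prompt(system, bundle, question),
        "completion": render_target(oracle_rank),
        "oracle_rank": oracle_rank,
    }
```

Storing `oracle_rank` on each record makes it easy to check later that ranks are spread evenly across the training set.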

The distractor curriculum is the core insight. Training on oracle-only context teaches the model an unrealistic setting where every paragraph is relevant. Mixing in hard negatives matches the error mode of live retrieval and sharply reduces distraction failures at inference.

Oracle and distractor design

Distractor quality dominates RAFT outcomes. Weak distractors produce a model that still collapses when retrieval returns plausible wrong hits.

Distractor sources (hardest to easiest)

  • Retriever false positives — chunks that rank high but human graders marked irrelevant. Best match to production failure modes.
  • Same-entity wrong version — superseded policy, old pricing table, deprecated API doc. Tests whether the model follows retrieved text over parametric memory.
  • Adjacent topic — GDPR Article 17 when the question is Article 15 access requests. Trains fine-grained discrimination.
  • Random in-corpus — easy negatives; use sparingly so the model does not learn “ignore everything.”
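One way to keep random negatives rare is to sample the distractor source by weight. The sketch below assumes each question already has per-source candidate lists; the source names and the weights in `DEFAULT_MIX` are hypothetical placeholders, not values from the Harbor project.

```python
import random

# Hypothetical mix: weights favour the harder sources listed above.
DEFAULT_MIX = {
    "retriever_false_positive": 0.4,
    "superseded_version": 0.3,
    "adjacent_topic": 0.2,
    "random_in_corpus": 0.1,
}


def sample_distractors(pools, k, rng, mix=DEFAULT_MIX):
    """Draw up to k distinct distractor IDs, picking a source per draw by weight.

    pools maps a source name to the candidate doc IDs for one question.
    Returns (doc_id, source) pairs; fewer than k if candidates run out.
    """
    remaining = {s: list(ids) for s, ids in pools.items() if s in mix}
    chosen = []
    seen = set()
    while len(chosen) < k:
        live = [s for s, ids in remaining.items() if ids]
        if not live:
            break  # every pool is exhausted
        source = rng.choices(live, weights=[mix[s] for s in live])[0]
        doc_id = remaining[source].pop(rng.randrange(len(remaining[source])))
        if doc_id not in seen:  # the same doc may sit in two pools
            seen.add(doc_id)
            chosen.append((doc_id, source))
    return chosen
```

Keeping the source label next to each ID lets you later break down eval failures by distractor type.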

Bundle size and oracle presence

Match inference: if production passes top-5 chunks, train on bundles of five. Always include exactly one oracle in training unless you are explicitly teaching abstention (see pitfalls). Vary oracle rank position (1st through 5th) so the model does not assume the first chunk is always correct.

For domains with structured metadata (policy version, effective date), include those fields in the document header so the model learns to prefer effective_date=2025-03-01 over older versions — complementary to freshness decay at retrieval time.
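A document header carrying those fields might be rendered as follows. This is a sketch under assumptions: `PolicyChunk`, the `policy_version` field name, and the pipe-separated header layout are illustrative, and only `effective_date` comes from the text above.

```python
from dataclasses import dataclass
from datetime import date


@dataclass
class PolicyChunk:
    doc_id: str
    title: str
    policy_version: str   # hypothetical field name
    effective_date: date
    text: str


def render_chunk(index, chunk):
    """Render one chunk with its metadata header, as seen by the generator."""
    header = (
        f"Doc [{index}] {chunk.title} | doc_id={chunk.doc_id} "
        f"| policy_version={chunk.policy_version} "
        f"| effective_date={chunk.effective_date.isoformat()}"
    )
    return f"{header}\n{chunk.text}"
```

Whatever layout you choose, use the same renderer for training bundles and for serving so the header format the model learned is the one it sees in production.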

Supervision format and evaluation

RAFT targets should look like what you want at inference, minus optional CoT stripping:

## Reasoning
Doc [2] states 30-day retention for application logs (effective 2025-03-01).
Doc [1] is the superseded 90-day policy — ignore.
Doc [4] covers backup tapes, not application logs.

## Answer
Application logs must be retained for 30 days per the 2025 data retention policy.
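If outputs follow the two-heading layout above, stripping the reasoning at serving time is a small parsing step. The sketch assumes the model emits exactly the `## Reasoning` and `## Answer` headings and refers to documents as `Doc [n]`; the function name is hypothetical.

```python
import re

ANSWER_HEADING = re.compile(r"^## Answer[ \t]*$", re.MULTILINE)
DOC_REF = re.compile(r"Doc \[(\d+)\]")


def split_output(raw):
    """Split a model output into (reasoning, answer, mentioned doc numbers).

    If no '## Answer' heading is found, the whole text is treated as the answer.
    """
    match = ANSWER_HEADING.search(raw)
    if match is None:
        return "", raw.strip(), []
    reasoning = raw[: match.start()].replace("## Reasoning", "", 1).strip()
    answer = raw[match.end():].strip()
    # Includes documents the reasoning rejected, not only the one it relied on.
    mentioned = sorted({int(n) for n in DOC_REF.findall(reasoning)})
    return reasoning, answer, mentioned
```

Return only `answer` to the user, and write `reasoning` and `mentioned` to logs so that citations stay auditable.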

Evaluate with the same metrics as standard RAG plus context utilization:

  • Faithfulness / groundedness — is every claim supported by the oracle (or abstain)? See NLI faithfulness.
  • Context precision@k — retrieval quality; RAFT does not fix recall if the oracle never appears.
  • Distractor resistance — holdout probes where oracle is present but ranked 3rd–5th among hard distractors.
  • Parametric override rate — questions where retrieved text contradicts common pretraining knowledge (policy updates, numeric changes).
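The last two metrics reduce to conditional rates over graded probes. The sketch below assumes each eval record is a dict with hypothetical keys `oracle_rank`, `correct`, `contradicts_pretraining`, and `answered_from_memory`, all filled in by your grading step. It also reads "parametric override rate" as the share of contradiction probes where the answer followed memory instead of the retrieved text, which is one possible interpretation.

```python
def rate(records, predicate, subset=lambda r: True):
    """Share of records in `subset` for which `predicate` holds, or None if empty."""
    pool = [r for r in records if subset(r)]
    if not pool:
        return None
    return sum(1 for r in pool if predicate(r)) / len(pool)


def distractor_resistance(records):
    """Accuracy on probes where the oracle is present but ranked 3rd or lower."""
    return rate(
        records,
        lambda r: r["correct"],
        lambda r: r["oracle_rank"] is not None and r["oracle_rank"] >= 3,
    )


def parametric_override_rate(records):
    """On probes where retrieved text contradicts pretraining knowledge,
    the share where the answer followed memory instead of the text."""
    return rate(
        records,
        lambda r: r["answered_from_memory"],
        lambda r: r["contradicts_pretraining"],
    )
```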

Run RAG evaluation before and after RAFT on identical retrieval traces so you isolate generator gains from index changes.

Harbor Compliance refactor

Harbor's baseline: GPT-4 class model, hybrid retrieval, cross-encoder rerank, 5 chunks per query. Retention and subprocessors were the worst intent classes.

Metric (200-probe holdout)       | Before RAFT | After RAFT (LoRA on 7B)
Policy-misread answers           | 34%         | 9%
Faithfulness (NLI entailment)    | 61%         | 88%
Oracle rank 4–5, still correct   | 22%         | 71%
p95 latency (CoT stripped)       | 1.9 s       | 1.2 s

Training set: 6,200 examples from historical tickets with lawyer-verified answers. Distractors: top-8 retriever hits with 3–5 randomly subsampled per example; oracle injected at random rank. LoRA rank 16 on attention projections; three epochs. They kept the frontier model for ambiguous escalations only — RAFT adapter on a 7B served 78% of traffic at one-fifth token cost.
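As a rough worked example of what that traffic split implies, the blended cost can be computed under two assumptions that the figures above do not confirm: "one-fifth token cost" means the adapter's per-request cost is 0.2 of the frontier model's, and escalated requests cost the same as before.

```python
def blended_cost(adapter_share, adapter_cost_ratio):
    """Token cost relative to sending all traffic to the frontier model.

    Assumes equal token volume per request on both paths.
    """
    frontier_share = 1 - adapter_share
    return adapter_share * adapter_cost_ratio + frontier_share * 1.0


# 78% of traffic at 0.2x cost, 22% at full cost
print(round(blended_cost(0.78, 0.2), 3))  # 0.376
```

Under those assumptions the overall bill would be roughly 38% of the all-frontier baseline; the real figure depends on how long escalated conversations run.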

RAFT did not remove the need for Corrective RAG on out-of-corpus questions: when retrieval returned no oracle, the RAFT model abstained more often (good) but still needed a fallback search path for novel vendor questionnaires.

Technique decision table

Your bottleneck | Prefer | Why not RAFT alone
Retriever misses the gold doc | Embedding fine-tune, hybrid search, query expansion | RAFT cannot cite a document that never appears in the bundle
Gold doc retrieved but model ignores it | RAFT | n/a
Wrong chunks ranked above gold | Reranking + CRAG fallback | RAFT helps rank-4/5 oracle cases; rank-8 misses need retrieval fixes
Style/format only (no factual grounding) | Thin LoRA SFT without documents | RAFT adds training cost without grounding benefit
Knowledge changes weekly | RAG + RAFT on stable reasoning patterns | Do not memorize facts in weights; keep corpus as source of truth
No labeled QA or source IDs | Improve retrieval + CRAG; label first | RAFT needs golden oracle links per question
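To see which row applies, failed probes on a golden set can be sorted by where the oracle landed. This is a minimal sketch: the labels and the `oracle_rank` and `k_served` inputs are hypothetical, and it covers only the first three rows of the table.

```python
from collections import Counter


def classify_probe(oracle_rank, k_served, answer_correct):
    """Label one golden-set probe by its likely bottleneck.

    oracle_rank: 1-based rank of the gold doc in the retriever's candidate
                 list, or None if it was not retrieved at all.
    k_served:    number of chunks actually passed to the generator.
    """
    if answer_correct:
        return "ok"
    if oracle_rank is None:
        return "retriever_miss"             # fix recall first
    if oracle_rank > k_served:
        return "ranking"                    # gold retrieved but cut off
    return "generator_ignores_context"      # the case RAFT targets


def tally(probes):
    """Count labels over (oracle_rank, k_served, answer_correct) tuples."""
    return Counter(classify_probe(*p) for p in probes)
```

If most failures fall under `generator_ignores_context`, RAFT is the likely next step; otherwise the retrieval rows of the table apply.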

Stack order that worked at Harbor: fix recall → rerank → RAFT generator → CRAG fallback for empty/incorrect retrieval → NLI faithfulness gate on high-risk answers.
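That ordering can be expressed as a thin orchestration function with every stage injected as a callable. The sketch is an assumption-heavy outline: all parameter names are hypothetical, `generate` is assumed to return `None` when the model abstains, and the fallback trigger here is plain abstention or empty retrieval instead of the retrieval evaluator a full Corrective RAG setup would use.

```python
def answer_query(question, retrieve, rerank, generate, fallback_search,
                 is_faithful, is_high_risk, k=5):
    """Retrieve -> rerank -> generate -> fallback -> faithfulness gate."""
    chunks = rerank(question, retrieve(question))[:k]
    answer = generate(question, chunks) if chunks else None

    used_fallback = False
    if answer is None:  # empty retrieval or the generator abstained
        used_fallback = True
        chunks = fallback_search(question)[:k]
        answer = generate(question, chunks) if chunks else None

    if answer is None:
        return {"status": "escalate", "reason": "no_grounded_answer",
                "used_fallback": used_fallback}

    # Gate only high-risk answers, as in the stack order above.
    if is_high_risk(question) and not is_faithful(answer, chunks):
        return {"status": "escalate", "reason": "faithfulness_gate",
                "used_fallback": used_fallback}

    return {"status": "answered", "answer": answer, "chunks": chunks,
            "used_fallback": used_fallback}
```

Injecting the stages keeps the function testable with stubs and lets the generator be swapped between the adapter and a frontier model without touching the control flow.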

Common pitfalls

  • Oracle-only training bundles — model learns to trust every paragraph; distractor failure rate stays high at inference.
  • Memorizing answers without document headers — shuffled doc IDs in eval expose parametric cheating; always hold out entire document clusters.
  • Stale training corpus — RAFT weights entrench old reasoning patterns; retrain or refresh when policy structure changes, not just re-index.
  • Skipping CoT during training then expecting cite behavior — at minimum supervise which doc ID supported each claim.
  • Bundle size mismatch — train on 3 docs, serve 10; attention patterns do not transfer.
  • Using RAFT instead of retrieval for freshness — facts still live in the index; weights teach reading, not storage.
  • No abstention examples — when oracle is absent from bundle, supervise “insufficient context” or route to search; otherwise model hallucinates from distractors.
  • Evaluating on oracle-at-rank-1 only — hides the main RAFT win on crowded context windows.
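The abstention pitfall can be addressed by copying a slice of training examples with the oracle swapped out. The sketch below makes several assumptions: examples are dicts with hypothetical keys `docs`, `oracle_id`, and `target`, the abstention wording in `ABSTAIN_TARGET` is a placeholder, and the `fraction` to use is left to you since the guide gives no ratio.

```python
import random

ABSTAIN_TARGET = (
    "## Reasoning\n"
    "None of the provided documents answer the question.\n\n"
    "## Answer\n"
    "Insufficient context."
)


def make_abstention_example(example, replacement_doc_id):
    """Copy an example, swapping the oracle for another distractor so the
    bundle size stays the same, and supervise an abstention instead."""
    docs = [replacement_doc_id if d == example["oracle_id"] else d
            for d in example["docs"]]
    return {**example, "docs": docs, "oracle_id": None,
            "target": ABSTAIN_TARGET}


def mix_in_abstentions(examples, spare_ids, fraction, rng):
    """Return examples plus abstention copies for a fraction of them."""
    picked = rng.sample(examples, int(len(examples) * fraction))
    extra = []
    for ex in picked:
        # Raises IndexError if no spare doc is left outside the bundle.
        options = [s for s in spare_ids if s not in ex["docs"]]
        extra.append(make_abstention_example(ex, rng.choice(options)))
    return examples + extra
```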

Production checklist

  • Confirm retrieval recall@k includes oracle on a golden set before RAFT investment.
  • Label 2k+ examples with question, answer, and oracle document ID.
  • Build distractor sampler from live retriever false positives, not random only.
  • Shuffle oracle rank and document order in every training example.
  • Supervise reasoning that names doc IDs and rejects distractors.
  • Hold out entire policy families or doc clusters from training for honest eval.
  • Measure faithfulness and distractor-resistance separately from retrieval MRR.
  • Strip CoT at inference if latency requires; keep cite metadata in logs.
  • Version RAFT adapters with corpus semver; retrain when schema or policy taxonomy shifts.
  • Keep CRAG or fallback search for bundles where no oracle appears at inference.
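The cluster-holdout item in the checklist is easy to get wrong with a row-level random split. One way to do it is to assign whole clusters to the holdout by hashing the cluster ID. The `cluster_id` key and the 15% default are assumptions for illustration.

```python
import hashlib


def is_holdout(cluster_id, holdout_pct=15):
    """Deterministically assign a whole cluster to the holdout set."""
    digest = hashlib.sha256(cluster_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") % 100
    return bucket < holdout_pct


def split_by_cluster(examples, holdout_pct=15):
    """Split examples so no cluster appears in both train and holdout."""
    train, holdout = [], []
    for ex in examples:
        target = holdout if is_holdout(ex["cluster_id"], holdout_pct) else train
        target.append(ex)
    return train, holdout
```

Because assignment depends only on the cluster ID, the split stays stable across retraining runs, so a policy family held out for one adapter version remains held out for the next. The realized holdout share will differ from `holdout_pct` when clusters are few or uneven in size.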

Key takeaways

  • RAFT fine-tunes the generator to use retrieved context, not the retriever itself.
  • Training bundles must include hard distractors that mirror live retrieval noise.
  • Chain-of-thought supervision with explicit doc selection beats answer-only SFT for grounding.
  • Harbor Compliance cut policy-misread answers from 34% to 9% without changing the index.
  • Stack RAFT after recall is adequate; pair with CRAG and faithfulness gates for production.
