LLM fine-tuning vs RAG explained
Harbor Legal's contract-review assistant launched after a six-week fine-tuning sprint on 40,000 annotated indemnity and limitation-of-liability clauses. Lawyers loved the tone — crisp, cautious, always in passive voice — and early accuracy on held-out clauses hit 91%. Three weeks later, the firm adopted a new master services agreement with a $2M indemnity cap. The bot kept citing the old $500k language with equal confidence. Retraining would take another GPU week, legal sign-off, and a regression pass nobody had scheduled. Partners stopped trusting citations; usage fell 58% in a month.
The rebuild kept a thin LoRA adapter for style and task format but moved all factual clause text into a RAG corpus with version tags and source attribution. Policy updates now ship as index refreshes in hours, not retraining cycles in weeks. Citation accuracy on live agreements rose from 74% to 96%; hallucinated clause numbers dropped 83%. This guide covers what fine-tuning and RAG each change in the model stack, a decision taxonomy by problem type, cost and freshness tradeoffs, evaluation design, hybrid patterns, the Harbor Legal refactor, a technique decision table versus “fine-tune everything,” pitfalls, and a production checklist.
What each technique actually changes
Teams argue about fine-tuning versus RAG because the labels sound interchangeable in slide decks. They operate on different layers.
| Dimension | Fine-tuning (SFT / LoRA / full) | RAG (retrieval + prompt) |
|---|---|---|
| What moves | Model weights (behavior, style, format priors) | External index (documents, embeddings, metadata) |
| Knowledge location | Compressed into parameters; hard to inspect per fact | Explicit chunks with IDs, versions, and ACLs |
| Update cadence | Retrain or adapter refresh; slow, needs eval gates | Re-index documents; fast if pipeline exists |
| Provenance | Opaque; model “knows” without citation | Natural fit for footnotes and source links |
| Typical cost driver | GPU training, data labeling, regression suites | Embedding + vector DB + retrieval latency |
| Failure mode | Stale memorized facts; catastrophic forgetting on narrow data | Missed retrieval; context stuffing; wrong chunk ranking |
Fine-tuning teaches how to behave. RAG supplies what is true right now. Conflating the two leads to the Harbor Legal failure mode: beautiful answers about facts that left the building months ago.
Decision taxonomy by problem type
Map your use case to the primary lever before debating vendors.
| Problem | Prefer | Why |
|---|---|---|
| Volatile factual knowledge (policies, prices, laws) | RAG | Facts change faster than you can retrain; citations required |
| Stable domain reasoning (SQL dialect, API schema shape) | Fine-tuning | Behavioral prior beats retrieving examples every call |
| Brand voice and output format | Fine-tuning or prompt | Small LoRA or strong system prompt; RAG does not fix tone alone |
| Long-tail entity lookup (SKUs, ticket IDs, employee directory) | RAG | Exact-match retrieval scales; fine-tuning cannot memorize millions of rows |
| Tool-calling discipline and JSON shape | Fine-tuning + schema | Structured outputs and SFT on tool traces beat hoping retrieval teaches syntax |
| Proprietary reasoning on public base model | Fine-tuning | Trade secrets in weights may be acceptable; never put secrets only in RAG without ACLs |
| Multi-hop synthesis across many documents | RAG + agent | Agentic RAG with query decomposition; fine-tuning alone does not add new pages at runtime |
| Latency-sensitive high QPS FAQ | Hybrid | Cache top questions; RAG for long tail; optional small fine-tuned router |
Freshness, cost, and latency
Knowledge half-life
Estimate how often ground truth changes. If policy docs update weekly, a full fine-tune can already be stale on the day it deploys. RAG index pipelines should target an update lag, from source change to searchable chunk, of under one business day for regulated content. If rules are stable for years (chess notation, legacy COBOL patterns), fine-tuning on curated examples is cheaper per query at scale.
Total cost of ownership
Fine-tuning bills once per version in GPU hours, labeler time, and eval reruns. RAG bills continuously in embedding API calls, vector storage, and larger prompt tokens per request. Break-even depends on query volume: at 50k requests/day, an $8k fine-tune that removes 2k tokens of retrieved context per call often pays back in weeks. At 500 requests/day, RAG plus prompt engineering wins without training overhead.
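The break-even arithmetic can be sketched in a few lines. This is a rough model, not a pricing tool: the $3 per million input tokens is an assumed price chosen for illustration, and the function ignores retraining, eval reruns, and any difference in serving cost for the tuned model.

```python
def payback_days(
    training_cost_usd: float,
    requests_per_day: int,
    tokens_saved_per_request: int,
    usd_per_million_input_tokens: float,
) -> float:
    """Days until a one-off fine-tune is repaid by shorter prompts."""
    tokens_saved_per_day = requests_per_day * tokens_saved_per_request
    daily_saving_usd = tokens_saved_per_day / 1_000_000 * usd_per_million_input_tokens
    if daily_saving_usd <= 0:
        return float("inf")  # nothing saved, so the spend is never recovered
    return training_cost_usd / daily_saving_usd


# Assumption: input tokens cost $3 per million. Substitute your provider's rate.
ASSUMED_PRICE = 3.00

high_volume = payback_days(8_000, 50_000, 2_000, ASSUMED_PRICE)
low_volume = payback_days(8_000, 500, 2_000, ASSUMED_PRICE)

print(f"50k requests/day: {high_volume:.0f} days")
print(f"500 requests/day: {low_volume:.0f} days")
```

Under that assumed price the high-volume case pays back in about 27 days and the low-volume case in about 2,667 days, which is why the same fine-tune is sensible at one traffic level and wasteful at the other.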
Latency budget
Retrieval adds 50–400 ms depending on hybrid search, rerankers, and chunk count. Fine-tuned models skip that hop but may need longer generations if the answer was not internalized. Measure P95 end-to-end; a fast wrong fine-tuned answer is not cheaper than a slower cited RAG answer in compliance workflows.
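To make "measure P95 end-to-end" concrete, here is a minimal sketch using a nearest-rank percentile over per-request totals. The timing samples are invented for illustration; in practice they would come from request traces.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Hypothetical per-request timings in milliseconds (same request order in both lists).
retrieval_ms = [60, 80, 75, 120, 90, 400, 70, 85, 95, 110]
generation_ms = [900, 850, 950, 1000, 870, 920, 880, 910, 990, 940]

# End-to-end latency is what the user waits for, so sum the stages per request.
end_to_end = [r + g for r, g in zip(retrieval_ms, generation_ms)]

print("P50:", percentile(end_to_end, 50), "ms")
print("P95:", percentile(end_to_end, 95), "ms")
```

With these samples the median is 995 ms but P95 is 1320 ms, driven entirely by the one request whose retrieval took 400 ms. Averaging the stages separately would hide that tail.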
Evaluation design
You cannot choose fine-tuning versus RAG without metrics that separate behavior from facts:
- Factuality / citation accuracy — does each claim match an authorized source? RAG-heavy systems score this directly; fine-tuned models need human or LLM-judge checks against a gold document set.
- Retrieval recall@k — is the correct chunk in the top-k context? Failures here look like model hallucinations but are search bugs.
- Format compliance — JSON schema, section headers, mandatory disclaimers. Fine-tuning and constrained decoding excel; RAG does not fix format drift alone.
- Style rubric — tone, reading level, brevity. Often cheaper to fine-tune or prompt than to retrieve style examples.
- Freshness suite — questions about documents added or edited after the last training cutoff. Fine-tuned-only stacks should score near zero here by design; RAG should score high.
Run evals on every corpus version and every adapter version independently. A perfect retriever with a fine-tuned model trained on old law will still fail freshness tests if the adapter memorized superseded caps.
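Two of the metrics above are simple enough to sketch directly. The chunk IDs below are hypothetical, and this `citation_accuracy` only checks that cited IDs belong to the authorized set; checking that each claim actually matches its source still needs a human or LLM judge.

```python
def recall_at_k(retrieved_ids, relevant_ids, k):
    """Fraction of the relevant chunks that appear in the top-k retrieved results."""
    relevant = set(relevant_ids)
    if not relevant:
        raise ValueError("need at least one relevant chunk")
    return len(set(retrieved_ids[:k]) & relevant) / len(relevant)


def citation_accuracy(cited_ids, authorized_ids):
    """Fraction of cited chunk IDs that point at an authorized (current) source."""
    if not cited_ids:
        return 0.0  # an answer with no citations earns no credit
    authorized = set(authorized_ids)
    return sum(1 for cid in cited_ids if cid in authorized) / len(cited_ids)


# Hypothetical IDs: the v2 clauses are current, the v1 clause is superseded.
retrieved = ["msa-v2#12.1", "msa-v1#12.1", "sow-7#3", "msa-v2#12.3", "playbook#4"]
relevant = {"msa-v2#12.1", "msa-v2#12.3"}

print(recall_at_k(retrieved, relevant, k=3))   # one of two relevant chunks in the top 3
print(recall_at_k(retrieved, relevant, k=5))   # both found once k reaches 5
print(citation_accuracy(["msa-v2#12.1", "msa-v1#12.1"], relevant))
```

Tracking the two numbers separately is the point: a drop in `recall_at_k` is a search bug, while a drop in citation accuracy with healthy recall points at the generator or adapter.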
Hybrid patterns that work in production
- RAG for facts + LoRA for format — Harbor Legal's end state: retrieve versioned clauses; adapter enforces heading structure and risk-flag vocabulary.
- Fine-tuned retriever + general LLM — train embedding or cross-encoder on domain query-passage pairs; keep generator frozen on a frontier model.
- Router model — small classifier sends stable-FAQ queries to a fine-tuned shortcut and volatile queries to full RAG.
- Semantic cache — cache answers keyed by embedding of normalized questions; invalidate on document version bump.
- Distillation from RAG traces — log retrieval + completion pairs; periodically distill into a smaller on-device model for offline or edge latency. Refresh when eval drift exceeds threshold.
Hybrids add operational complexity. Document which layer owns which failure mode in runbooks so on-call engineers do not retrain when they should re-index.
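The "invalidate on document version bump" rule from the cache pattern can be sketched as follows. This is a simplification: it keys on exact-match normalized text as a stand-in for embedding similarity, and the class name, version strings, and answers are all hypothetical.

```python
import re


class VersionedAnswerCache:
    """Answer cache whose entries are only valid for the corpus version they were built from."""

    def __init__(self):
        self._entries = {}  # normalized question -> (corpus_version, answer)

    @staticmethod
    def normalize(question):
        # Lowercase and strip punctuation so trivially different phrasings collide.
        return " ".join(re.findall(r"[a-z0-9]+", question.lower()))

    def put(self, question, corpus_version, answer):
        self._entries[self.normalize(question)] = (corpus_version, answer)

    def get(self, question, corpus_version):
        key = self.normalize(question)
        entry = self._entries.get(key)
        if entry is None:
            return None
        cached_version, answer = entry
        if cached_version != corpus_version:
            del self._entries[key]  # stale after a version bump; force a fresh RAG call
            return None
        return answer


cache = VersionedAnswerCache()
cache.put("What is the indemnity cap?", "msa-v1", "$500k [msa-v1#12.1]")

print(cache.get("what is the indemnity cap", "msa-v1"))   # hit: same version
print(cache.get("What is the indemnity cap?", "msa-v2"))  # miss: corpus moved on
```

Tying every entry to a corpus version means a re-index automatically retires cached answers, so the cache cannot reintroduce the stale-fact problem that RAG was meant to remove.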
Harbor Legal contract assistant refactor
Harbor Legal replaced the monolithic fine-tune with a layered stack:
- Corpus — MSAs, SOWs, and playbooks chunked by clause ID with `effective_date` and `supersedes` metadata; chunk boundaries aligned to legal section headers, not fixed token windows.
- Retrieval — hybrid BM25 + dense search, cross-encoder rerank top 20 to 5, mandatory inclusion of active MSA version flag in filter.
- Adapter — LoRA on 3k examples for output skeleton only (summary, risk flags, recommended redlines); no clause text in training set.
- Guardrails — answers must include `source_clause_id` citations; validator rejects completions citing deprecated versions.
- Ops — legal ops uploads revised PDF; ingestion pipeline completes within 2 hours; freshness eval suite runs automatically before traffic shift.
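A guardrail of the kind described in the list above might look like the sketch below. Harbor Legal's actual validator is not shown in this guide, so everything here is an assumption for illustration: the `[source_clause_id: ...]` citation syntax, the `Clause` record, the clause IDs, and the dates.

```python
import re
from dataclasses import dataclass
from datetime import date
from typing import Iterable, Optional, Set, Tuple


@dataclass(frozen=True)
class Clause:
    clause_id: str
    effective_date: date
    supersedes: Optional[str] = None  # ID of the clause this one replaces, if any


# Assumed citation syntax; adapt the pattern to whatever the adapter emits.
CITATION = re.compile(r"\[source_clause_id:\s*([^\]\s]+)\]")


def deprecated_ids(clauses: Iterable[Clause], as_of: date) -> Set[str]:
    """IDs replaced by a clause that is already in effect on as_of."""
    return {c.supersedes for c in clauses if c.supersedes and c.effective_date <= as_of}


def validate(completion: str, clauses: Iterable[Clause], as_of: date) -> Tuple[bool, str]:
    """Reject completions with no citation, an unknown citation, or a deprecated one."""
    clauses = list(clauses)
    cited = CITATION.findall(completion)
    if not cited:
        return False, "no source_clause_id citation"
    known = {c.clause_id for c in clauses}
    stale = deprecated_ids(clauses, as_of)
    for cid in cited:
        if cid not in known:
            return False, f"unknown clause {cid}"
        if cid in stale:
            return False, f"deprecated clause {cid}"
    return True, "ok"


# Hypothetical corpus: v2 of clause 12.1 replaces v1 from 1 June 2024.
corpus = [
    Clause("msa-v1#12.1", date(2023, 1, 1)),
    Clause("msa-v2#12.1", date(2024, 6, 1), supersedes="msa-v1#12.1"),
]
today = date(2024, 7, 1)

print(validate("Cap is $500k [source_clause_id: msa-v1#12.1]", corpus, today))
print(validate("Cap is $2M [source_clause_id: msa-v2#12.1]", corpus, today))
```

Because deprecation is derived from `supersedes` and `effective_date` at validation time, uploading a new agreement is enough to start rejecting citations of the old one; no retraining or rule change is involved.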
Citation accuracy on live agreements: 74% → 96%. Median time to reflect policy change: 18 days → 2.4 hours. Training spend per quarter dropped 62% while GPU inference cost rose 11% — net savings with higher trust.
Technique decision table
| Scenario | Fine-tune everything | RAG-first (optional thin adapter) |
|---|---|---|
| Weekly policy / pricing updates | Stale answers; expensive retrains | Preferred — versioned index + citations |
| SQL or DSL generation for fixed schema | Preferred — behavioral prior | RAG for schema docs only as supplement |
| Customer support on 10k help articles | Cannot memorize all articles reliably | Preferred — retrieve + cite |
| Consistent brand voice at scale | Preferred — small LoRA | Prompt often sufficient; RAG irrelevant |
| Regulated advice with audit trail | Weak provenance | Preferred — mandatory source attribution |
| Ultra-low latency on device | Preferred after distillation | RAG costly; distill from RAG logs periodically |
Common pitfalls
- Fine-tuning facts that change. Indemnity caps, SKUs, and interest rates belong in retrieval, not weights.
- RAG without eval. Low retrieval recall looks like a “dumb model” in triage.
- Giant context instead of RAG. Stuffing 200k tokens is not a substitute for search; cost and lost-in-the-middle errors return.
- Training on retrieved text. Bakes versioned facts into the weights, so the adapter can contradict the index after the next corpus update.
- No ACL on chunks. RAG exposes any indexed document to any user with a clever query.
- Ignoring adapter drift. LoRA stays while corpus moves; style matches but disclaimers age.
- One-shot choice. Products evolve; plan hybrid migration paths in advance.
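The ACL pitfall above has a simple shape in code: filter candidate chunks by the caller's permissions before anything reaches the prompt. This is a minimal in-memory sketch with hypothetical group names and chunk IDs; a production system would more likely push the same predicate into the vector store query so that top-k is filled only from permitted chunks.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    allowed_groups: FrozenSet[str]


def authorized(chunks: Iterable[Chunk], user_groups: Iterable[str]) -> List[Chunk]:
    """Keep only chunks that share at least one group with the requesting user."""
    groups = frozenset(user_groups)
    return [c for c in chunks if c.allowed_groups & groups]


# Hypothetical retrieval candidates with per-chunk ACLs.
candidates = [
    Chunk("msa-v2#12.1", "Indemnity cap ...", frozenset({"legal", "sales"})),
    Chunk("board-minutes#3", "Settlement strategy ...", frozenset({"partners"})),
]

visible = authorized(candidates, user_groups=["sales"])
print([c.chunk_id for c in visible])
```

A user in the `sales` group sees only the MSA clause here. The board minutes never enter the context window, so no prompt phrasing can extract them.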
Production checklist
- Problem mapped to fact volatility, provenance needs, and format requirements.
- Decision doc states what lives in weights vs index vs prompt.
- Freshness eval suite with post-cutoff questions; run on every deploy.
- Retrieval recall@k and citation accuracy tracked separately from fluency.
- Corpus version metadata (`effective_date`, `supersedes`) enforced at index time.
- Chunking strategy aligned to document structure, not arbitrary token splits.
- Hybrid search and reranker tuned on real user queries, not synthetic only.
- Fine-tune datasets exclude volatile factual text you expect to change.
- Re-index playbook with SLA; retrain playbook with separate approval gate.
- Runbooks attribute failures to retrieval, generation, or adapter layer.
- Cost model updated quarterly: training amortization vs per-query retrieval tokens.
- Migration path documented if starting with wrong technique.
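For the checklist item on structure-aligned chunking, the sketch below splits a contract on numbered section headers instead of fixed token windows. It assumes headers look like `12.1 Indemnification` at the start of a line, which is a guess about document layout; real agreements usually need a sturdier parser, and the sample text is invented.

```python
import re

# Assumed header shape: a dotted section number, whitespace, then a title.
HEADER = re.compile(r"^(\d+(?:\.\d+)*)\s+(\S.*)$")


def chunk_by_section(text):
    """Return one chunk per numbered section, keyed by its clause number."""
    sections = []
    current = None
    for line in text.splitlines():
        match = HEADER.match(line)
        if match:
            current = {"clause_id": match.group(1), "title": match.group(2).strip(), "body": []}
            sections.append(current)
        elif current is not None and line.strip():
            current["body"].append(line.strip())  # body text belongs to the open section
    return [
        {"clause_id": s["clause_id"], "title": s["title"], "text": " ".join(s["body"])}
        for s in sections
    ]


sample = """
12.1 Indemnification
Supplier shall indemnify Customer against third-party claims.
12.2 Limitation of Liability
Liability is capped as stated in the Order Form.
Caps do not apply to fraud.
"""

for chunk in chunk_by_section(sample):
    print(chunk["clause_id"], "|", chunk["title"], "|", chunk["text"])
```

Each chunk carries its own clause number, which is what makes clause-level citations and `supersedes` links possible downstream. A fixed token window would split or merge clauses and lose that identity.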
Key takeaways
- Fine-tuning shapes behavior; RAG supplies facts. Swapping one for the other fixes the wrong problem.
- Freshness is the usual tiebreaker. If truth changes faster than you ship models, retrieve.
- Hybrids are normal. Thin adapters plus versioned corpora beat religious purity.
- Eval separately. Retrieval recall and citation accuracy are not the same as loss curves.
- Provenance matters. Regulated and legal workflows need citations RAG provides natively.