Guide

LLM fine-tuning vs RAG explained

Harbor Legal's contract-review assistant launched after a six-week fine-tuning sprint on 40,000 annotated indemnity and limitation-of-liability clauses. Lawyers loved the tone — crisp, cautious, always in passive voice — and early accuracy on held-out clauses hit 91%. Three weeks later, the firm adopted a new master services agreement with a $2M indemnity cap. The bot kept citing the old $500k language with equal confidence. Retraining would take another GPU week, legal sign-off, and a regression pass nobody had scheduled. Partners stopped trusting citations; usage fell 58% in a month.

The rebuild kept a thin LoRA adapter for style and task format but moved all factual clause text into a RAG corpus with version tags and source attribution. Policy updates now ship as index refreshes in hours, not retraining cycles in weeks. Citation accuracy on live agreements rose from 74% to 96%; hallucinated clause numbers dropped 83%. This guide covers what fine-tuning and RAG each change in the model stack, a decision taxonomy by problem type, cost and freshness tradeoffs, evaluation design, hybrid patterns, the Harbor Legal refactor, a technique decision table versus “fine-tune everything,” pitfalls, and a production checklist.

What each technique actually changes

Teams argue about fine-tuning versus RAG because the labels sound interchangeable in slide decks. They operate on different layers.

Dimension	Fine-tuning (SFT / LoRA / full)	RAG (retrieval + prompt)
What moves	Model weights (behavior, style, format priors)	External index (documents, embeddings, metadata)
Knowledge location	Compressed into parameters; hard to inspect per fact	Explicit chunks with IDs, versions, and ACLs
Update cadence	Retrain or adapter refresh; slow, needs eval gates	Re-index documents; fast if pipeline exists
Provenance	Opaque; model “knows” without citation	Natural fit for footnotes and source links
Typical cost driver	GPU training, data labeling, regression suites	Embedding + vector DB + retrieval latency
Failure mode	Stale memorized facts; catastrophic forgetting on narrow data	Missed retrieval; context stuffing; wrong chunk ranking

Fine-tuning teaches how to behave. RAG supplies what is true right now. Conflating the two leads to the Harbor Legal failure mode: beautiful answers about facts that left the building months ago.

Decision taxonomy by problem type

Map your use case to the primary lever before debating vendors.

Problem	Prefer	Why
Volatile factual knowledge (policies, prices, laws)	RAG	Facts change faster than you can retrain; citations required
Stable domain reasoning (SQL dialect, API schema shape)	Fine-tuning	Behavioral prior beats retrieving examples every call
Brand voice and output format	Fine-tuning or prompt	Small LoRA or strong system prompt; RAG does not fix tone alone
Long-tail entity lookup (SKUs, ticket IDs, employee directory)	RAG	Exact-match retrieval scales; fine-tuning cannot memorize millions of rows
Tool-calling discipline and JSON shape	Fine-tuning + schema	Structured outputs and SFT on tool traces beat hoping retrieval teaches syntax
Proprietary reasoning on public base model	Fine-tuning	Trade secrets in weights may be acceptable; never put secrets only in RAG without ACLs
Multi-hop synthesis across many documents	RAG + agent	Agentic RAG with query decomposition; fine-tuning alone does not add new pages at runtime
Latency-sensitive high QPS FAQ	Hybrid	Cache top questions; RAG for long tail; optional small fine-tuned router

Freshness, cost, and latency

Knowledge half-life

Estimate how often ground truth changes. If policy docs update weekly, any full fine-tune is already obsolete at deploy. RAG index pipelines should target time-to-live under one business day for regulated content. If rules are stable for years (chess notation, legacy COBOL patterns), fine-tuning on curated examples is cheaper per query at scale.

Total cost of ownership

Fine-tuning bills once per version in GPU hours, labeler time, and eval reruns. RAG bills continuously in embedding API calls, vector storage, and larger prompt tokens per request. Break even depends on query volume: at 50k requests/day, a $8k fine-tune that removes 2k tokens of retrieved context per call often pays back in weeks. At 500 requests/day, RAG plus prompt engineering wins without training overhead.

Latency budget

Retrieval adds 50–400 ms depending on hybrid search, rerankers, and chunk count. Fine-tuned models skip that hop but may need longer generations if the answer was not internalized. Measure P95 end-to-end; a fast wrong fine-tuned answer is not cheaper than a slower cited RAG answer in compliance workflows.

Evaluation design

You cannot choose fine-tuning versus RAG without metrics that separate behavior from facts:

Factuality / citation accuracy — does each claim match an authorized source? RAG-heavy systems score this directly; fine-tuned models need human or LLM-judge checks against a gold document set.
Retrieval recall@k — is the correct chunk in the top-k context? Failures here look like model hallucinations but are search bugs.
Format compliance — JSON schema, section headers, mandatory disclaimers. Fine-tuning and constrained decoding excel; RAG does not fix format drift alone.
Style rubric — tone, reading level, brevity. Often cheaper to fine-tune or prompt than to retrieve style examples.
Freshness suite — questions about documents added or edited after the last training cutoff. Fine-tuned-only stacks should score near zero here by design; RAG should score high.

Run evals on every corpus version and every adapter version independently. A perfect retriever with a fine-tuned model trained on old law will still fail freshness tests if the adapter memorized superseded caps.

Hybrid patterns that work in production

RAG for facts + LoRA for format — Harbor Legal's end state: retrieve versioned clauses; adapter enforces heading structure and risk-flag vocabulary.
Fine-tuned retriever + general LLM — train embedding or cross-encoder on domain query-passage pairs; keep generator frozen on a frontier model.
Router model — small classifier sends stable-FAQ queries to a fine-tuned shortcut and volatile queries to full RAG.
Semantic cache — cache answers keyed by embedding of normalized questions; invalidate on document version bump.
Distillation from RAG traces — log retrieval + completion pairs; periodically distill into a smaller on-device model for offline or edge latency. Refresh when eval drift exceeds threshold.

Hybrids add operational complexity. Document which layer owns which failure mode in runbooks so on-call engineers do not retrain when they should re-index.

Harbor Legal contract assistant refactor

Harbor Legal replaced the monolithic fine-tune with a three-layer stack:

Corpus — MSAs, SOWs, and playbooks chunked by clause ID with effective_date and supersedes metadata; chunk boundaries aligned to legal section headers, not fixed token windows.
Retrieval — hybrid BM25 + dense search, cross-encoder rerank top 20 to 5, mandatory inclusion of active MSA version flag in filter.
Adapter — LoRA on 3k examples for output skeleton only (summary, risk flags, recommended redlines); no clause text in training set.
Guardrails — answers must include source_clause_id citations; validator rejects completions citing deprecated versions.
Ops — legal ops uploads revised PDF; ingestion pipeline completes within 2 hours; freshness eval suite runs automatically before traffic shift.

Citation accuracy on live agreements: 74% → 96%. Median time to reflect policy change: 18 days → 2.4 hours. Training spend per quarter dropped 62% while GPU inference cost rose 11% — net savings with higher trust.

Technique decision table

Scenario	Fine-tune everything	RAG-first (optional thin adapter)
Weekly policy / pricing updates	Stale answers; expensive retrains	Preferred — versioned index + citations
SQL or DSL generation for fixed schema	Preferred — behavioral prior	RAG for schema docs only as supplement
Customer support on 10k help articles	Cannot memorize all articles reliably	Preferred — retrieve + cite
Consistent brand voice at scale	Preferred — small LoRA	Prompt often sufficient; RAG irrelevant
Regulated advice with audit trail	Weak provenance	Preferred — mandatory source attribution
Ultra-low latency on device	Preferred after distillation	RAG costly; distill from RAG logs periodically

Common pitfalls

Fine-tuning facts that change. Indemnity caps, SKUs, and interest rates belong in retrieval, not weights.
RAG without eval. Low retrieval recall looks like a “dumb model” in triage.
Giant context instead of RAG. Stuffing 200k tokens is not a substitute for search; cost and lost-in-the-middle errors return.
Training on retrieved text. Leaks future retrieval into weights and confuses version boundaries.
No ACL on chunks. RAG exposes any indexed document to any user with a clever query.
Ignoring adapter drift. LoRA stays while corpus moves; style matches but disclaimers age.
One-shot choice. Products evolve; plan hybrid migration paths in advance.

Production checklist

Problem mapped to fact volatility, provenance needs, and format requirements.
Decision doc states what lives in weights vs index vs prompt.
Freshness eval suite with post-cutoff questions; run on every deploy.
Retrieval recall@k and citation accuracy tracked separately from fluency.
Corpus version metadata (effective_date, supersedes) enforced at index time.
Chunking strategy aligned to document structure, not arbitrary token splits.
Hybrid search and reranker tuned on real user queries, not synthetic only.
Fine-tune datasets exclude volatile factual text you expect to change.
Re-index playbook with SLA; retrain playbook with separate approval gate.
Runbooks attribute failures to retrieval, generation, or adapter layer.
Cost model updated quarterly: training amortization vs per-query retrieval tokens.
Migration path documented if starting with wrong technique.

Key takeaways

Fine-tuning shapes behavior; RAG supplies facts. Swapping one for the other fixes the wrong problem.
Freshness is the usual tiebreaker. If truth changes faster than you ship models, retrieve.
Hybrids are normal. Thin adapters plus versioned corpora beat religious purity.
Eval separately. Retrieval recall and citation accuracy are not the same as loss curves.
Provenance matters. Regulated and legal workflows need citations RAG provides natively.