Guide

LLM fine-tuning explained: LoRA, QLoRA, and when training beats prompting

A base language model knows grammar, reasoning patterns, and broad world knowledge from pre-training — but it does not know your product vocabulary, your support tone, or how to emit JSON your API expects. Fine-tuning updates model weights on a curated dataset so behavior becomes consistent without pasting a ten-page system prompt every request. It is powerful and expensive; many teams reach for it too early when retrieval-augmented generation (RAG) or better prompting would suffice. This guide explains what fine-tuning actually changes, the main techniques (full fine-tuning, LoRA, QLoRA, preference tuning), how to prepare data and evaluate results, and a practical decision framework for when training is worth the GPU bill.

What fine-tuning changes

Pre-training teaches a model to predict the next token across trillions of tokens scraped from the public internet. Supervised fine-tuning (SFT) continues that process on a smaller, task-specific dataset: instruction-response pairs, code completions, support tickets with ideal replies, or structured extraction examples. The model adjusts its internal representations so certain patterns — your brand voice, a classification label set, a function-calling schema — become the default rather than something you fight for in every prompt.

Fine-tuning does not magically inject facts that were never in training data. If your knowledge base changes daily, weights frozen at deploy time go stale. That is why production systems often combine fine-tuning (for behavior) with RAG (for facts). Fine-tuning also does not replace safety review: a model tuned on toxic examples will reproduce them faster than a base model with the same prompt guardrails.

Fine-tuning vs prompting vs RAG

These three approaches solve different problems and stack cleanly:

Prompting — zero weight changes; fastest to iterate. Works when the base model already has the skill and you only need format, few-shot examples, or chain-of-thought scaffolding. Cost is per-token at inference; long system prompts eat into your context window budget.
RAG — retrieves relevant documents at query time and injects them into the prompt. Best when answers must cite fresh or private data (docs, tickets, on-chain events) that were not in pre-training. Retrieval quality, not model weights, is usually the bottleneck.
Fine-tuning — bakes in recurring patterns: tone, output structure, domain jargon, tool-selection habits, classification boundaries. Best when the same behavioral fix would otherwise require thousands of tokens of prompt engineering on every call.

Rule of thumb: try prompting first, add RAG when facts are missing or stale, fine-tune when behavior is consistently wrong despite good prompts and retrieved context. Preference tuning (DPO, RLHF) sits on top of SFT when you need the model to choose between acceptable and unacceptable completions — useful for chat safety and alignment, not for memorizing product specs.

Full fine-tuning vs parameter-efficient methods

Full fine-tuning

Updates every weight in the model. Maximum flexibility — you can reshape capabilities deeply — but requires multiple high-memory GPUs (often 8×80 GB for a 7B model in fp16, more at 70B scale), risks catastrophic forgetting of general skills, and produces a full model copy per experiment. Reserved for teams with serious ML infra or when LoRA plateaus.

LoRA (Low-Rank Adaptation)

Freezes the base model and trains small adapter matrices inserted into attention layers. A rank-r decomposition might add only 0.1–1% of the base parameter count while capturing most task-specific signal. You ship a base model plus a few-megabyte LoRA adapter; swap adapters per customer or task without reloading 14 GB of weights. Training fits on a single consumer GPU for 7B models.

QLoRA

Quantizes the frozen base to 4-bit (NF4) during training while adapters stay in higher precision. Cuts VRAM roughly in half vs LoRA alone — a 7B QLoRA run often fits in 24 GB — at a small quality trade-off. The default starting point for most indie and startup fine-tunes in 2026.

Other variants

Adapters (bottleneck layers), prefix tuning (learned prompt embeddings), and IA³ (scaling activations) pursue the same goal: change behavior without touching every weight. LoRA dominates open-source tooling (Hugging Face PEFT, Axolotl, Unsloth) because of its strong quality-to-cost ratio and easy merging at inference time.

When fine-tuning is the right call

Fine-tuning pays off when all of these are true:

The failure mode is behavioral — wrong format, wrong tone, wrong tool choice — not missing facts.
You can collect or label hundreds to low thousands of high-quality examples (more for harder tasks; quality beats quantity).
The pattern is stable — it will not change next week when your docs update.
You will run the model often enough that a shorter prompt plus a smaller adapter beats a giant system prompt on every request.

Skip fine-tuning when: your data changes constantly (use RAG), you have fewer than ~200 good examples (use few-shot prompting), the base model already nails the task with minor prompt tweaks, or you cannot evaluate regressions on a held-out set. Autonomous agents that call tools should also harden against prompt injection — fine-tuning on adversarial examples helps, but is not a substitute for input validation and least-privilege tool scopes.

Data preparation

Garbage in, garbage weights. A fine-tune dataset is usually JSON or JSONL rows with fields like instruction, input, and output — or a multi-turn messages array matching your inference API format. Critical practices:

Match production format exactly. If inference uses tool-call blocks, training rows must include the same tags and schemas.
Deduplicate and decontaminate. Near-duplicate rows overweight certain phrases; leaking eval examples into train destroys your metrics.
Balance classes. A 95% "no escalation" label set teaches a model that never escalates.
Include hard negatives. Show correct refusals, ambiguous cases, and edge inputs the model will see live.
Human review a random sample. One mislabeled row repeated 500 times is a feature, not noise.

Synthetic data from a larger teacher model can bootstrap volume, but audit it — teachers hallucinate confidently. Mix synthetic with human-verified gold rows; cap synthetic share unless you measure no regression on human-only eval.

Training workflow and hyperparameters

A typical LoRA/QLoRA pipeline:

Choose a base model sized for your latency and quality bar (7B for many apps, 70B only if you can afford inference).
Tokenize dataset; pack or pad to a fixed max_seq_length (2048–4096 common).
Set LoRA rank r (8–64), alpha (often 2× rank), target modules (usually q_proj, v_proj, sometimes all linear).
Learning rate ~1e-4 to 3e-4 for LoRA; fewer epochs than full fine-tuning (1–3 passes often enough).
Watch eval loss on a held-out set; stop early when it flatlines or train loss diverges from eval (overfitting).

Log everything: seed, base model revision, dataset hash, hyperparameters, and eval scores. Without reproducibility you cannot tell whether a regression came from data, code, or the base model vendor silently updating weights. Merge LoRA into base for simplest deployment, or serve adapter sidecars if you multitenant per-customer tunings.

Evaluation beyond loss curves

Training loss going down does not mean users are happier. Build an eval set of real prompts with human-graded or rule-checked expected outputs:

Task accuracy — exact match, JSON schema validity, or fuzzy match on key fields.
LLM-as-judge — a stronger model scores rubric dimensions (helpfulness, tone); useful at scale but calibrate against human labels.
Regression suite — prompts where the base model already worked; fine-tune must not break them.
Safety probes — jailbreak attempts, PII extraction, tool abuse scenarios.

Compare base + prompt, base + RAG, and fine-tuned variants on the same eval set. A 2% gain on your metric at 10× inference cost is a product decision, not an ML win. Track latency and token usage — fine-tunes that shorten required prompts can pay for themselves.

Deployment, cost, and maintenance

Serving a fine-tuned model means either hosting weights yourself (vLLM, TGI, Ollama) or uploading to a managed fine-tuning API (OpenAI, Together, Fireworks, etc.). Self-host gives control and predictable unit economics at scale; managed APIs reduce ops but charge per token and may restrict custom architectures.

Budget for retraining: product copy changes, new tools, or base model upgrades (Llama 3 → 3.1) can obsolete an adapter. Version adapters alongside your app; keep a rollback path. Monitor production drift — rising thumbs-down rate or tool error rate often means data distribution shifted, not that users got worse overnight.

Common mistakes

Fine-tuning to memorize facts — use RAG; weights are a bad database.
Too few epochs on too little data — overfits; model parrots training rows on novel inputs.
Mismatched chat template — training with ChatML but inferring with a different special-token layout breaks silently.
Skipping regression eval — model gets better at your task and worse at everything else.
Ignoring inference cost — a merged 70B fine-tune nobody can afford to run is shelfware.
No safety pass after tuning — SFT on user-generated data can amplify toxicity unless filtered.

Key takeaways

Fine-tuning shapes behavior; RAG supplies facts; prompting is the fastest experiment loop.
LoRA/QLoRA is the practical default — full fine-tuning only when adapters are not enough.
Dataset quality and format alignment matter more than exotic hyperparameter sweeps.
Evaluate on held-out real prompts, including regressions and safety probes, not just loss.
Plan for adapter versioning, base model upgrades, and production drift monitoring.
Combine with RAG and injection defenses for agents that touch live data or tools.