Guide
LoRA fine-tuning explained
Full fine-tuning a 7B-parameter language model means updating billions of weights, renting multi-GPU nodes, and shipping a new 14 GB checkpoint every time you tweak support tone or JSON output format. LoRA (Low-Rank Adaptation) takes a different path: freeze the pretrained transformer backbone and train only small adapter matrices inserted alongside existing linear layers. A rank-16 LoRA on Llama-class models typically trains fewer than 1% as many parameters as the base model holds, yet captures most task-specific behavior. Pair that with QLoRA, which keeps the base weights 4-bit quantized during training, and a single 24 GB GPU can fine-tune models that would never fit under full fine-tuning. This guide explains the low-rank math, hyperparameter choices (rank, alpha, target modules), training and evaluation workflow, merging versus sidecar deployment, a Harbor Support tone-adapter worked example, a method decision table, common pitfalls, and a production checklist, alongside our broader LLM fine-tuning overview and quantization guide.
What LoRA changes — and what stays frozen
In a standard linear layer, a weight matrix W maps an input vector
x to an output h = Wx. During full fine-tuning, every
element of W receives gradient updates. LoRA hypothesizes that the
task-specific delta ΔW has low intrinsic
rank: most of the behavioral shift you want (tone, format, classification
boundary) can be expressed as a product of two thin matrices rather than a dense
update to all of W.
Concretely, LoRA replaces the forward pass with:
h = Wx + (alpha / r) · BAx
where B has shape (output_dim, r),
A has shape (r, input_dim), and r is the
rank — typically 4 to 64. The base W stays
frozen in fp16 or bf16; only A and B train. At
initialization, A uses a small random draw and B is zero,
so the adapter starts as a no-op and training gradually injects signal without
shocking the pretrained representation.
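The forward pass above can be sketched in a few lines of NumPy. This is a toy illustration, not a training implementation: the dimensions, the random seed, and the helper name `lora_forward` are assumptions chosen for readability, and `B_trained` simply stands in for whatever values training would produce.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy sizes chosen only for illustration; real layers are thousands wide.
d_in, d_out, r, alpha = 8, 6, 2, 4

W = rng.normal(size=(d_out, d_in))           # frozen pretrained weight
A = rng.normal(scale=0.01, size=(r, d_in))   # small random init
B = np.zeros((d_out, r))                     # zero init, so B @ A == 0 at step 0


def lora_forward(x, W, A, B, alpha, r):
    """h = W x + (alpha / r) * B (A x); only A and B would receive gradients."""
    return W @ x + (alpha / r) * (B @ (A @ x))


x = rng.normal(size=d_in)
h_init = lora_forward(x, W, A, B, alpha, r)

# Stand-in for a trained adapter: B has moved away from zero.
B_trained = rng.normal(scale=0.1, size=(d_out, r))
h_trained = lora_forward(x, W, A, B_trained, alpha, r)

print("adapter is a no-op at init:", np.allclose(h_init, W @ x))
print("adapter changes output after training:", not np.allclose(h_trained, W @ x))
```

Computing `B @ (A @ x)` rather than `(B @ A) @ x` keeps the intermediate at size r, which is why the extra cost at runtime stays small.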
Why low rank works for LLMs
Attention and MLP layers in large models already encode broad linguistic and reasoning priors. Adapting them to “always reply in Harbor Support voice” or “emit valid JSON matching schema X” is a comparatively low-dimensional shift in activation space. Empirically, ranks of 8–32 capture most SFT gains on 7B–13B models; pushing rank to 128 tends to help only on harder multi-skill blends or when you are distilling a much larger teacher. If eval loss plateaus early, improve data quality before raising rank.
Hyperparameters: rank, alpha, and target modules
Rank (r)
Higher rank increases adapter capacity and parameter count linearly. Rule of thumb:
start at r=16 for single-task SFT (classification, tone, format);
use r=32 or r=64 when blending multiple behaviors or
fine-tuning on code with diverse repositories. Monitor train versus eval loss —
if train drops but eval rises, rank may be too high for your dataset size (classic
overfitting).
Alpha scaling
The scalar alpha scales the adapter contribution relative to the frozen
weights. Most recipes set alpha = 2 × r (e.g., alpha 32 when
r=16). When you merge the adapter into base weights for deployment,
the adapter is folded in at scale alpha / r, so merging with an alpha or rank that
differs from training is a common source of “the merged model looks nothing like eval.”
Which layers to target
Minimum viable targets: q_proj and v_proj in each attention
block. Stronger default: all four attention projections
(q_proj, k_proj, v_proj, o_proj).
For format-heavy or reasoning tasks, add the MLP projections
(gate_proj, up_proj, down_proj). On Llama-style blocks this more than
doubles trainable params but often improves JSON and tool-call reliability. Embedding
and LM head LoRA is rarely needed unless you are adding a large volume of domain
tokens (chemical names, on-chain addresses) that the tokenizer represents poorly.
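To see how rank and target choice drive adapter size, the sketch below counts trainable parameters for three target sets. The layer shapes are an assumption (Llama-2-7B-style: hidden size 4096, MLP width 11008, 32 blocks, no grouped-query attention) and do not come from this guide; substitute the shapes of your own base model.

```python
# Assumed Llama-2-7B-style shapes as (in_features, out_features).
# These are illustrative assumptions, not values stated in the guide.
HIDDEN, INTERMEDIATE, NUM_LAYERS = 4096, 11008, 32
SHAPES = {
    "q_proj": (HIDDEN, HIDDEN),
    "k_proj": (HIDDEN, HIDDEN),
    "v_proj": (HIDDEN, HIDDEN),
    "o_proj": (HIDDEN, HIDDEN),
    "gate_proj": (HIDDEN, INTERMEDIATE),
    "up_proj": (HIDDEN, INTERMEDIATE),
    "down_proj": (INTERMEDIATE, HIDDEN),
}

ATTN = ["q_proj", "k_proj", "v_proj", "o_proj"]
MLP = ["gate_proj", "up_proj", "down_proj"]


def lora_params(targets, r):
    """A is (r, in) and B is (out, r), so each targeted layer adds r * (in + out)."""
    per_block = sum(r * (SHAPES[name][0] + SHAPES[name][1]) for name in targets)
    return per_block * NUM_LAYERS


for label, targets in [
    ("q+v", ["q_proj", "v_proj"]),
    ("attention", ATTN),
    ("attention+MLP", ATTN + MLP),
]:
    n = lora_params(targets, r=16)
    # 2 bytes per parameter when the adapter is stored in bf16/fp16.
    print(f"{label:14s} r=16: {n / 1e6:5.1f}M params, ~{n * 2 / 1e6:.0f} MB in bf16")
```

Under these assumed shapes, parameter count scales linearly with r, and adding the MLP projections to the four attention projections raises the r=16 total from about 16.8M to about 40.0M.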
Dropout and learning rate
LoRA dropout of 0.05–0.1 regularizes small datasets. Learning rates of 1e-4 to 3e-4 work for most 7B runs; use 5e-5 for 70B or when continuing from an already-fine-tuned base. Plan for fewer epochs than full fine-tuning, often one to three passes, because adapters converge quickly and overfit fast on datasets under 1k rows.
QLoRA: fine-tuning when VRAM is the bottleneck
QLoRA stores the frozen base model in 4-bit NormalFloat (NF4) quantization while keeping LoRA adapters in bf16/fp16. During each forward and backward pass the 4-bit weights are dequantized on the fly to the compute dtype; gradients backpropagate through those frozen weights to reach the adapters, and only adapter parameters update. Memory savings are dramatic: a 7B QLoRA run often fits in 16–24 GB versus 40+ GB for fp16 LoRA on the same model.
Trade-offs: slight quality loss versus fp16 LoRA, especially on tasks needing precise numerics or rare-token spelling. Always compare merged QLoRA output against an fp16 LoRA baseline on your eval set before declaring victory on cost savings. For inference, merge the adapters into a 16-bit copy of the base model and quantize the result for deployment (GPTQ, AWQ) rather than serving NF4 training weights directly, unless your stack supports bitsandbytes inference natively.
Double quantization
QLoRA optionally quantizes the quantization constants themselves (“double quant”), shaving another few hundred megabytes with negligible quality impact. Enable it when you are VRAM-constrained on 24 GB cards training 13B models.
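A back-of-the-envelope calculation makes the savings concrete. The sketch below counts weight storage only (no activations, adapters, or optimizer state) and assumes the block sizes reported in the QLoRA paper: one fp32 constant per 64 weights, with double quantization storing those constants in 8 bits plus one fp32 constant per 256 of them. The helper name `weight_gb` is illustrative.

```python
def weight_gb(n_params, bits_per_param):
    """Memory for the weights alone, in GB (10**9 bytes)."""
    return n_params * bits_per_param / 8 / 1e9


# Assumed block sizes (from the QLoRA paper, not from this guide).
BLOCK, BLOCK2 = 64, 256
overhead_single = 32 / BLOCK                          # 0.5 bits per parameter
overhead_double = 8 / BLOCK + 32 / (BLOCK * BLOCK2)   # about 0.127 bits per parameter

for n in (7e9, 13e9):
    bf16 = weight_gb(n, 16)
    nf4 = weight_gb(n, 4 + overhead_single)
    nf4_dq = weight_gb(n, 4 + overhead_double)
    saved_mb = 1000 * (nf4 - nf4_dq)
    print(
        f"{n / 1e9:.0f}B: bf16 {bf16:.1f} GB | NF4 {nf4:.2f} GB | "
        f"NF4 + double quant {nf4_dq:.2f} GB (saves {saved_mb:.0f} MB)"
    )
```

The remaining VRAM in a real run goes to activations, adapter weights, gradients, and optimizer state, which depend on sequence length and batch size, so treat these figures as a floor rather than a budget.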
Training workflow end to end
- Pick a base model — instruction-tuned chat models (Llama-Instruct, Mistral-Instruct, Qwen-Instruct) outperform raw base models for dialogue SFT.
- Format data to match inference — same chat template, special tokens, and tool-call blocks you will use in production. Mismatch here is the #1 silent failure mode.
- Tokenize and pack — set max_seq_length to cover 95th percentile example length; mask loss on user turns if you only want the assistant to learn.
- Configure PEFT — libraries like Hugging Face PEFT, Axolotl, and Unsloth wrap rank, alpha, targets, and dropout in a few lines.
- Train with eval checkpoints — save every N steps; pick the checkpoint with best held-out loss, not the final step.
- Run regression evals — exact-match on structured outputs, LLM-judge on tone, and a small human gold set for high-stakes replies.
- Export — adapter-only safetensors (typically tens of MB at these ranks) plus metadata JSON listing base model revision, rank, alpha, and targets.
Log dataset hash, seed, base model commit, and hyperparameters. Without reproducibility you cannot tell whether a regression came from data drift or a vendor silently updating the base weights on Hugging Face.
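The tokenize-and-pack step is where loss masking and sequence length get decided. The sketch below shows one common way to do both, assuming a PyTorch-style trainer that skips label value -100 (the default `ignore_index` of `CrossEntropyLoss`). The token IDs, role names, and helper names are made up for illustration; real pipelines take token IDs from the tokenizer's chat template.

```python
import math

IGNORE_INDEX = -100  # label value skipped by PyTorch's CrossEntropyLoss by default


def build_labels(turns):
    """turns: list of (role, token_ids).

    Returns (input_ids, labels) where every non-assistant token is masked,
    so only assistant tokens contribute to the loss.
    """
    input_ids, labels = [], []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            labels.extend(token_ids)
        else:
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


def percentile_length(lengths, pct=95):
    """Nearest-rank percentile of example lengths, a candidate for max_seq_length."""
    ordered = sorted(lengths)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up token IDs standing in for a tokenized conversation.
example = [("system", [1, 2]), ("user", [3, 4, 5]), ("assistant", [6, 7])]
input_ids, labels = build_labels(example)
print(input_ids)  # [1, 2, 3, 4, 5, 6, 7]
print(labels)     # [-100, -100, -100, -100, -100, 6, 7]
```

Most trainers shift labels by one position internally for next-token prediction, so the mask is applied to the unshifted sequence as shown.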
Deployment: merge, sidecar, or multi-adapter
Merged weights
Bake ΔW = (alpha/r) BA into each targeted linear layer and save one
model directory. Simplest for single-tenant deployments and fastest inference (no
extra matmul at runtime). Downside: every adapter variant needs a full model copy
on disk and in GPU memory.
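A small NumPy check illustrates why merging is exact when the training-time alpha and rank are used, and how the output drifts when they are not. The sizes, seed, and the `merge` helper are illustrative assumptions; in practice a PEFT library performs this fold for every targeted layer.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 4  # toy sizes for illustration only

W = rng.normal(size=(d_out, d_in))  # frozen base weight
A = rng.normal(size=(r, d_in))      # stands in for a trained adapter
B = rng.normal(size=(d_out, r))
x = rng.normal(size=d_in)


def merge(W, A, B, alpha, r):
    """Fold the adapter into the base weight: W' = W + (alpha / r) * B A."""
    return W + (alpha / r) * (B @ A)


h_adapter = W @ x + (alpha / r) * (B @ (A @ x))  # unmerged forward pass
h_merged = merge(W, A, B, alpha, r) @ x          # single matmul after merging
h_wrong = merge(W, A, B, alpha=16, r=r) @ x      # alpha that was not used in training

print("max |merged - adapter|     :", np.abs(h_merged - h_adapter).max())
print("max |wrong alpha - adapter|:", np.abs(h_wrong - h_adapter).max())
```

The first difference is at floating-point noise level, while the second is not, which is the failure described as merging with the wrong alpha.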
Sidecar adapters
Load one shared base and hot-swap LoRA files per tenant, locale, or task. vLLM, TGI, and llama.cpp support multiple LoRA modules with modest overhead. Ideal for SaaS where thousands of customers each have a 20 MB tone adapter on the same 7B base.
Stacking and composition
Some frameworks allow loading multiple adapters (e.g., base + safety + domain). Order and scaling matter; test combined behavior — two benign adapters can interact unpredictably. Prefer a single merged adapter trained on blended data when interactions are hard to predict.
Worked example: Harbor Support tone adapter
Harbor Support routes 40,000 monthly tickets through a 7B instruct model. The base model answers correctly but sounds generic — no empathy markers, inconsistent escalation phrasing, and occasional markdown tables when the CRM expects plain text.
The team collects 2,400 human-reviewed ticket–reply pairs from top agents,
deduplicated and balanced across issue types. They train QLoRA with
r=16, alpha=32, targets on all attention projections plus
MLP layers, max_seq_length=4096, two epochs, lr=2e-4. Eval: 200 held-out
tickets scored by agents blind to model variant.
Results: tone adherence rises from 62% to 91% on the human rubric; factual accuracy (checked against knowledge-base citations via RAG) is unchanged because LoRA did not replace retrieval — it only shaped how retrieved facts are phrased. They deploy a sidecar adapter in vLLM so EU and US tenants can load locale-specific LoRA files on the same base without duplicating 14 GB weights. Total training cost: one A10G for six hours versus an estimated eight A100-hours for full fine-tuning that risked catastrophic forgetting on general reasoning benchmarks.
Method decision table
| Approach | Trainable params | Typical VRAM (7B) | Best for |
|---|---|---|---|
| Prompting only | 0 | Inference only | Quick experiments, stable base model already capable |
| RAG | 0 | Inference + index | Fresh or private facts; cite sources |
| LoRA / QLoRA | 0.1–3% of base | 16–40 GB | Tone, format, tool habits, stable behavioral patterns |
| Full fine-tuning | 100% | Multi-GPU clusters | Deep capability shifts when LoRA plateaus and budget allows |
| DPO / RLHF on LoRA | Adapter only | Similar to LoRA SFT | Preference alignment after SFT — see RLHF guide |
Common pitfalls
- Chat template mismatch — training with one tokenizer template and inferring with another produces gibberish or empty replies.
- Computing loss on user turns — if loss includes user tokens the model should never generate, quality suffers.
- Chasing rank before data — 500 noisy rows at r=128 overfit; 800 clean rows at r=16 often win.
- Expecting LoRA to inject facts — weights memorize slowly and go stale; pair with RAG for knowledge.
- Skipping base-model version pins — adapters are not portable across arbitrary base revisions.
- Merging with wrong alpha — deployment model diverges from the checkpoint you evaluated.
- No adversarial eval — tone adapters can be jailbroken; combine with prompt-injection defenses.
- Evaluating only loss — cross-entropy can improve while human-rated helpfulness drops.
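For structured outputs, a task-level score is a useful complement to loss. The sketch below reports how often model replies parse as JSON, contain the required keys, and exactly match a reference. The function name, the key set, and the sample replies are hypothetical; tone and helpfulness still need a rubric or human review.

```python
import json


def score_json_outputs(predictions, references, required_keys):
    """Return parse rate, schema rate (all required keys present), and exact-match rate."""
    parsed = schema_ok = exact = 0
    for pred_text, ref in zip(predictions, references):
        try:
            pred = json.loads(pred_text)
        except json.JSONDecodeError:
            continue  # unparseable replies count against every metric
        parsed += 1
        if isinstance(pred, dict) and set(required_keys).issubset(pred):
            schema_ok += 1
        if pred == ref:
            exact += 1
    n = len(references)
    return {
        "parse_rate": parsed / n,
        "schema_rate": schema_ok / n,
        "exact_match": exact / n,
    }


# Hypothetical model replies and gold references.
predictions = [
    '{"intent": "refund", "priority": "high"}',
    '{"intent": "refund"}',
    "Sure! Here is the JSON you asked for.",
]
references = [
    {"intent": "refund", "priority": "high"},
    {"intent": "refund", "priority": "low"},
    {"intent": "billing", "priority": "low"},
]
scores = score_json_outputs(predictions, references, {"intent", "priority"})
print(scores)
```

Tracking these three rates per checkpoint alongside eval loss makes it easier to spot a checkpoint whose loss improved while its outputs stopped parsing.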
Production checklist
- Confirm base model revision hash matches training metadata.
- Match chat template, special tokens, and tool schemas between train and serve.
- Start LoRA at r=16; increase rank only with eval evidence.
- Hold out 10–15% human-reviewed eval rows; never tune on them.
- Compare QLoRA vs fp16 LoRA on your hardest eval slice before cutting VRAM corners.
- Log dataset version, hyperparameters, and checkpoint step for every deploy.
- Run regression suite on general capabilities (math, coding) to catch forgetting.
- Choose merge vs sidecar based on tenant count and adapter churn.
- Pair behavioral LoRA with RAG when answers need current facts.
- Document rollback: keep previous adapter and base pair for one release cycle.
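The first two checklist items can be automated with a pre-deploy check that compares the adapter's training metadata with the serving configuration. The sketch below is one possible shape for that check; every field name and value is a hypothetical placeholder, not a format defined by any library.

```python
import json

# Hypothetical metadata written next to the adapter at export time.
# Field names and values are placeholders for illustration.
TRAINING_METADATA = json.dumps({
    "base_model": "example-org/example-7b-instruct",
    "base_revision": "abc123",
    "rank": 16,
    "alpha": 32,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "chat_template_sha256": "template-hash-v1",
})

CHECKED_KEYS = ("base_model", "base_revision", "chat_template_sha256")


def find_mismatches(metadata_json, serving_config):
    """Return human-readable mismatches; an empty list means train and serve agree."""
    meta = json.loads(metadata_json)
    problems = []
    for key in CHECKED_KEYS:
        trained, serving = meta.get(key), serving_config.get(key)
        if trained != serving:
            problems.append(f"{key}: trained with {trained!r}, serving {serving!r}")
    return problems


matching = {
    "base_model": "example-org/example-7b-instruct",
    "base_revision": "abc123",
    "chat_template_sha256": "template-hash-v1",
}
drifted = dict(matching, base_revision="def456")

print(find_mismatches(TRAINING_METADATA, matching))  # []
print(find_mismatches(TRAINING_METADATA, drifted))
```

Failing the deploy when the list is non-empty turns a silent base-model update into a visible error.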
Key takeaways
- LoRA trains thin adapter matrices B and A while freezing pretrained weights W.
- Rank and alpha control capacity and scaling; target attention plus MLP for format-heavy tasks.
- QLoRA makes 7B SFT feasible on a single consumer GPU with modest quality trade-offs.
- Deploy merged weights for simplicity or sidecar adapters for multi-tenant SaaS.
- LoRA shapes behavior; RAG and prompting still own facts and quick iteration.
Related reading
- LLM fine-tuning explained — when to train versus prompt or retrieve, data prep, and evaluation framework
- Transformer architecture explained — attention blocks where LoRA adapters attach
- LLM quantization and inference explained — NF4 training versus GPTQ/AWQ deployment
- RLHF explained — preference tuning on top of LoRA SFT checkpoints