Guide
DoRA explained: weight-decomposed low-rank adaptation
Harbor Legal’s contract-QA assistant needed to classify indemnity clauses into six liability buckets without drifting into generic chatbot tone. Engineers fine-tuned a 7B transformer with rank-16 LoRA adapters on all attention projections: 42 million trainable parameters, three epochs on 18,000 labeled clauses. Held-out accuracy plateaued at 74%. Errors clustered on “near-miss” clauses where the model confused cap-and-collar language with uncapped indemnity — cases where the semantic direction was right but activation magnitudes were too weak to flip the logits. Swapping LoRA for DoRA (Weight-Decomposed Low-Rank Adaptation) at the same rank and training budget lifted accuracy to 86% and cut false-positive uncapped predictions by 31%. No extra GPU memory, no longer training time.
DoRA keeps LoRA’s parameter efficiency but changes how the adapter modifies each weight matrix: instead of adding a single low-rank delta to frozen weights, it decomposes the effective weight into a learnable magnitude vector and a low-rank direction update. That split mirrors how full fine-tuning independently rescales and rotates feature channels — which LoRA approximates less faithfully. This guide covers the decomposition math, training and hyperparameter choices, QDoRA quantized training, the Harbor Legal refactor, a technique decision table versus LoRA and full fine-tuning, common pitfalls, and a production checklist.
What DoRA changes in the forward pass
Standard LoRA freezes a pretrained weight matrix W and learns a
low-rank product BA so the effective weight is
W' = W + (alpha/r) · BA. Magnitude and direction are
entangled: scaling the low-rank update changes both how strongly and in which
subspace the layer shifts.
DoRA instead writes the effective weight as:
W' = m ⊙ normalize(W + BA)
where m is a per-output-channel magnitude vector (same shape as
W’s row dimension), BA is the usual LoRA direction
term, and normalize divides each row by its L2 norm (with a small
epsilon). At initialization, m is set to the row norms of the frozen
W, so the forward pass matches the base model before training. Gradients
then flow separately into magnitude rescaling and directional rotation.
Intuition: many classification and extraction tasks need selective
amplification of existing features (magnitude) more than wholesale
rotation into a new subspace (direction). LoRA can only do both at once through
BA; DoRA unbundles them with one extra vector per output channel
— typically adding only ~0.01–0.1% more trainable parameters than
equivalent LoRA.
Training setup: rank, targets, and QDoRA
DoRA drops into the same PEFT-style training stacks as LoRA. Hyperparameters carry over with minor tuning notes:
- Rank (
r) — same guidance as LoRA: start at 8–16 for classification and extraction; raise to 32–64 only when underfitting persists after more data or epochs. - Alpha — scaling on the direction term; DoRA papers often
use
alpha = 2r. Harbor Legal usedr=16,alpha=32. - Target modules —
q_proj,k_proj,v_proj,o_projfor attention; addgate_proj,up_proj,down_projfor MLP when style or reasoning shifts need more capacity. - Learning rate — slightly lower than LoRA on the same task is common (Harbor used 1.5e-4 vs 2e-4 for LoRA) because magnitude gradients can be sharper early in training.
- Dropout — apply on the direction matrices as with LoRA; magnitude vectors typically use no dropout.
QDoRA combines DoRA with 4-bit quantized base weights (NF4) during training, analogous to QLoRA. Memory footprint matches QLoRA; Harbor trained on a single A10G with gradient checkpointing. Inference still benefits from merging or serving adapters through multi-LoRA serving stacks that support DoRA modules.
Harbor Legal clause-QA refactor
The baseline LoRA adapter learned correct semantic neighborhoods but under-confident logits on liability-cap edge cases. DoRA’s magnitude vector learned to up-weight attention channels that fire on numeric caps, carve-outs, and “except as provided” qualifiers without rotating unrelated syntax channels as aggressively.
| Metric | Rank-16 LoRA | Rank-16 DoRA |
|---|---|---|
| Held-out clause accuracy | 74.1% | 86.3% |
| False-positive uncapped indemnity | 18.2% | 12.5% |
| Trainable parameters | 41.9M | 42.7M (+1.9%) |
| Training wall time (3 epochs) | 4h 12m | 4h 19m |
| Peak VRAM (QDoRA, batch 8) | 19.4 GB | 19.6 GB |
Qualitative review: DoRA preserved base-model fluency on out-of-domain boilerplate better than a rank-32 LoRA trial that had matched DoRA accuracy but introduced stiff phrasing — supporting the claim that magnitude-direction separation reduces collateral damage to general capabilities.
When DoRA beats LoRA — and when it does not
| Technique | Best for | Trade-offs | Skip when |
|---|---|---|---|
| DoRA | Classification, extraction, structured output where logit calibration matters; tasks needing selective feature amplification | ~1–3% more params than LoRA; slightly more complex merge math; newer framework support | Framework lacks DoRA; trivial style tweaks where LoRA already saturates |
| LoRA | General-purpose adapters, multi-tenant serving with mature tooling, chat tone shifts | May underperform DoRA on fine-grained discrimination at equal rank | You need maximum ecosystem compatibility and DoRA gains are untested on your task |
| Full fine-tuning | Large domain shifts, new languages, multimodal heads | GPU cost, catastrophic forgetting risk, slow iteration | Budget and data favor PEFT; base model already strong |
| Prompt / ICL only | Rapid prototyping, low-data regimes, no training infra | Higher inference cost, context limits, less reliable formatting | You have labeled data and need consistent production behavior |
Deployment: merge, sidecar, and evaluation
Merging DoRA into base weights for single-adapter production follows the same pattern
as LoRA merge in Hugging Face PEFT: compute W' per layer and bake
into a new checkpoint. Magnitude vectors do not ship separately after merge.
For multi-tenant gateways, verify your inference engine lists DoRA among supported PEFT types before relying on dynamic adapter hot-swap. Evaluation should include:
- Task metrics — accuracy, F1, or field-level exact match on held-out sets stratified by difficulty.
- Calibration — expected calibration error (ECE) when thresholds drive automation; DoRA often improves this versus LoRA at equal rank.
- General capability regression — sample prompts from unrelated domains to catch overfitting or tone collapse.
- Latency — merged weights add zero inference overhead; sidecar adapters match LoRA latency.
Pair task evals with LLM-as-judge rubrics only for generative quality slices; classification tasks should rely on gold labels first.
Common pitfalls
- Assuming DoRA always beats LoRA — on broad chat-style SFT with abundant data, gains may be within noise; A/B at fixed rank before committing.
- Reusing LoRA learning rates blindly — magnitude gradients can destabilize early training; reduce LR 20–30% or warm up magnitudes over the first 5% of steps.
- Ignoring merge compatibility — older PEFT versions merge DoRA incorrectly; pin versions and verify merged output matches sidecar forward pass on golden inputs.
- Rank chasing instead of decomposition — doubling LoRA rank sometimes matches DoRA; compare fairly at equal parameter budgets.
- Quantization surprises at inference — merging DoRA into a model then quantizing to INT4 can clip magnitude rescaling; quantize-after-merge with calibration data from the target domain.
- Skipping row-norm epsilon — near-zero rows in sparse or pruned layers can explode normalization; use stable epsilons (1e-6 or framework default).
- Multi-adapter stacking — composing two DoRA adapters on the same layer is less tested than LoRA; prefer single merged adapters per task.
Production checklist
- Confirm PEFT / training framework version supports DoRA on your base model family.
- Run rank-matched LoRA baseline on the same data split before adopting DoRA.
- Set
r,alpha, target modules, and LR; enable QDoRA if VRAM-bound. - Initialize magnitude vectors from frozen row norms; verify step-0 loss matches base model.
- Track task accuracy, calibration, and out-of-domain regression each epoch.
- Compare trainable parameter count and wall time against LoRA at equal rank.
- Merge adapters for single-tenant deploy; validate merged vs sidecar logits on golden set.
- For multi-tenant serving, confirm inference engine DoRA support and adapter registry paths.
- Run post-merge quantization calibration if deploying INT4/INT8 weights.
- Document adapter version, training data hash, and eval metrics in your model registry.
- Plan refresh cadence when base model upgrades ship; DoRA adapters are not portable across unrelated checkpoints.
Key takeaways
- DoRA decomposes weight updates into learnable magnitude and low-rank direction terms.
- It adds minimal parameters over LoRA while often improving classification and extraction accuracy.
- QDoRA enables single-GPU training of 7B-class models with the same memory profile as QLoRA.
- Harbor Legal lifted clause-QA accuracy from 74% to 86% at rank 16 without extra VRAM.
- Always benchmark against rank-matched LoRA before assuming DoRA wins on generative chat tasks.
- Verify merge and serving stack compatibility before production rollout.
Related reading
- LoRA fine-tuning explained — low-rank adapters, QLoRA, and deployment patterns
- LLM fine-tuning explained — SFT workflow, data prep, and eval baselines
- LLM multi-LoRA serving explained — dynamic adapters and batch scheduling
- Instruction tuning (SFT) explained — demonstration data and supervised training