Guide

DoRA explained: weight-decomposed low-rank adaptation

Harbor Legal’s contract-QA assistant needed to classify indemnity clauses into six liability buckets without drifting into generic chatbot tone. Engineers fine-tuned a 7B transformer with rank-16 LoRA adapters on all attention projections: 42 million trainable parameters, three epochs on 18,000 labeled clauses. Held-out accuracy plateaued at 74%. Errors clustered on “near-miss” clauses where the model confused cap-and-collar language with uncapped indemnity — cases where the semantic direction was right but activation magnitudes were too weak to flip the logits. Swapping LoRA for DoRA (Weight-Decomposed Low-Rank Adaptation) at the same rank and training budget lifted accuracy to 86% and cut false-positive uncapped predictions by 31%. No extra GPU memory, no longer training time.

DoRA keeps LoRA’s parameter efficiency but changes how the adapter modifies each weight matrix: instead of adding a single low-rank delta to frozen weights, it decomposes the effective weight into a learnable magnitude vector and a low-rank direction update. That split mirrors how full fine-tuning independently rescales and rotates feature channels — which LoRA approximates less faithfully. This guide covers the decomposition math, training and hyperparameter choices, QDoRA quantized training, the Harbor Legal refactor, a technique decision table versus LoRA and full fine-tuning, common pitfalls, and a production checklist.

What DoRA changes in the forward pass

Standard LoRA freezes a pretrained weight matrix W and learns a low-rank product BA so the effective weight is W' = W + (alpha/r) · BA. Magnitude and direction are entangled: scaling the low-rank update changes both how strongly and in which subspace the layer shifts.

DoRA instead writes the effective weight as:

W' = m ⊙ normalize(W + BA)

where m is a per-output-channel magnitude vector (same shape as W’s row dimension), BA is the usual LoRA direction term, and normalize divides each row by its L2 norm (with a small epsilon). At initialization, m is set to the row norms of the frozen W, so the forward pass matches the base model before training. Gradients then flow separately into magnitude rescaling and directional rotation.

Intuition: many classification and extraction tasks need selective amplification of existing features (magnitude) more than wholesale rotation into a new subspace (direction). LoRA can only do both at once through BA; DoRA unbundles them with one extra vector per output channel — typically adding only ~0.01–0.1% more trainable parameters than equivalent LoRA.

Training setup: rank, targets, and QDoRA

DoRA drops into the same PEFT-style training stacks as LoRA. Hyperparameters carry over with minor tuning notes:

  • Rank (r) — same guidance as LoRA: start at 8–16 for classification and extraction; raise to 32–64 only when underfitting persists after more data or epochs.
  • Alpha — scaling on the direction term; DoRA papers often use alpha = 2r. Harbor Legal used r=16, alpha=32.
  • Target modulesq_proj, k_proj, v_proj, o_proj for attention; add gate_proj, up_proj, down_proj for MLP when style or reasoning shifts need more capacity.
  • Learning rate — slightly lower than LoRA on the same task is common (Harbor used 1.5e-4 vs 2e-4 for LoRA) because magnitude gradients can be sharper early in training.
  • Dropout — apply on the direction matrices as with LoRA; magnitude vectors typically use no dropout.

QDoRA combines DoRA with 4-bit quantized base weights (NF4) during training, analogous to QLoRA. Memory footprint matches QLoRA; Harbor trained on a single A10G with gradient checkpointing. Inference still benefits from merging or serving adapters through multi-LoRA serving stacks that support DoRA modules.

Harbor Legal clause-QA refactor

The baseline LoRA adapter learned correct semantic neighborhoods but under-confident logits on liability-cap edge cases. DoRA’s magnitude vector learned to up-weight attention channels that fire on numeric caps, carve-outs, and “except as provided” qualifiers without rotating unrelated syntax channels as aggressively.

Metric Rank-16 LoRA Rank-16 DoRA
Held-out clause accuracy 74.1% 86.3%
False-positive uncapped indemnity 18.2% 12.5%
Trainable parameters 41.9M 42.7M (+1.9%)
Training wall time (3 epochs) 4h 12m 4h 19m
Peak VRAM (QDoRA, batch 8) 19.4 GB 19.6 GB

Qualitative review: DoRA preserved base-model fluency on out-of-domain boilerplate better than a rank-32 LoRA trial that had matched DoRA accuracy but introduced stiff phrasing — supporting the claim that magnitude-direction separation reduces collateral damage to general capabilities.

When DoRA beats LoRA — and when it does not

Technique Best for Trade-offs Skip when
DoRA Classification, extraction, structured output where logit calibration matters; tasks needing selective feature amplification ~1–3% more params than LoRA; slightly more complex merge math; newer framework support Framework lacks DoRA; trivial style tweaks where LoRA already saturates
LoRA General-purpose adapters, multi-tenant serving with mature tooling, chat tone shifts May underperform DoRA on fine-grained discrimination at equal rank You need maximum ecosystem compatibility and DoRA gains are untested on your task
Full fine-tuning Large domain shifts, new languages, multimodal heads GPU cost, catastrophic forgetting risk, slow iteration Budget and data favor PEFT; base model already strong
Prompt / ICL only Rapid prototyping, low-data regimes, no training infra Higher inference cost, context limits, less reliable formatting You have labeled data and need consistent production behavior

Deployment: merge, sidecar, and evaluation

Merging DoRA into base weights for single-adapter production follows the same pattern as LoRA merge in Hugging Face PEFT: compute W' per layer and bake into a new checkpoint. Magnitude vectors do not ship separately after merge.

For multi-tenant gateways, verify your inference engine lists DoRA among supported PEFT types before relying on dynamic adapter hot-swap. Evaluation should include:

  • Task metrics — accuracy, F1, or field-level exact match on held-out sets stratified by difficulty.
  • Calibration — expected calibration error (ECE) when thresholds drive automation; DoRA often improves this versus LoRA at equal rank.
  • General capability regression — sample prompts from unrelated domains to catch overfitting or tone collapse.
  • Latency — merged weights add zero inference overhead; sidecar adapters match LoRA latency.

Pair task evals with LLM-as-judge rubrics only for generative quality slices; classification tasks should rely on gold labels first.

Common pitfalls

  • Assuming DoRA always beats LoRA — on broad chat-style SFT with abundant data, gains may be within noise; A/B at fixed rank before committing.
  • Reusing LoRA learning rates blindly — magnitude gradients can destabilize early training; reduce LR 20–30% or warm up magnitudes over the first 5% of steps.
  • Ignoring merge compatibility — older PEFT versions merge DoRA incorrectly; pin versions and verify merged output matches sidecar forward pass on golden inputs.
  • Rank chasing instead of decomposition — doubling LoRA rank sometimes matches DoRA; compare fairly at equal parameter budgets.
  • Quantization surprises at inference — merging DoRA into a model then quantizing to INT4 can clip magnitude rescaling; quantize-after-merge with calibration data from the target domain.
  • Skipping row-norm epsilon — near-zero rows in sparse or pruned layers can explode normalization; use stable epsilons (1e-6 or framework default).
  • Multi-adapter stacking — composing two DoRA adapters on the same layer is less tested than LoRA; prefer single merged adapters per task.

Production checklist

  • Confirm PEFT / training framework version supports DoRA on your base model family.
  • Run rank-matched LoRA baseline on the same data split before adopting DoRA.
  • Set r, alpha, target modules, and LR; enable QDoRA if VRAM-bound.
  • Initialize magnitude vectors from frozen row norms; verify step-0 loss matches base model.
  • Track task accuracy, calibration, and out-of-domain regression each epoch.
  • Compare trainable parameter count and wall time against LoRA at equal rank.
  • Merge adapters for single-tenant deploy; validate merged vs sidecar logits on golden set.
  • For multi-tenant serving, confirm inference engine DoRA support and adapter registry paths.
  • Run post-merge quantization calibration if deploying INT4/INT8 weights.
  • Document adapter version, training data hash, and eval metrics in your model registry.
  • Plan refresh cadence when base model upgrades ship; DoRA adapters are not portable across unrelated checkpoints.

Key takeaways

  • DoRA decomposes weight updates into learnable magnitude and low-rank direction terms.
  • It adds minimal parameters over LoRA while often improving classification and extraction accuracy.
  • QDoRA enables single-GPU training of 7B-class models with the same memory profile as QLoRA.
  • Harbor Legal lifted clause-QA accuracy from 74% to 86% at rank 16 without extra VRAM.
  • Always benchmark against rank-matched LoRA before assuming DoRA wins on generative chat tasks.
  • Verify merge and serving stack compatibility before production rollout.

Related reading