Guide

QLoRA explained: 4-bit quantized LoRA fine-tuning

Harbor Analytics needed a 13B-parameter model to translate natural-language warehouse questions into executable SQL against a 400-table schema. Full fine-tuning in bf16 would require roughly 52 GB just for weights, plus optimizer states — impossible on the team's single 40 GB A100. Standard LoRA cut trainable parameters to 0.8% but still loaded the frozen backbone in fp16, peaking at 31 GB before activations. QLoRA (Quantized LoRA) stored the frozen backbone in 4-bit NF4, trained bf16 adapter matrices on top, and used a paged optimizer so peak VRAM landed at 28 GB — leaving headroom for batch size 4 and gradient checkpointing. Execution accuracy on held-out queries rose from 71% (prompt-only) to 94%; the adapter checkpoint shipped at 47 MB instead of 26 GB.

QLoRA is not a different adapter architecture. It is a memory engineering stack around LoRA: 4-bit NormalFloat quantization for frozen weights, double quantization of scale constants, on-the-fly dequantization during the forward pass, and paged AdamW to stop optimizer state from spiking memory. This guide walks through each layer, training configuration, the Harbor Analytics SQL refactor, a technique decision table versus bf16 LoRA and full fine-tuning, common pitfalls, and a production checklist.

Where the memory goes without QLoRA

Fine-tuning a 13B model in bf16 loads roughly 26 GB of base weights. AdamW stores two fp32 states per trainable parameter — negligible for LoRA adapters (tens of millions of params) but irrelevant when the base is frozen. The real cost is keeping the full backbone resident in high precision while activations and gradients flow through attention blocks during the backward pass.

Inference-only 4-bit quantization (GPTQ, AWQ) shrinks deployment size but does not by itself enable training: you still need differentiable paths into the layers you update. QLoRA's insight is to quantize only the frozen backbone, keep LoRA matrices in bf16/fp16 where gradients are stable, and dequantize weight blocks to fp16 at compute time inside each linear layer forward pass.

Component bf16 LoRA (13B base) QLoRA (13B base)
Frozen base weights ~26 GB (bf16) ~6.5 GB (NF4 + scales)
LoRA adapter weights ~180 MB ~180 MB (bf16)
Optimizer states (adapters only) ~720 MB ~720 MB (paged)
Activations (batch 4, seq 2048, checkpointed) ~4–6 GB ~4–6 GB
Typical peak VRAM ~31 GB ~28 GB

The gap widens on 34B and 70B models where bf16 LoRA exceeds consumer and many cloud single-GPU limits entirely. QLoRA was the technique that first made 65B-class adapter training feasible on one 48 GB card.

NF4, double quantization, and the forward pass

4-bit NormalFloat (NF4) is a codebook tuned to the distribution of neural network weights (centered near zero with heavy tails). Each 64-weight block stores a single fp16 scale factor; weights index into 16 NF4 levels. That block-wise design limits outlier damage compared with naive min-max INT4.

Double quantization applies the same idea to the scale constants themselves: quantize fp16 block scales down to 8-bit and store a per-block scale-of-scale. On a 13B model this saves hundreds of megabytes — small relative to weights but meaningful at the margin when every gigabyte determines batch size.

During forward pass, each linear layer:

  1. Dequantizes the frozen weight block from NF4 to fp16/bf16 in registers.
  2. Computes h = Wx with the dequantized weight.
  3. Adds the LoRA path: h += (alpha/r) · BAx in bf16.
  4. Discards the dequantized copy; only NF4 storage persists in memory.

Gradients flow into LoRA matrices A and B only. Frozen quantized weights receive no optimizer updates. Backprop through the dequantized matmul is standard; the quantization itself is treated as a fixed lookup during training (no straight-through estimator on NF4 indices in the default bitsandbytes stack).

Paged optimizers and training configuration

Even with a quantized base, AdamW moment buffers spike when the optimizer allocates contiguous GPU tensors for adapter gradients. Paged AdamW (via bitsandbytes) pages optimizer state to CPU and copies blocks to GPU only during the update step — similar in spirit to OS virtual memory. Combined with gradient checkpointing, this is what lets Harbor run batch 4 at 2048 context on a single A100.

Practical hyperparameters for QLoRA training:

  • Quantization configload_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=bfloat16 (or fp16 on older GPUs).
  • LoRA rank — same guidance as standard LoRA: start r=16, alpha=32; raise rank only when underfitting persists after more data or epochs.
  • Target modules — attention projections (q_proj, k_proj, v_proj, o_proj) for most tasks; add MLP layers for heavy reasoning or format shifts.
  • Learning rate — 1e-4 to 2e-4 with cosine decay; Harbor used 1.8e-4 on SQL generation.
  • Sequence length — truncate or pack to the longest 95th-percentile example; QLoRA does not reduce activation memory from context length.

For even better discrimination on classification tasks, pair QLoRA training with DoRA (QDoRA) once the memory stack is stable — same NF4 backbone, different adapter geometry.

Harbor Analytics SQL adapter refactor

The baseline was Llama-2-13B with schema-in-context prompting: 71% of generated queries executed without error on a held-out set of 1,200 natural-language questions. Common failures: wrong join keys across fact tables, hallucinated column names, and date-filter off-by-one errors on fiscal calendars.

Harbor fine-tuned with QLoRA on 14,000 (question, gold SQL) pairs covering the warehouse schema. Three epochs, rank 16, all attention and MLP projections targeted, gradient checkpointing enabled.

Metric Prompt-only 13B QLoRA rank-16 adapter
Execution accuracy (held-out) 71.2% 94.1%
Exact SQL string match 38.4% 61.7%
Peak VRAM during training N/A (no training) 28.3 GB
Adapter checkpoint size 47 MB
Training wall time (3 epochs, 1× A100) 6h 48m

Production serving merged the adapter into a fp16 base for latency, then deployed through their existing multi-LoRA gateway as a tenant-specific sidecar for A/B against the prompt baseline. Merge added no inference overhead versus a native fp16 fine-tune.

When QLoRA wins — and when to skip it

Technique Best for Trade-offs Skip when
QLoRA 7B–70B adapter training on 1–2 GPUs; teams without multi-node clusters; rapid iteration on domain adapters Tiny quality gap vs bf16 LoRA on some generative tasks; bitsandbytes dependency; merge step required for lowest latency Model fits comfortably in bf16 LoRA with desired batch size; framework lacks 4-bit training hooks
bf16 LoRA 7B models on 24–48 GB GPUs where memory is not the bottleneck; maximum adapter fidelity 2–4× more base-weight memory than QLoRA VRAM already tight at 13B+ scale
Full fine-tuning Large domain shifts, new modalities, pretraining continuation Multi-GPU cost, catastrophic forgetting, slow iteration Labeled data and task fit LoRA; budget constrained
RAG only Knowledge-heavy Q&A where weights need not change Retrieval latency, chunk quality dependency Task needs output format or reasoning style baked into weights (SQL, code, tone)

Deployment: merge, quantize, and evaluate

Training finishes with a small adapter folder plus the frozen 4-bit base in memory. Production paths:

  • Merge into fp16/bf16 base — bake adapter deltas into a standard checkpoint; deploy like any fine-tuned model. Best for single-tenant low-latency serving.
  • Sidecar adapter — keep one shared quantized base, hot-swap LoRA weights per tenant. Pairs with multi-LoRA inference engines.
  • Merge then quantize for inference — GPTQ/AWQ on the merged weights for deployment at INT4; run calibration on domain prompts after merge, not on the raw pretrained base alone.

Evaluation should mirror SFT best practice: held-out task metrics first, then spot-check general capability regression on unrelated prompts. For SQL and code, execution accuracy beats string match. For chat tone, pair human review with LLM-as-judge rubrics.

Common pitfalls

  • Expecting QLoRA to shrink activation memory — context length and batch size dominate activation footprint; use gradient checkpointing, packing, or shorter sequences separately.
  • Wrong compute dtype — fp16 compute on Ampere-and-newer GPUs often underperforms bf16 for training stability; match bnb_4bit_compute_dtype to hardware support.
  • Skipping bf16 LoRA baseline on 7B — when memory allows, compare QLoRA vs bf16 LoRA on the same split; the gap is task-dependent and sometimes zero.
  • Quantizing adapters at training time — keep LoRA weights in bf16/fp16; quantizing adapters during training usually hurts convergence.
  • Merge-then-quantize without calibration — post-merge INT4 needs domain representative samples or accuracy drops on tail queries.
  • Outdated bitsandbytes / PEFT pins — NF4 and double-quant bugs appear across version pairs; lock versions and run golden forward-pass tests after upgrades.
  • Training on already GPTQ weights — QLoRA expects the standard NF4 training path; re-quantizing arbitrary checkpoints can silently distort weight distributions.

Production checklist

  • Confirm GPU VRAM budget; estimate peak with base size, rank, batch, and seq length.
  • Install pinned bitsandbytes, PEFT, and transformers versions known to work together.
  • Load base with NF4 + double quant + bf16 compute dtype.
  • Attach LoRA to target modules; verify trainable param count is under 1–2% of base.
  • Enable gradient checkpointing and paged AdamW (or 8-bit Adam if paged unavailable).
  • Run step-0 loss check against frozen base to catch config errors.
  • Track task metrics each epoch; hold out stratified difficult examples.
  • Compare against prompt-only and (if feasible) bf16 LoRA baselines.
  • Merge adapter for production or register sidecar path in serving gateway.
  • Run post-merge calibration if deploying INT4/INT8 inference weights.
  • Document adapter version, training data hash, quant config, and eval metrics in model registry.

Key takeaways

  • QLoRA stores frozen weights in 4-bit NF4 while training bf16 LoRA adapters.
  • Double quantization and paged optimizers recover enough VRAM to fine-tune 13B+ models on one GPU.
  • It is a memory stack, not a new adapter math — rank, alpha, and targets follow standard LoRA guidance.
  • Harbor Analytics lifted SQL execution accuracy from 71% to 94% with a 47 MB adapter on one A100.
  • Merge for latency or sidecar for multi-tenant; calibrate if you quantize again at inference.
  • Benchmark against bf16 LoRA when VRAM allows before assuming QLoRA is mandatory.

Related reading