Mixed precision training explained
Harbor Analytics' edge defect classifier was stalling at batch size 32
on a single A10G — each epoch took 47 minutes and the team could
not fit a larger ResNet-50 without cutting image resolution. Switching
from pure FP32 to mixed precision training with BF16
forward passes and FP32 master weights cut peak VRAM by 38%, doubled
throughput on Tensor Cores, and let them train at batch size 64 with
identical validation F1 (0.891 vs 0.890). Mixed precision
training runs most matrix math in 16-bit floats while keeping
sensitive operations and weight updates in FP32. The result is faster
training, lower memory use, and often the same final accuracy —
provided you handle loss scaling, gradient underflow, and the handful of
layers that must stay in full precision. This guide explains FP32 vs
FP16 vs BF16, how NVIDIA Tensor Cores accelerate 16-bit GEMMs, dynamic
loss scaling with PyTorch autocast and
GradScaler, a Harbor Analytics worked example, a format
decision table, pitfalls, and a production checklist.
Why full FP32 is wasteful
Standard deep learning historically used 32-bit IEEE floats (FP32) for everything: activations, weights, gradients, and optimizer states. FP32 gives ~7 decimal digits of precision — far more than most neural network layers need for stable training. The cost is memory bandwidth and compute: every tensor element occupies 4 bytes, and pre-Tensor-Core GPUs could not exploit the narrower format for speedups.
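To make the 4-bytes-per-element cost concrete, the sketch below estimates the size of a single activation tensor in each format. The shape is an illustrative assumption (batch 32, 64 channels, a 256×256 feature map, roughly what an early ResNet stage produces for 512×512 inputs), not a measurement from any particular run.

```python
from math import prod

# Bytes per element for each storage format.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}

def tensor_mib(shape, fmt):
    """Return the size in MiB of a dense tensor with the given shape and format."""
    return prod(shape) * BYTES_PER_ELEMENT[fmt] / 2**20

# Hypothetical activation shape: (batch, channels, height, width).
activation_shape = (32, 64, 256, 256)

for fmt in ("fp32", "fp16", "bf16"):
    print(f"{fmt}: {tensor_mib(activation_shape, fmt):.0f} MiB")
```

Halving the element size halves both the storage for this tensor and the bytes that must move through memory to read or write it.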
Modern NVIDIA GPUs (Volta and later) include Tensor Cores: dedicated units that perform small-tile matrix multiply-accumulate on 16-bit inputs with FP32 accumulation, at much higher throughput than FP32 CUDA cores. Volta Tensor Cores accept FP16 inputs; BF16 support arrives with Ampere. Training in 16-bit for the heavy ops while accumulating in FP32 is the standard recipe for large vision and language models. The technique is not a hack: it is how most production PyTorch and JAX training loops run today.
What “mixed” means in practice
Mixed precision does not mean every tensor is FP16. A typical training step:
- Forward pass: convolutions, linear layers, and attention matmuls run in FP16 or BF16 inside an autocast region.
- Loss-sensitive ops: softmax, layer norm reductions, and loss computation often stay in FP32 for numerical stability.
- Backward pass: gradients are computed in mixed precision but may be scaled before the optimizer step.
- Weight update: master weights live in FP32; the optimizer updates those FP32 values, and 16-bit copies are produced by casting them for the next forward pass.
This split is why mixed precision rarely changes final model quality when configured correctly — the high-precision accumulator and master weights absorb rounding error from the fast path.
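The role of the master weights can be illustrated with a small stdlib-only sketch. It is a toy model under stated assumptions: a single scalar weight, a constant hypothetical update of 1e-4, and a Python float standing in for the FP32 master copy. The helper names `to_fp16` and `apply_updates` are made up for this example.

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

def apply_updates(steps, update, fp16_only):
    """Add `update` to a weight starting at 1.0 for `steps` steps.

    fp16_only=True:  the weight is rounded to FP16 after every step.
    fp16_only=False: a higher-precision master weight accumulates the
                     updates and is cast to FP16 only at the end.
    """
    weight = 1.0
    for _ in range(steps):
        weight = weight + update
        if fp16_only:
            weight = to_fp16(weight)
    return to_fp16(weight)

fp16_weight = apply_updates(1000, 1e-4, fp16_only=True)
master_weight = apply_updates(1000, 1e-4, fp16_only=False)
print(fp16_weight)    # stays at 1.0: each update is below half an FP16 step
print(master_weight)  # close to 1.1
```

Near 1.0 the spacing between adjacent FP16 values is about 0.001, so an update of 0.0001 is rounded away every time it is applied directly to a 16-bit weight, while the master copy accumulates all of them.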
FP16 vs BF16 vs FP32
Both FP16 and BF16 use 16 bits per value, but they trade precision for range differently:
| Format | Exponent bits | Mantissa bits | Dynamic range | Typical use |
|---|---|---|---|---|
| FP32 | 8 | 23 | Very wide | Master weights, loss, optimizer states |
| FP16 (half) | 5 | 10 | Narrow (~6e-5 to 65504) | Legacy AMP on V100; needs loss scaling |
| BF16 (bfloat16) | 8 | 7 | Same exponent as FP32 | Default on A100/H100; often no loss scaling |
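The range-versus-precision trade in the table can be checked without a GPU. The sketch below rounds Python floats to FP16 using the `struct` module's half-precision format and emulates BF16 by keeping the top 16 bits of the FP32 encoding with round-to-nearest-even. The helper names are hypothetical, and the BF16 emulation is an approximation of what hardware does, intended only for finite inputs.

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE half-precision value (inf on overflow)."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x):
    """Round x to bfloat16 by keeping the top 16 bits of its FP32 encoding."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # Round to nearest even on the 16 low bits that are about to be dropped.
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]

for value in (1e-8, 1.001, 70000.0):
    print(f"{value!r}: fp16={to_fp16(value)!r} bf16={to_bf16(value)!r}")
```

The tiny and the large value survive only in BF16 (FP16 gives zero and infinity respectively), while 1.001 keeps its fractional part only in FP16.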
When BF16 wins
BF16 (brain float) keeps FP32's 8-bit exponent,
so activations and gradients rarely overflow or underflow to zero.
On Ampere-and-newer GPUs, BF16 Tensor Core paths are well optimized.
Most new training code defaults to torch.bfloat16 on
supported hardware.
When FP16 still matters
FP16 has more mantissa bits than BF16 (10 vs 7), so it resolves
values inside its representable range more finely, but the
narrow exponent range causes gradient underflow without
loss scaling. FP16 remains common on older V100
clusters and in inference stacks where BF16 kernels are unavailable.
Pair FP16 training with dynamic loss scaling (below) and monitor for
inf / nan in gradients.
Loss scaling and gradient underflow
FP16 gradients can round to zero when their magnitude falls below the representable range. The symptom resembles the vanishing gradient problem, but the cause is numeric format rather than architecture. Loss scaling multiplies the loss by a large factor (e.g. 2^16 = 65,536) before backprop so gradients are computed at a larger magnitude, then divides them back before the optimizer step.
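A minimal numeric sketch of that idea, using only the standard library. The gradient value 1e-8 is a hypothetical example, and the scale is applied directly to the gradient here; in real backprop the chain rule carries the loss scale through to every gradient.

```python
import struct

def to_fp16(x):
    """Round x to the nearest IEEE half-precision value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

LOSS_SCALE = 2.0 ** 16

true_grad = 1e-8  # hypothetical gradient, below FP16's smallest subnormal

unscaled = to_fp16(true_grad)              # underflows to zero
scaled = to_fp16(true_grad * LOSS_SCALE)   # lands in FP16's normal range
recovered = scaled / LOSS_SCALE            # unscale in higher precision

print(unscaled, scaled, recovered)
```

The unscaled gradient is lost entirely, while the scaled one comes back within FP16 rounding error of the true value once it is divided by the scale in higher precision.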
Dynamic vs static scaling
Static loss scaling uses a fixed multiplier chosen by
trial. It is simple but brittle — too small and gradients still
underflow; too large and you get overflow (inf).
Dynamic loss scaling (PyTorch
GradScaler) starts high, backs off when overflows are
detected, and grows again when steps are clean. This is the default
for FP16 AMP and requires almost no manual tuning.
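The back-off and growth behavior can be sketched in a few lines. `ToyLossScaler` is a hypothetical class written for this guide, not PyTorch's implementation; its default values mirror the documented `GradScaler` defaults (initial scale 65536, growth factor 2.0, backoff factor 0.5, growth interval 2000).

```python
import math

class ToyLossScaler:
    """Toy model of dynamic loss scaling; not PyTorch's implementation."""

    def __init__(self, init_scale=2.0**16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, grads):
        """Adjust the scale after one step; return True if the step is safe to apply."""
        if any(math.isinf(g) or math.isnan(g) for g in grads):
            # Overflow: shrink the scale and skip this optimizer step.
            self.scale *= self.backoff_factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            # Enough clean steps in a row: try a larger scale again.
            self.scale *= self.growth_factor
            self.clean_steps = 0
        return True

scaler = ToyLossScaler(growth_interval=3)
print(scaler.update([0.1, float("inf")]), scaler.scale)  # False 32768.0
for _ in range(3):
    scaler.update([0.1, 0.2])
print(scaler.scale)  # 65536.0
```

A short growth interval is used in the usage lines only to keep the demonstration small; counting skipped steps in a real run is a quick way to spot a scale that keeps collapsing.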
With BF16, loss scaling is usually unnecessary because
the exponent range matches FP32. Many teams run BF16 autocast with
GradScaler disabled and see stable training on transformers
and CNNs alike.
PyTorch AMP pattern
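    import torch

    # Assumes model, criterion, optimizer, and loader already exist
    # and that each batch is on the GPU.
    # FP16 path. Newer PyTorch releases spell this torch.amp.GradScaler("cuda").
    scaler = torch.cuda.amp.GradScaler()

    for inputs, labels in loader:
        optimizer.zero_grad(set_to_none=True)
        # Autocast covers only the forward pass and the loss.
        with torch.autocast(device_type="cuda", dtype=torch.float16):
            logits = model(inputs)
            loss = criterion(logits, labels)
        scaler.scale(loss).backward()  # BF16: plain loss.backward(), no scaler
        scaler.unscale_(optimizer)     # unscale before clipping
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        scaler.step(optimizer)         # skips the step if grads hold inf/nan
        scaler.update()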
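Key details: wrap only the forward pass and the loss in autocast,
and keep the backward pass and optimizer step outside it.
scaler.step unscales the gradients itself unless unscale_ was
already called, and skips the update when it finds inf or nan.
Use AdamW or your chosen optimizer on FP32 parameters: under
autocast the model's parameters stay in FP32 and eligible ops cast
their inputs to 16-bit on the fly, so no separate master copy is needed.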
What to keep in FP32
Not every op benefits from or tolerates 16-bit math. PyTorch's autocast maintains a whitelist/blacklist; you should also know the common manual exceptions:
- Loss functions with log-sum-exp (cross-entropy, CTC) — often more stable in FP32.
- Layer normalization and batch norm reductions — autocast typically runs these in FP32 internally; do not force-cast without testing.
- Small reduction dimensions — mean/variance over few elements accumulate error in FP16.
- Exponential moving averages for EMA model weights — keep EMA buffers in FP32.
- Embedding tables with rare tokens — large vocabulary gradients can be noisy in FP16; BF16 is usually fine.
For transformer training at scale, specialized kernels like FlashAttention fuse attention in mixed precision with careful accumulation order. If you write custom CUDA, match the accumulation dtype to FP32 even when inputs are BF16.
Worked example: Harbor Analytics defect classifier
Harbor Analytics trains a ResNet-50 on 512×512 factory camera frames to detect surface defects. Baseline FP32 training on one A10G (24 GB): batch 32, ~47 min/epoch, peak VRAM 21.4 GB.
Migration steps
- Enable torch.autocast(device_type="cuda", dtype=torch.bfloat16) around forward + loss; no GradScaler.
- Verify model.cuda() and input tensors are on GPU; keep pin_memory=True on DataLoader.
- Increase batch size to 64; enable torch.backends.cudnn.benchmark = True for fixed input sizes.
- Compare validation F1 for 5 epochs before full run; watch for divergence in first 500 steps.
- Log torch.cuda.max_memory_allocated() per epoch for regression tests.
Results
- Throughput: 47 min/epoch → 23 min/epoch (~2.0× speedup).
- VRAM: 21.4 GB → 13.2 GB peak at batch 64.
- Quality: validation F1 0.891 (BF16) vs 0.890 (FP32) after 30 epochs — within noise.
- Failure mode caught early: first FP16 attempt (without BF16 hardware path) showed F1 drop to 0.84 at epoch 8 due to gradient underflow in the final classifier layer; switching to BF16 fixed it without loss scaling.
Harbor now defaults to BF16 for all convolutional training and reserves FP32-only runs for regression baselines when auditing new architectures.
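The headline ratios follow directly from the reported numbers. A short arithmetic check, using only the figures quoted in this section:

```python
# Figures reported in the worked example above.
baseline_minutes, amp_minutes = 47, 23
baseline_vram_gb, amp_vram_gb = 21.4, 13.2

speedup = baseline_minutes / amp_minutes
vram_reduction_pct = 100 * (baseline_vram_gb - amp_vram_gb) / baseline_vram_gb

print(f"speedup: {speedup:.2f}x")                     # speedup: 2.04x
print(f"VRAM reduction: {vram_reduction_pct:.1f}%")   # VRAM reduction: 38.3%
```

Note that the VRAM comparison is between batch 32 in FP32 and batch 64 in BF16, so the per-sample saving is larger than the 38% peak reduction suggests.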
Format decision table
| Your situation | Favored approach | Caution |
|---|---|---|
| Ampere/Hopper GPU (A100, H100, A10G) | BF16 autocast, no GradScaler | Verify GPU supports BF16 Tensor Cores |
| V100 or older without BF16 | FP16 autocast + dynamic GradScaler | Monitor overflow skips; tune initial scale if needed |
| LLM fine-tuning (LoRA/QLoRA) | BF16 or FP16 per base model docs | Quantized bases may require specific dtype pairing |
| Small model, CPU-only training | Stay FP32 | No Tensor Core benefit; AMP adds complexity |
| Custom loss with extreme log magnitudes | BF16 autocast; compute loss in FP32 | Cast the loss inputs with .float() before computing the loss, not the loss value afterward |
| Reproducibility audits | FP32 baseline + seeded BF16 comparison | Non-deterministic Tensor Core ops may differ slightly |
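The hardware rows of the table can be expressed as a small selection helper. This is a sketch under stated assumptions: `choose_amp_policy` is a hypothetical function, and the thresholds encode only the general rule that Tensor Cores start at CUDA compute capability 7.0 (Volta) and BF16 Tensor Core support starts at 8.0 (Ampere). In a real script the tuple could come from `torch.cuda.get_device_capability()`, and `torch.cuda.is_bf16_supported()` can confirm BF16 support.

```python
def choose_amp_policy(has_cuda, compute_capability=None):
    """Map hardware to (autocast dtype name, whether to use a GradScaler)."""
    if not has_cuda or compute_capability is None:
        return ("float32", False)    # CPU-only: stay in FP32
    if compute_capability >= (8, 0):
        return ("bfloat16", False)   # Ampere or newer: BF16, no loss scaling
    if compute_capability >= (7, 0):
        return ("float16", True)     # Volta/Turing: FP16 + dynamic scaling
    return ("float32", False)        # no Tensor Cores

print(choose_amp_policy(True, (8, 6)))  # A10G -> ('bfloat16', False)
print(choose_amp_policy(True, (7, 0)))  # V100 -> ('float16', True)
print(choose_amp_policy(False))         # ('float32', False)
```

The softer rows (LoRA fine-tuning, custom losses, reproducibility audits) depend on the model and the experiment rather than the GPU, so they are left out of the helper.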
Pitfalls
- Autocast everywhere. Wrapping the optimizer or gradient clipping inside autocast causes subtle bugs. Keep optimizer steps in full precision.
- Forgetting master weights. Training directly in FP16 without FP32 copies lets weight updates round away small changes over thousands of steps.
- FP16 on BF16 hardware without scaling. Teams copy old FP16 recipes onto A100s, see NaNs, and blame the model. Switch to BF16 first.
- Batch norm in eval mismatch. Running BN stats computed in mixed precision without syncing can shift inference behavior. Validate eval mode F1 after AMP migration.
- Gradient clipping order. With GradScaler, call unscale_ before clip_grad_norm_ or clipping operates on scaled gradients.
- False speedups on tiny models. Overhead from casting dominates when the model fits in L2 cache. Profile before assuming 2× gains.
- Checkpoint dtype surprises. Saving only FP16 weights loses precision; save FP32 state dicts or full optimizer checkpoints for resume fidelity.
Production checklist
- Confirm GPU supports BF16 or plan FP16 + GradScaler.
- Wrap forward and loss in torch.autocast with chosen dtype.
- Keep optimizer, scheduler, and EMA buffers in FP32.
- For FP16: enable dynamic loss scaling; log scaler skips.
- Run 3–5 epoch parity check against FP32 baseline metrics.
- Log max_memory_allocated and steps/sec each epoch.
- Validate eval-mode inference matches training dtype policy.
- Save checkpoints in FP32 or full training state, not FP16-only.
- Document dtype in experiment tracking for reproducibility.
- Re-profile after architecture changes — AMP speedup varies by op mix.
Key takeaways
- Mixed precision trains most ops in 16-bit and updates weights in FP32 — faster math, lower memory, same accuracy when configured correctly.
- BF16 is the modern default on Ampere+ — wide exponent range often eliminates loss scaling.
- FP16 needs dynamic loss scaling to prevent gradient underflow on older GPUs.
- Tensor Cores drive the speedup — CPU training and tiny models see little benefit.
- Always parity-check metrics before committing production training pipelines to mixed precision.