
Mixed precision training explained

Harbor Analytics' edge defect classifier was stalling at batch size 32 on a single A10G: each epoch took 47 minutes, and the team could not fit a larger ResNet-50 without cutting image resolution. Switching from pure FP32 to mixed precision training with BF16 forward passes and FP32 master weights cut peak VRAM by 38%, roughly doubled throughput on Tensor Cores, and let them train at batch size 64 with essentially unchanged validation F1 (0.891 vs 0.890).

Mixed precision training runs most matrix math in 16-bit floats while keeping sensitive operations and weight updates in FP32. The result is faster training, lower memory use, and often the same final accuracy, provided you handle loss scaling, gradient underflow, and the handful of layers that must stay in full precision.

This guide explains FP32 vs FP16 vs BF16, how NVIDIA Tensor Cores accelerate 16-bit GEMMs, dynamic loss scaling with PyTorch autocast and GradScaler, a Harbor Analytics worked example, a format decision table, pitfalls, and a production checklist.

Why full FP32 is wasteful

Standard deep learning historically used 32-bit IEEE floats (FP32) for everything: activations, weights, gradients, and optimizer states. FP32 gives ~7 decimal digits of precision — far more than most neural network layers need for stable training. The cost is memory bandwidth and compute: every tensor element occupies 4 bytes, and pre-Tensor-Core GPUs could not exploit the narrower format for speedups.
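To make the 4-bytes-per-element cost concrete, here is a small sketch of the arithmetic for one input batch. The batch size (32) and 512×512 resolution come from the Harbor example later in this guide; the 3 color channels and the helper name tensor_mib are assumptions made for illustration.

```python
# Bytes per element for the formats discussed in this guide.
BYTES_PER_ELEMENT = {"fp32": 4, "fp16": 2, "bf16": 2}

def tensor_mib(shape, dtype):
    """Return the size of a dense tensor in MiB (hypothetical helper)."""
    elements = 1
    for dim in shape:
        elements *= dim
    return elements * BYTES_PER_ELEMENT[dtype] / 2**20

# Assumed shape: batch 32, 3 channels (RGB), 512x512 pixels.
batch_shape = (32, 3, 512, 512)
fp32_mib = tensor_mib(batch_shape, "fp32")
bf16_mib = tensor_mib(batch_shape, "bf16")
print(f"FP32: {fp32_mib:.1f} MiB, BF16: {bf16_mib:.1f} MiB")  # FP32: 96.0 MiB, BF16: 48.0 MiB
```

The same halving applies to any activation that is stored in a 16-bit format for the backward pass, which is where the memory savings of mixed precision largely come from.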

Modern NVIDIA GPUs (Volta and later) include Tensor Cores: dedicated units that perform a small matrix multiply-accumulate every clock cycle, taking 16-bit inputs and accumulating in FP32, at much higher throughput than the FP32 CUDA cores. Training in 16-bit for the heavy ops while accumulating in FP32 is the standard recipe for large vision and language models. The technique is not a hack; it is how most production PyTorch and JAX training loops run today.

What “mixed” means in practice

Mixed precision does not mean every tensor is FP16. A typical training step:

Forward pass: convolutions, linear layers, and attention matmuls run in FP16 or BF16 inside an autocast region.
Loss-sensitive ops: softmax, layer norm reductions, and loss computation often stay in FP32 for numerical stability.
Backward pass: gradients are computed in mixed precision; with FP16, the loss is scaled before backprop and the gradients are unscaled before the optimizer step.
Weight update: master weights live in FP32; the optimizer updates the FP32 copies, which are then cast to 16-bit for the next forward pass.

This split is why mixed precision rarely changes final model quality when configured correctly — the high-precision accumulator and master weights absorb rounding error from the fast path.
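The role of the FP32 master copy can be seen with plain Python, using the standard library's struct module to round values to IEEE half precision. This is only a sketch: the weight value, update size, and step count are made up for illustration, and to_fp16 is a hypothetical helper, not a PyTorch API.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

update = 1e-4          # assumed per-step weight change (learning rate * gradient)
steps = 1000

w_fp16 = to_fp16(1.0)  # weight stored and updated only in FP16
w_master = 1.0         # high-precision master weight (FP64 here, standing in for FP32)

for _ in range(steps):
    # Near 1.0, adjacent FP16 values are about 0.00098 apart, so +1e-4 rounds away.
    w_fp16 = to_fp16(w_fp16 + update)
    w_master += update

print(w_fp16)              # 1.0: every update was lost
print(round(w_master, 4))  # 1.1: the updates accumulated
```

The FP16-only weight never moves, while the master copy drifts by the full 0.1. This is the rounding error the master weights are there to absorb.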

FP16 vs BF16 vs FP32

Both FP16 and BF16 use 16 bits per value, but they trade precision for range differently:

Format	Exponent bits	Mantissa bits	Dynamic range	Typical use
FP32	8	23	~1.2e-38 to 3.4e38	Master weights, loss, optimizer states
FP16 (half)	5	10	Narrow (~6e-5 to 65504)	Legacy AMP on V100; needs loss scaling
BF16 (bfloat16)	8	7	~1.2e-38 to 3.4e38 (same exponent as FP32)	Default on A100/H100; often no loss scaling
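
The ranges in the table follow directly from the bit layout. As a minimal sketch (the function name format_limits is illustrative), the largest finite value, the smallest normal value, and the spacing just above 1.0 of an IEEE-style format can be computed from its exponent and mantissa widths:

```python
def format_limits(exp_bits, man_bits):
    """Return (max finite, min normal, epsilon) for an IEEE-style binary format."""
    bias = 2 ** (exp_bits - 1) - 1
    max_finite = (2 - 2.0 ** -man_bits) * 2.0 ** bias
    min_normal = 2.0 ** (1 - bias)
    epsilon = 2.0 ** -man_bits  # gap between 1.0 and the next representable value
    return max_finite, min_normal, epsilon

for name, e, m in [("FP32", 8, 23), ("FP16", 5, 10), ("BF16", 8, 7)]:
    max_finite, min_normal, epsilon = format_limits(e, m)
    print(f"{name}: max={max_finite:.4g} min_normal={min_normal:.3g} eps={epsilon:.3g}")
```

FP16 tops out at 65504, while BF16 shares FP32's smallest normal value and pays for that range with a much coarser epsilon (2^-7 instead of 2^-10 for FP16).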

When BF16 wins

BF16 (brain float) keeps FP32's 8-bit exponent, so activations and gradients rarely overflow or underflow to zero. On Ampere-and-newer GPUs, BF16 Tensor Core paths are well optimized. Most new training code defaults to torch.bfloat16 on supported hardware.

When FP16 still matters

FP16 has more mantissa bits than BF16 (10 vs 7), so it is more precise for values that fall inside its range, but the narrow exponent range causes gradient underflow without loss scaling. FP16 remains common on older V100 clusters and in inference stacks where BF16 kernels are unavailable. Pair FP16 training with dynamic loss scaling (below) and monitor for inf / nan in gradients.
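
The precision-versus-range trade-off can be checked without a GPU. The sketch below rounds a few sample values to each format using only the standard library; to_fp16 and to_bf16 are hypothetical helpers, and the BF16 version is an emulation that rounds the FP32 bit pattern to its top 16 bits.

```python
import struct

def to_fp16(x):
    """Round to the nearest FP16 value; values beyond the FP16 range become inf."""
    try:
        return struct.unpack("<e", struct.pack("<e", x))[0]
    except OverflowError:
        return float("inf") if x > 0 else float("-inf")

def to_bf16(x):
    """Emulate BF16: round the FP32 bit pattern to its top 16 bits (nearest, ties to even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    bits += 0x7FFF + ((bits >> 16) & 1)
    bits &= 0xFFFF0000
    return struct.unpack("<f", struct.pack("<I", bits))[0]

# A value near 1, a tiny gradient-like value, and a large activation-like value.
for value in (1.001, 1e-8, 70000.0):
    print(value, "-> fp16:", to_fp16(value), " bf16:", to_bf16(value))
```

FP16 keeps 1.001 as 1.0009765625 where BF16 rounds it to 1.0, but FP16 flushes 1e-8 to zero and overflows at 70000, while BF16 keeps both within about 0.4% of the true value.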

Loss scaling and gradient underflow

FP16 gradients can round to zero when their magnitude falls below the representable range (roughly 6e-8, the smallest FP16 subnormal). The symptom resembles the vanishing gradient problem, but the cause is the numeric format rather than the architecture. Loss scaling multiplies the loss by a large factor (e.g. 2¹⁶) before backprop so gradients are computed at a larger magnitude, then divides the gradients by the same factor before the optimizer step.
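
A minimal numeric sketch of that scale-then-unscale round trip, using the standard library to round to FP16. The gradient value 1e-8 is an assumption chosen to sit below FP16's range, and to_fp16 is a hypothetical helper.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

true_grad = 1e-8        # assumed gradient magnitude, below FP16's smallest subnormal
loss_scale = 2.0 ** 16

# Without scaling, storing the gradient in FP16 flushes it to zero.
unscaled = to_fp16(true_grad)

# With scaling, backprop produces grad * scale, which FP16 can represent;
# dividing by the scale in higher precision recovers the original magnitude.
scaled = to_fp16(true_grad * loss_scale)
recovered = scaled / loss_scale

print(unscaled)   # 0.0
print(recovered)  # close to 1e-8
```

Because the scale is a power of two, multiplying and dividing by it changes only the exponent, so the only error left is the ordinary FP16 rounding of the scaled value.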

Dynamic vs static scaling

Static loss scaling uses a fixed multiplier chosen by trial. It is simple but brittle — too small and gradients still underflow; too large and you get overflow (inf). Dynamic loss scaling (PyTorch GradScaler) starts high, backs off when overflows are detected, and grows again when steps are clean. This is the default for FP16 AMP and requires almost no manual tuning.
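
The back-off and growth behavior described above can be sketched as a toy class. This is not PyTorch's GradScaler implementation; the defaults are chosen to mirror GradScaler's documented defaults (initial scale 2^16, growth factor 2.0, backoff factor 0.5, growth interval 2000), and the class name ToyLossScaler is made up for illustration.

```python
class ToyLossScaler:
    """Toy dynamic loss scaler (illustrative only, not PyTorch's GradScaler)."""

    def __init__(self, init_scale=2.0 ** 16, growth_factor=2.0,
                 backoff_factor=0.5, growth_interval=2000):
        self.scale = init_scale
        self.growth_factor = growth_factor
        self.backoff_factor = backoff_factor
        self.growth_interval = growth_interval
        self.clean_steps = 0

    def update(self, found_overflow):
        """Adjust the scale after a step; return True if the optimizer step should run."""
        if found_overflow:
            # inf/nan in the gradients: shrink the scale and skip this step.
            self.scale *= self.backoff_factor
            self.clean_steps = 0
            return False
        self.clean_steps += 1
        if self.clean_steps == self.growth_interval:
            # A long run of clean steps: try a larger scale again.
            self.scale *= self.growth_factor
            self.clean_steps = 0
        return True

scaler = ToyLossScaler(growth_interval=3)  # short interval so the demo shows growth
history = []
for overflow in [True, False, False, False, True]:
    stepped = scaler.update(overflow)
    history.append((stepped, scaler.scale))
print(history)
```

In the demo, the first overflow halves the scale and skips the step, three clean steps double it again, and the final overflow halves it once more.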

With BF16, loss scaling is usually unnecessary because the exponent range matches FP32. Many teams run BF16 autocast with GradScaler disabled and see stable training on transformers and CNNs alike.

PyTorch AMP pattern

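import torch

# Assumes model, optimizer, criterion, and loader are already defined.
# FP16 path: GradScaler is required. For BF16, set dtype=torch.bfloat16,
# drop the scaler, and call loss.backward() and optimizer.step() directly.
scaler = torch.cuda.amp.GradScaler()

for inputs, labels in loader:
    inputs = inputs.cuda(non_blocking=True)
    labels = labels.cuda(non_blocking=True)
    optimizer.zero_grad(set_to_none=True)
    with torch.autocast(device_type="cuda", dtype=torch.float16):
        logits = model(inputs)
        loss = criterion(logits, labels)
    scaler.scale(loss).backward()   # backprop on the scaled loss
    scaler.unscale_(optimizer)      # restore true gradient magnitudes before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    scaler.step(optimizer)          # skipped if inf/nan gradients were found
    scaler.update()                 # adjust the scale for the next iteration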

Key details: wrap only the forward pass and loss in autocast; keep backward, gradient clipping, and the optimizer step outside it. scaler.step unscales the gradients internally if unscale_ has not already been called. Use AdamW or your chosen optimizer as usual: under autocast the model parameters stay in FP32 and are cast per op, so the optimizer always updates FP32 weights.
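
For the BF16 path, the same step needs no scaler. The sketch below is a minimal, self-contained version that runs on CPU so it can be tried without a GPU; the tiny model and random data are placeholders, and on a supported GPU you would move the model and tensors to CUDA and set device_type="cuda".

```python
import torch
from torch import nn

torch.manual_seed(0)

# Placeholder model and data; a real run would use the actual network and DataLoader.
model = nn.Sequential(nn.Linear(16, 32), nn.ReLU(), nn.Linear(32, 4))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
criterion = nn.CrossEntropyLoss()
inputs = torch.randn(8, 16)
labels = torch.randint(0, 4, (8,))

optimizer.zero_grad(set_to_none=True)
with torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    logits = model(inputs)            # linear layers run in BF16
    loss = criterion(logits, labels)
loss.backward()                       # no GradScaler needed for BF16
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()

print(logits.dtype)                    # torch.bfloat16
print(next(model.parameters()).dtype)  # torch.float32: parameters are not downcast
```

The printed dtypes show the split described above: activations from the linear layers are BF16, while the parameters the optimizer updates remain FP32.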

What to keep in FP32

Not every op benefits from or tolerates 16-bit math. PyTorch's autocast maintains per-op lists of operations that run in 16-bit and operations that are forced to FP32; you should also know the common manual exceptions:

Loss functions with log-sum-exp (cross-entropy, CTC) — often more stable in FP32.
Layer normalization and batch norm reductions — autocast typically runs these in FP32 internally; do not force-cast without testing.
Reductions (mean/variance): rounding error accumulates in FP16, and statistics computed over only a few elements are especially sensitive; accumulate in FP32.
Exponential moving averages for EMA model weights — keep EMA buffers in FP32.
Embedding tables with rare tokens — large vocabulary gradients can be noisy in FP16; BF16 is usually fine.

For transformer training at scale, specialized kernels like FlashAttention fuse attention in mixed precision with careful accumulation order. If you write custom CUDA, match the accumulation dtype to FP32 even when inputs are BF16.
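
Why the accumulation dtype matters can be shown with a plain sum. In this sketch the inputs are FP16 values, and the only difference between the two totals is the precision of the accumulator; to_fp16 is a hypothetical helper, and Python's FP64 float stands in for an FP32 accumulator.

```python
import struct

def to_fp16(x):
    """Round a Python float to the nearest IEEE FP16 value (hypothetical helper)."""
    return struct.unpack("<e", struct.pack("<e", x))[0]

values = [to_fp16(1.0)] * 4096  # FP16 inputs

acc_fp16 = 0.0
acc_wide = 0.0
for v in values:
    acc_fp16 = to_fp16(acc_fp16 + v)  # accumulator rounded to FP16 after every add
    acc_wide += v                     # wider accumulator (FP64 here, standing in for FP32)

print(acc_fp16)  # 2048.0: above 2048, FP16 values are 2 apart, so adding 1 is lost
print(acc_wide)  # 4096.0
```

The FP16 accumulator stalls at 2048, half the true total, even though every individual input is exactly representable.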

Worked example: Harbor Analytics defect classifier

Harbor Analytics trains a ResNet-50 on 512×512 factory camera frames to detect surface defects. Baseline FP32 training on one A10G (24 GB): batch 32, ~47 min/epoch, peak VRAM 21.4 GB.

Migration steps

Enable torch.autocast(device_type="cuda", dtype=torch.bfloat16) around forward + loss; no GradScaler.
Verify that the model (model.cuda()) and input tensors are on the GPU; keep pin_memory=True on the DataLoader.
Increase batch size to 64; enable torch.backends.cudnn.benchmark = True for fixed input sizes.
Compare validation F1 for 5 epochs before full run; watch for divergence in first 500 steps.
Log torch.cuda.max_memory_allocated() per epoch for regression tests.

Results

Throughput: 47 min/epoch → 23 min/epoch (~2.0× speedup).
VRAM: 21.4 GB → 13.2 GB peak at batch 64.
Quality: validation F1 0.891 (BF16) vs 0.890 (FP32) after 30 epochs — within noise.
Failure mode caught early: a first attempt using FP16 instead of BF16 showed an F1 drop to 0.84 at epoch 8 due to gradient underflow in the final classifier layer; switching to BF16 fixed it without loss scaling.

Harbor now defaults to BF16 for all convolutional training and reserves FP32-only runs for regression baselines when auditing new architectures.
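
The headline figures can be re-derived from the reported numbers with a few lines of arithmetic. The per-sample memory figure is an extra, rough comparison added here because the two runs use different batch sizes; it is not a number from the Harbor report.

```python
# Figures reported in the results above.
fp32_minutes, bf16_minutes = 47, 23
fp32_vram_gb, bf16_vram_gb = 21.4, 13.2
fp32_batch, bf16_batch = 32, 64

speedup = fp32_minutes / bf16_minutes
vram_reduction = (fp32_vram_gb - bf16_vram_gb) / fp32_vram_gb
# Peak memory divided by batch size: rough, since peak VRAM also includes
# fixed costs such as weights and optimizer state.
fp32_gb_per_sample = fp32_vram_gb / fp32_batch
bf16_gb_per_sample = bf16_vram_gb / bf16_batch

print(f"speedup: {speedup:.2f}x")
print(f"peak VRAM reduction: {vram_reduction:.1%}")
print(f"GB per sample: {fp32_gb_per_sample:.3f} -> {bf16_gb_per_sample:.3f}")
```

This gives about 2.04× and 38.3%, matching the rounded figures quoted in the introduction.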

Format decision table

Your situation	Favored approach	Caution
Ampere/Hopper GPU (A100, H100, A10G)	BF16 autocast, no GradScaler	Verify GPU supports BF16 Tensor Cores
V100 or older without BF16	FP16 autocast + dynamic GradScaler	Monitor overflow skips; tune initial scale if needed
LLM fine-tuning (LoRA/QLoRA)	BF16 or FP16 per base model docs	Quantized bases may require specific dtype pairing
Small model, CPU-only training	Stay FP32	No Tensor Core benefit; AMP adds complexity
Custom loss with extreme log magnitudes	BF16 autocast; compute loss in FP32	Cast logits with `.float()` before the loss; casting the loss afterward is too late
Reproducibility audits	FP32 baseline + seeded BF16 comparison	Non-deterministic Tensor Core ops may differ slightly
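
The hardware rows of the table can be turned into a small selection helper. This is a sketch: choose_precision is a hypothetical function, it covers only the GPU-capability and CPU-only rows, and it returns dtype names as strings so it runs without PyTorch installed.

```python
def choose_precision(has_cuda, bf16_supported):
    """Map hardware capabilities to (autocast dtype name, use_grad_scaler).

    Hypothetical helper mirroring the hardware rows of the decision table.
    """
    if not has_cuda:
        return "float32", False   # CPU-only: stay in FP32
    if bf16_supported:
        return "bfloat16", False  # Ampere or newer: BF16, no GradScaler
    return "float16", True        # older GPUs: FP16 with dynamic loss scaling

print(choose_precision(has_cuda=True, bf16_supported=True))
print(choose_precision(has_cuda=True, bf16_supported=False))
print(choose_precision(has_cuda=False, bf16_supported=False))
```

In a PyTorch script the two flags could come from torch.cuda.is_available() and torch.cuda.is_bf16_supported().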

Pitfalls

Autocast everywhere. Wrapping the optimizer or gradient clipping inside autocast causes subtle bugs. Keep optimizer steps in full precision.
Forgetting master weights. Training directly in FP16 without FP32 copies lets weight updates round away small changes over thousands of steps.
FP16 on BF16 hardware without scaling. Teams copy old FP16 recipes onto A100s, see NaNs, and blame the model. Switch to BF16 first.
Batch norm in eval mismatch. Running BN stats computed in mixed precision without syncing can shift inference behavior. Validate eval mode F1 after AMP migration.
Gradient clipping order. With GradScaler, call unscale_ before clip_grad_norm_ or clipping operates on scaled gradients.
False speedups on tiny models. Overhead from casting dominates when the model fits in L2 cache. Profile before assuming 2× gains.
Checkpoint dtype surprises. Saving only FP16 weights loses precision; save FP32 state dicts or full optimizer checkpoints for resume fidelity.
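
The gradient clipping order pitfall is easy to reproduce numerically. The sketch below uses plain Python lists in place of tensors; the gradient values and the clip_by_norm helper are made up for illustration.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale a list of gradient values so their L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grads))
    factor = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * factor for g in grads]

loss_scale = 2.0 ** 16
true_grads = [0.3, 0.4]  # L2 norm 0.5, already under the 1.0 limit
scaled_grads = [g * loss_scale for g in true_grads]

# Correct order: unscale first, then clip. Nothing is clipped.
right = clip_by_norm([g / loss_scale for g in scaled_grads], max_norm=1.0)

# Wrong order: clip the scaled gradients, then unscale. The update nearly vanishes.
wrong = [g / loss_scale for g in clip_by_norm(scaled_grads, max_norm=1.0)]

print(right)
print(wrong)
```

With the wrong order, gradients that needed no clipping at all are shrunk to a norm of 1/65536, which silently stalls training.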

Production checklist

Confirm GPU supports BF16 or plan FP16 + GradScaler.
Wrap forward and loss in torch.autocast with chosen dtype.
Keep optimizer, scheduler, and EMA buffers in FP32.
For FP16: enable dynamic loss scaling; log scaler skips.
Run 3–5 epoch parity check against FP32 baseline metrics.
Log max_memory_allocated and steps/sec each epoch.
Validate eval-mode inference matches training dtype policy.
Save checkpoints in FP32 or full training state, not FP16-only.
Document dtype in experiment tracking for reproducibility.
Re-profile after architecture changes — AMP speedup varies by op mix.
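
The parity check item can be automated with a simple tolerance test. This is a sketch: parity_check is a hypothetical helper, and the 0.005 tolerance is an assumed threshold, not a value from the Harbor run.

```python
def parity_check(baseline, candidate, tolerance=0.005):
    """Return True if the candidate metric is within `tolerance` of the baseline.

    The 0.005 default is an assumed threshold; choose one from your own run-to-run noise.
    """
    return abs(candidate - baseline) <= tolerance

fp32_f1 = 0.890
print(parity_check(fp32_f1, 0.891))  # BF16 run from the worked example: True
print(parity_check(fp32_f1, 0.84))   # F1 reported for the failed FP16 attempt: False
```

A sensible tolerance comes from the spread of the FP32 baseline across seeds, so that a pass means the mixed precision run is within normal noise.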

Key takeaways

Mixed precision trains most ops in 16-bit and updates weights in FP32 — faster math, lower memory, same accuracy when configured correctly.
BF16 is the modern default on Ampere+ — wide exponent range often eliminates loss scaling.
FP16 needs dynamic loss scaling to prevent gradient underflow on older GPUs.
Tensor Cores drive the speedup — CPU training and tiny models see little benefit.
Always parity-check metrics before committing production training pipelines to mixed precision.