
Neural network optimizers explained

After backpropagation computes gradients, an optimizer decides how to update each weight. That choice shapes whether training converges in hours or stalls for days, whether your model generalizes or memorizes, and whether fine-tuning a large language model requires delicate tuning or "just works" with defaults. Stochastic gradient descent (SGD) with momentum remains the workhorse for vision pretraining; Adam and especially AdamW dominate transformer and LLM training; older adaptive methods like AdaGrad and RMSprop still appear in niche contexts. This guide explains what each optimizer does mathematically, how learning rate schedules interact with them, when to prefer SGD over Adam, and how weight decay differs from L2 regularization in practice — with links to deep learning fundamentals, hyperparameter tuning, and validation discipline.

What an optimizer actually does

A neural network is a function with millions of parameters. Training minimizes a loss function — cross-entropy for classification, mean squared error for regression — by nudging weights in the direction that reduces loss. Backpropagation computes the partial derivative of the loss with respect to each weight: the gradient. The simplest update rule is:

w_new = w_old − η × gradient

Here η (eta) is the learning rate, the step size. Too large and training diverges: loss spikes or becomes NaN. Too small and convergence crawls, and the model can stall near poor local minima or saddle points. Modern optimizers add memory (momentum), per-parameter scaling (adaptive methods), and scheduled η changes to navigate high-dimensional loss landscapes more reliably than vanilla gradient descent on the full dataset ever could.
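
To make the effect of the step size concrete, here is a minimal sketch that applies the update rule above to the toy loss f(w) = w². The function name, the starting point, and the three learning rates are illustrative assumptions, not values from any particular recipe.

```python
def gradient_descent(lr, steps=20, w0=1.0):
    """Minimize f(w) = w**2 with the plain update w <- w - lr * gradient."""
    w = w0
    for _ in range(steps):
        grad = 2.0 * w        # derivative of w**2
        w = w - lr * grad     # the basic update rule
    return w

# Hypothetical learning rates: too small, reasonable, too large.
for lr in (0.01, 0.1, 1.1):
    print(f"lr={lr}: w after 20 steps = {gradient_descent(lr):.4g}")
```

On this toy problem the smallest rate is still far from the minimum at 0 after 20 steps, the middle rate gets close, and the largest rate overshoots on every step so that |w| grows instead of shrinking.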

Batch, mini-batch, and stochastic gradient descent

Full-batch gradient descent computes gradients over the entire training set before each update. It is stable but slow and memory-heavy at scale. Stochastic gradient descent (SGD) updates after every single example — fast but noisy. Mini-batch SGD is the practical compromise: process a batch of 32, 128, or 512 samples, compute average gradient, update, repeat. The noise from mini-batches acts as implicit regularization and helps escape sharp minima that generalize poorly. When practitioners say "SGD" today, they almost always mean mini-batch SGD.
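
The mini-batch loop can be sketched in a few lines of plain Python. This is an illustrative example under stated assumptions: a one-parameter model y ≈ w·x, a toy noiseless dataset, and a hypothetical function name. Setting batch_size to 1 gives per-example SGD, and setting it to the dataset size gives full-batch gradient descent.

```python
import random

def minibatch_sgd(xs, ys, lr=0.05, batch_size=4, epochs=200, seed=0):
    """Fit y ~ w * x by mini-batch SGD on mean squared error."""
    rng = random.Random(seed)
    w = 0.0
    indices = list(range(len(xs)))
    for _ in range(epochs):
        rng.shuffle(indices)  # new sample order each epoch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # average gradient of (w*x - y)**2 over the batch
            grad = sum(2.0 * (w * xs[i] - ys[i]) * xs[i] for i in batch) / len(batch)
            w -= lr * grad
    return w

xs = [0.1 * i for i in range(1, 17)]  # 16 toy inputs
ys = [3.0 * x for x in xs]            # noiseless targets, true slope 3
print(minibatch_sgd(xs, ys))
```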

Momentum and Nesterov accelerated gradient

Plain SGD struggles on loss surfaces with ravines — steep walls and shallow valleys where gradients oscillate across the narrow dimension while progress along the valley floor is slow. Momentum accumulates a velocity vector in the direction of recent gradients:

v = β × v + gradient
w = w − η × v

The hyperparameter β (typically 0.9) controls how much past gradients influence the current step. Momentum smooths oscillations and accelerates movement along consistent directions — like a ball rolling downhill gaining speed.
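
The two-line update above can be tried on a ravine-shaped quadratic. The sketch below is a toy illustration: the loss, the learning rate of 0.015, and the function names are assumptions chosen so that both runs stay stable.

```python
def sgd_momentum(grad_fn, w, lr=0.015, beta=0.9, steps=100):
    """Apply v = beta * v + g, then w = w - lr * v, element-wise."""
    v = [0.0] * len(w)
    for _ in range(steps):
        g = grad_fn(w)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w

def ravine_grad(w):
    """Gradient of f(w) = 0.5 * (100 * w[0]**2 + w[1]**2): steep in w[0], shallow in w[1]."""
    return [100.0 * w[0], 1.0 * w[1]]

start = [1.0, 1.0]
plain = sgd_momentum(ravine_grad, start, beta=0.0)  # beta = 0 reduces to plain SGD
heavy = sgd_momentum(ravine_grad, start, beta=0.9)
print("plain SGD:", plain)
print("momentum :", heavy)
```

In this setup the shallow coordinate w[1] ends much closer to zero with momentum than without it, which is the "faster along the valley floor" effect. The steep coordinate is not helped in the same way, so the benefit depends on the shape of the loss.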

Nesterov accelerated gradient (NAG) looks ahead: it computes the gradient at the anticipated position after momentum, then corrects. In practice NAG often converges faster than standard momentum on convex problems. In PyTorch, torch.optim.SGD with momentum=0.9 gives standard momentum, and adding nesterov=True switches to NAG.
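
A minimal sketch of the look-ahead idea, written in the same notation as the momentum update above. The function name, the toy quadratic, and the hyperparameters are assumptions for illustration. Frameworks may use an algebraically rearranged variant, so this should not be read as any library's exact implementation.

```python
def nesterov(grad_fn, w, lr=0.01, beta=0.9, steps=200):
    """Momentum where the gradient is evaluated at the look-ahead point."""
    v = [0.0] * len(w)
    for _ in range(steps):
        # where the accumulated velocity alone would carry the weights
        lookahead = [wi - lr * beta * vi for wi, vi in zip(w, v)]
        g = grad_fn(lookahead)
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return w

def quad_grad(w):
    """Gradient of the toy loss 0.5 * (100 * w[0]**2 + w[1]**2)."""
    return [100.0 * w[0], 1.0 * w[1]]

print(nesterov(quad_grad, [1.0, 1.0]))
```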

When SGD + momentum still wins

For training ResNet-style CNNs from scratch on ImageNet, SGD with momentum and a carefully tuned learning rate schedule often achieves slightly better generalization than Adam. Vision researchers frequently use SGD with momentum 0.9, weight decay 1e-4, and a step-decay or cosine schedule over 90–300 epochs. The trade-off: SGD requires more manual learning rate tuning; Adam is more forgiving out of the box.

Adaptive learning rate methods

Different parameters in a neural network receive gradients of vastly different magnitudes. Embedding layers, batch normalization scales, and final classification heads can need step sizes that differ by orders of magnitude. Adaptive optimizers maintain a per-parameter learning rate based on the history of gradients for that weight.

AdaGrad

AdaGrad accumulates the sum of squared gradients for each parameter and divides the learning rate by the square root of that sum. Parameters with large historical gradients get smaller effective steps; infrequent features in sparse data get larger steps. The problem: the accumulator grows monotonically, so the learning rate eventually shrinks toward zero and training stalls. AdaGrad is largely superseded but illustrates the core adaptive idea.
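
The shrinking step size is easy to see numerically. The sketch below tracks only the effective learning rate for one parameter; the function name, the base rate of 0.1, and the small eps added for numerical safety are assumptions.

```python
import math

def adagrad_effective_lrs(grads, lr=0.1, eps=1e-8):
    """Effective step size lr / (sqrt(sum of g**2) + eps) after each gradient."""
    accum = 0.0
    out = []
    for g in grads:
        accum += g * g  # the accumulator only ever grows
        out.append(lr / (math.sqrt(accum) + eps))
    return out

rates = adagrad_effective_lrs([1.0] * 1000)
print(rates[0], rates[99], rates[999])
```

With a constant gradient of 1, the effective rate falls in proportion to 1/√t, so after 100 steps it is about a tenth of its starting value and it keeps falling.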

RMSprop

RMSprop fixes AdaGrad's shrinking learning rate by using an exponential moving average of squared gradients instead of a cumulative sum. It remains popular for RNNs and reinforcement learning where non-stationary gradients are common. The decay rate (often 0.9 or 0.99) controls how quickly old gradient magnitudes are forgotten.
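
For comparison, here is the same kind of sketch with an exponential moving average in place of the cumulative sum. The function name and default values are illustrative assumptions.

```python
import math

def rmsprop_effective_lrs(grads, lr=0.01, decay=0.9, eps=1e-8):
    """Effective step size lr / (sqrt(EMA of g**2) + eps) after each gradient."""
    avg_sq = 0.0
    out = []
    for g in grads:
        avg_sq = decay * avg_sq + (1.0 - decay) * g * g  # old magnitudes fade out
        out.append(lr / (math.sqrt(avg_sq) + eps))
    return out

steady = rmsprop_effective_lrs([1.0] * 1000)
shifted = rmsprop_effective_lrs([1.0] * 200 + [0.1] * 200)
print(steady[-1], shifted[-1])
```

With a constant gradient the effective rate settles near lr instead of decaying toward zero, and when the gradient scale drops by a factor of ten the effective rate rises to compensate, which is the behavior that suits non-stationary problems.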

Adam: adaptive moment estimation

Adam combines momentum (first moment of gradients) with RMSprop-style scaling (second moment of gradients). It maintains two running averages per parameter (the mean of the gradients and the mean of the squared gradients, which is an uncentered variance estimate) and uses them to compute a bias-corrected update. Default hyperparameters (η = 1e-3, β1 = 0.9, β2 = 0.999) work surprisingly well across many tasks, which is why Adam became the default for transformers, GANs, and quick prototyping.
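
A single Adam step for one scalar parameter can be sketched as follows. The function name and the eps value of 1e-8 are assumptions for illustration; the other defaults are the ones quoted above.

```python
import math

def adam_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a scalar parameter; t is the 1-based step count."""
    m = beta1 * m + (1.0 - beta1) * g      # running mean of gradients
    v = beta2 * v + (1.0 - beta2) * g * g  # running mean of squared gradients
    m_hat = m / (1.0 - beta1 ** t)         # bias correction for the zero init
    v_hat = v / (1.0 - beta2 ** t)
    w = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    return w, m, v

w, m, v = 5.0, 0.0, 0.0
for t in range(1, 4):
    g = 2.0 * w  # gradient of the toy loss w**2
    w, m, v = adam_step(w, g, m, v, t)
    print(t, w)
```

On the first step the bias-corrected ratio is g / |g|, so the parameter moves by roughly lr regardless of how large the gradient is. That scale invariance is a large part of why the defaults transfer across tasks.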

Adam's weakness: it can generalize slightly worse than well-tuned SGD on some vision tasks, and the original formulation couples weight decay incorrectly with the adaptive scaling — L2 penalty gets divided by the adaptive denominator, weakening regularization for parameters with large gradients.

AdamW: decoupled weight decay

AdamW fixes the weight decay problem by applying the decay directly to the weights, decoupled from the adaptive gradient-based update, instead of adding an L2 penalty to the loss. This matches how weight decay was intended to work in SGD and produces better generalization when training BERT, GPT-style models, and vision transformers. For LLM fine-tuning, AdamW with weight decay around 0.01 and a warmup schedule is the de facto standard. PyTorch's AdamW should be your default over Adam unless you have a specific reason not to use it.
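
The difference between the two variants is where the decay term enters the update. The sketch below is a scalar illustration with a hypothetical function name and a decoupled flag invented for this example; it is not a copy of any framework's code.

```python
import math

def adam_like_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                   eps=1e-8, weight_decay=0.01, decoupled=True):
    """One step of AdamW (decoupled=True) or Adam with an L2-style term (False)."""
    if not decoupled:
        g = g + weight_decay * w  # penalty enters the gradient and gets rescaled
    m = beta1 * m + (1.0 - beta1) * g
    v = beta2 * v + (1.0 - beta2) * g * g
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    w_new = w - lr * m_hat / (math.sqrt(v_hat) + eps)
    if decoupled:
        w_new = w_new - lr * weight_decay * w  # shrink the weight directly
    return w_new, m, v

# A parameter with a large gradient: the coupled penalty is swallowed by the scaling.
w_l2, _, _ = adam_like_step(1.0, 100.0, 0.0, 0.0, 1, decoupled=False)
w_dec, _, _ = adam_like_step(1.0, 100.0, 0.0, 0.0, 1, decoupled=True)
print(w_l2, w_dec)
```

In the coupled case the penalty is normalized away together with the large gradient, so the step is about lr either way. In the decoupled case the weight shrinks by an extra lr × weight_decay × w no matter how large the gradient is.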

Learning rate schedules

A fixed learning rate rarely works for the full training run. Early training benefits from exploration; late training needs small steps to settle into a good minimum. Learning rate schedules change η over time.

Common schedule patterns

  • Step decay — multiply η by 0.1 every N epochs (classic ImageNet recipe: drop at epochs 30, 60, 90).
  • Cosine annealing — smoothly decrease η following a cosine curve from peak to near zero. Popular for transformers and long training runs; the SGDR variant periodically resets η to its peak (warm restarts).
  • Linear warmup — ramp η from zero to peak over the first few hundred or thousand steps. Critical for transformer training where large initial steps destabilize attention layers.
  • One-cycle — increase η to a maximum then anneal down, sometimes with momentum inversion. Fast convergence for some vision tasks.
  • Reduce on plateau — monitor validation loss and cut η when improvement stalls. Useful when you do not know the optimal total epoch count.
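
Two of the patterns above, written as plain functions of the epoch or step number. This is a sketch under assumptions: the function names, the peak rate of 3e-4, and the 1000-step warmup are illustrative choices rather than recommended values.

```python
import math

def step_decay(epoch, base_lr=0.1, drop=0.1, every=30):
    """Multiply the learning rate by `drop` once every `every` epochs."""
    return base_lr * (drop ** (epoch // every))

def warmup_cosine(step, total_steps, peak_lr=3e-4, warmup_steps=1000, min_lr=0.0):
    """Linear warmup from zero to peak_lr, then cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))

for epoch in (0, 30, 60, 90):
    print("epoch", epoch, "lr", step_decay(epoch))
for step in (0, 500, 1000, 5500, 10000):
    print("step", step, "lr", warmup_cosine(step, total_steps=10000))
```

Logging the output of a function like this for every step is a cheap way to confirm that the schedule you intended is the one actually applied.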

The schedule and optimizer interact: Adam with a high default η plus no warmup can diverge on large transformers; AdamW with warmup and cosine annealing is the usual recipe for training ViT from scratch. Treat learning rate and schedule as a single tuning decision; see hyperparameter search strategies for systematic approaches.

Optimizer comparison table

Optimizer | Best for | Pros | Cons
SGD + momentum | Vision pretraining (CNNs, ResNet) | Strong generalization when tuned; simple | Needs careful LR schedule; slower to tune
Adam | Quick experiments, GANs, small models | Works out of the box; fast initial convergence | Weight decay coupling; may generalize worse than SGD on vision
AdamW | Transformers, LLM fine-tuning, ViT | Proper weight decay; default for modern NLP/LLM | Can overfit small tabular datasets vs gradient boosting
RMSprop | RNNs, RL policy gradients | Handles non-stationary gradients | Largely replaced by Adam in most domains
AdaGrad | Sparse features (legacy) | Per-feature scaling for rare inputs | Learning rate collapses over time
Lion / Sophia / etc. | Research / memory-constrained training | Lower memory than Adam; emerging results | Less mature tooling; not yet default anywhere

Weight decay vs L2 regularization

Both discourage large weights, but they are not identical in adaptive optimizers. L2 regularization adds a penalty term λ‖w‖² to the loss, so the gradient includes a 2λw term. Weight decay directly shrinks weights each step: w = w − η × gradient − η × λ × w. For vanilla SGD the two are equivalent up to a rescaling of λ (an L2 penalty of λ‖w‖² matches a weight decay coefficient of 2λ). For Adam, the L2 term is divided by the adaptive denominator along with the rest of the gradient, so parameters with large gradients are regularized less aggressively, which is often not what you want.
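
The SGD case can be checked numerically with a single scalar weight. In the sketch below the function names and the sample numbers are assumptions; the point is that an L2 coefficient of l2 produces the same step as a weight decay coefficient of 2 × l2.

```python
def sgd_l2_step(w, grad, lr, l2):
    """SGD on loss + l2 * w**2: the penalty adds 2 * l2 * w to the gradient."""
    return w - lr * (grad + 2.0 * l2 * w)

def sgd_weight_decay_step(w, grad, lr, wd):
    """SGD with direct decay: w - lr * grad - lr * wd * w."""
    return w - lr * grad - lr * wd * w

w, grad, lr, l2 = 0.8, 0.3, 0.1, 0.005
a = sgd_l2_step(w, grad, lr, l2)
b = sgd_weight_decay_step(w, grad, lr, wd=2.0 * l2)
print(a, b)  # the two results agree up to floating-point rounding
```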

Use AdamW when you intend weight decay as regularization. Pair it with early stopping and validation monitoring from cross-validation best practices to catch overfitting before wasting GPU hours.

Gradient clipping and numerical stability

Recurrent networks, transformers on long sequences, and mixed-precision training can produce exploding gradients — single updates so large they corrupt weights irreversibly. Gradient clipping caps the global norm or per-value magnitude before the optimizer step. Transformer training almost always uses global norm clipping (max norm 1.0 is a common default). Clipping is orthogonal to optimizer choice: AdamW plus clipping is standard for LLM pretraining.
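
Global norm clipping treats all gradients as one long vector and rescales them together, which preserves the update direction. A minimal sketch in plain Python follows; the function name and the list-of-lists representation are assumptions for illustration.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0):
    """Rescale a list of gradient vectors so their joint L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for vec in grads for g in vec))
    if total <= max_norm:
        return grads, total  # already within the limit, leave untouched
    scale = max_norm / total
    return [[g * scale for g in vec] for vec in grads], total

clipped, norm_before = clip_by_global_norm([[3.0, 0.0], [4.0]], max_norm=1.0)
print(norm_before, clipped)
```

In PyTorch the corresponding utility is torch.nn.utils.clip_grad_norm_, called between the backward pass and the optimizer step. Logging the pre-clip norm it returns is a useful early warning for instability.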

Watch for vanishing gradients in very deep networks without residual connections or proper initialization — no optimizer fixes a broken architecture. Batch normalization, layer normalization, and residual skip connections exist partly to keep gradient magnitudes in a healthy range for optimizers to work with.

Common mistakes

  • Using Adam instead of AdamW for transformer fine-tuning — you lose proper regularization unless you manually decouple.
  • Skipping learning rate warmup on large models — the first few hundred steps blow up loss or waste the run.
  • Tuning only the optimizer while ignoring batch size — effective learning rate scales with batch size in many recipes (linear scaling rule).
  • Copying ImageNet hyperparameters to a 500-sample medical dataset — small data usually needs smaller η, stronger regularization, and fewer epochs.
  • Comparing optimizers on one seed — stochasticity means run multiple seeds or use validation curves before declaring a winner.
  • Confusing loss plateau with wrong optimizer — data quality, label noise, and model capacity often matter more than SGD vs Adam.

Production checklist

  • Start with domain defaults — AdamW for transformers/LLMs, SGD+momentum for vision from scratch, check framework docs for your model class.
  • Log learning rate per step — confirm your schedule is actually applied (wandb, TensorBoard).
  • Use warmup for any model with attention or >50M parameters.
  • Enable gradient clipping for RNNs, transformers, and mixed-precision runs.
  • Separate weight decay from L2 — use AdamW, not Adam + L2, for intended regularization.
  • Scale learning rate with batch size when increasing parallelism (or use gradient accumulation to simulate larger batches).
  • Validate on a held-out set — optimizer choice should be judged by validation metric, not training loss.
  • Document optimizer config in experiment tracking for reproducibility alongside architecture and data version.
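
The batch-size items in the checklist can be sketched with two small helpers. This is an illustration under assumptions: the function names, the one-parameter squared-error model, and the toy data are hypothetical, and the linear scaling rule is a heuristic rather than a guarantee.

```python
def scaled_lr(base_lr, base_batch, new_batch):
    """Linear scaling rule: grow the learning rate in proportion to batch size."""
    return base_lr * new_batch / base_batch

def mean_grad(w, batch):
    """Mean gradient of (w*x - y)**2 over a batch of (x, y) pairs."""
    return sum(2.0 * (w * x - y) * x for x, y in batch) / len(batch)

def accumulated_grad(w, micro_batches):
    """Average the mean gradients of equally sized micro-batches."""
    return sum(mean_grad(w, mb) for mb in micro_batches) / len(micro_batches)

data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1), (4.0, 8.2)]
full = mean_grad(0.5, data)
accum = accumulated_grad(0.5, [data[:2], data[2:]])
print(full, accum, scaled_lr(0.1, 256, 1024))
```

When the micro-batches have equal size, averaging their gradients before one optimizer step matches the gradient of the combined batch. That is why gradient accumulation can stand in for a larger batch when memory is limited.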

Key takeaways

  • Optimizers turn gradients into weight updates; the learning rate and schedule matter as much as the algorithm name.
  • SGD + momentum still excels for vision pretraining when paired with a tuned schedule; AdamW is the default for transformers and LLM fine-tuning.
  • Adaptive methods (Adam, RMSprop) scale per-parameter step sizes based on gradient history — faster convergence, different generalization trade-offs.
  • Weight decay in AdamW is decoupled from the adaptive update; do not assume L2 penalty and weight decay are interchangeable under Adam.
  • Warmup + cosine or step decay is standard for long training runs; monitor validation loss, not just training loss, when comparing optimizers.
