Learning rate scheduling explained
The learning rate controls how far each gradient step moves the weights during deep learning training. Too high and loss oscillates or diverges; too low and training crawls toward a mediocre minimum. A learning rate schedule changes that step size over time, typically starting larger to explore the loss landscape and then shrinking to settle into a good minimum. Modern optimizers like AdamW adapt per-parameter steps, but a global schedule still matters: transformer pretraining relies on warmup plus cosine decay, many computer vision recipes use step decay, and practitioners lean on ReduceLROnPlateau when validation loss stalls. This guide explains why fixed rates fail, the major schedule families, how warmup interacts with large batches and transformers, how to combine schedules with hyperparameter tuning, and a decision table for picking the right strategy.
Why a constant learning rate is rarely enough
Neural network loss surfaces are high-dimensional and non-convex. Early in training, large gradients point toward broad basins — a higher learning rate helps escape poor initializations and saddle regions quickly. Late in training, the same step size overshoots fine structure near the minimum, causing validation loss to bounce or plateau. Schedules encode the intuition that exploration should dominate early and refinement late.
Mini-batch noise adds another dimension: a rate that works at batch size 32 may diverge at batch size 4,096 unless you scale the rate (linear scaling rule) and add warmup. Without scheduling, you often end up with a compromise rate that is suboptimal for both phases — slower convergence and worse final accuracy than a well-chosen decay curve.
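The linear scaling rule mentioned above is simple arithmetic, sketched here under the assumption that a baseline rate was tuned at some reference batch size. The function name and the baseline numbers are illustrative, not taken from any particular recipe.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: multiply the rate by the batch-size ratio."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Hypothetical baseline: lr 0.01 tuned at batch size 32, moved to batch size 4,096.
scaled_lr = scale_learning_rate(0.01, base_batch_size=32, new_batch_size=4096)
print(f"{scaled_lr:.2f}")  # 1.28
```

A 128x larger rate is rarely safe to apply from step 0, which is why the rule is normally paired with warmup.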
Core concepts: steps, epochs, and base learning rate
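Schedules are defined over optimizer steps (one per mini-batch) or epochs (one full pass through the training set). Frameworks differ: PyTorch schedulers advance once per call to scheduler.step(), so the unit is whatever your training loop makes it (per epoch in many classic recipes, per batch for OneCycleLR and most warmup schedules); tf.keras.optimizers.schedules objects count optimizer steps, while the Keras LearningRateScheduler callback works per epoch. Always confirm which unit your framework uses before comparing papers.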
The base learning rate (lr₀) is the peak or initial value the schedule modulates. lr₀ is itself a hyperparameter; learning rate range tests and one-cycle policies automate part of the search for it. The schedule outputs either a multiplier on lr₀ or an absolute rate lr(t) at step t.
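Converting between the two units is a common source of mistakes, so here is a minimal sketch of the conversion. The helper names and the dataset size are assumptions chosen for illustration.

```python
import math


def steps_per_epoch(num_examples: int, batch_size: int, drop_last: bool = False) -> int:
    """Optimizer steps in one pass over the data."""
    if drop_last:
        return num_examples // batch_size
    return math.ceil(num_examples / batch_size)


def epochs_to_steps(epoch_milestones, num_examples, batch_size):
    """Translate epoch-based milestones into step counts."""
    per_epoch = steps_per_epoch(num_examples, batch_size)
    return [epoch * per_epoch for epoch in epoch_milestones]


# Hypothetical setup: 50,000 training examples, batch size 128.
print(epochs_to_steps([30, 60, 90], num_examples=50_000, batch_size=128))
# [11730, 23460, 35190]
```

If a paper states milestones in epochs and your scheduler is stepped per batch, pass it the converted values rather than the raw epoch numbers.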
Step decay and exponential decay
Step decay multiplies the learning rate by a factor
γ (often 0.1) every N epochs or steps. ResNet and
many CNN recipes use step decay at fixed milestones (e.g. epochs 30, 60, 90).
It is simple, predictable, and easy to reproduce from papers.
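As a minimal sketch of the rule just described, step decay can be written as one line of arithmetic. The function name and arguments are illustrative; `t` is in whatever unit (epochs or steps) the milestones use.

```python
def step_decay(lr0: float, gamma: float, step_size: int, t: int) -> float:
    """Multiply lr0 by gamma once for every full `step_size` interval elapsed."""
    return lr0 * gamma ** (t // step_size)


# lr0 = 0.1, gamma = 0.1, drop every 30 epochs.
for epoch in (0, 29, 30, 60):
    print(epoch, f"{step_decay(0.1, 0.1, 30, epoch):.4f}")
# 0 0.1000
# 29 0.1000
# 30 0.0100
# 60 0.0010
```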
Exponential decay applies a continuous shrink: lr(t) = lr₀ × γ^t. Polynomial decay (lr(t) = lr₀ × (1 - t/T)^p) is a smoother variant common in segmentation and detection pipelines. These schedules assume you know total training length upfront: if you extend training, the tail rates may already be too small unless you retune.
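Both formulas translate directly into code. The sketch below follows the two expressions above; clamping `t` at `total_steps` in the polynomial version is an added assumption so the rate stays at zero instead of going negative or complex if training runs past T.

```python
def exponential_decay(lr0: float, gamma: float, t: int) -> float:
    """lr(t) = lr0 * gamma^t"""
    return lr0 * gamma ** t


def polynomial_decay(lr0: float, t: int, total_steps: int, power: float = 1.0) -> float:
    """lr(t) = lr0 * (1 - t/T)^p, with t clamped to T (assumption)."""
    t = min(t, total_steps)
    return lr0 * (1 - t / total_steps) ** power


print(exponential_decay(1.0, 0.5, 3))            # 0.125
print(polynomial_decay(0.01, 50, 100, power=2))  # 0.0025
print(polynomial_decay(0.01, 150, 100, power=2)) # 0.0
```

The last call shows the caveat from the text: once t passes T there is no rate left, so extending a run means choosing a new T.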
Cosine annealing
Cosine annealing decays the rate along a cosine curve from lr₀ down to a minimum lr_min (often 0) over T steps:
lr(t) = lr_min + ½(lr₀ - lr_min)(1 + cos(πt/T))
The curve spends more time at moderate rates than step decay, which some practitioners find yields better generalization. Cosine annealing with warm restarts (SGDR) periodically resets t to 0, jumping the rate back up to lr₀ (often with each cycle longer than the last), which lets the optimizer escape shallow minima. This is useful in vision transfer learning when total budget is uncertain.
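The cosine formula and a restart variant can be sketched as follows. The restart function is one common formulation, assumed here for illustration: each restart returns to the same lr₀ and each cycle is `cycle_mult` times longer than the previous one. Function and parameter names are hypothetical.

```python
import math


def cosine_annealing(lr0: float, lr_min: float, t: int, T: int) -> float:
    """lr(t) = lr_min + 0.5 * (lr0 - lr_min) * (1 + cos(pi * t / T))"""
    return lr_min + 0.5 * (lr0 - lr_min) * (1 + math.cos(math.pi * t / T))


def cosine_with_restarts(lr0, lr_min, t, first_cycle, cycle_mult=2):
    """Cosine annealing that restarts at lr0; each cycle is cycle_mult times longer."""
    cycle_len = first_cycle
    while t >= cycle_len:        # find the cycle that contains step t
        t -= cycle_len
        cycle_len *= cycle_mult
    return cosine_annealing(lr0, lr_min, t, cycle_len)


# With first_cycle=10 the rate is back at lr0 at steps 0, 10, and 30.
for step in (0, 9, 10, 29, 30):
    print(step, f"{cosine_with_restarts(1.0, 0.0, step, first_cycle=10):.3f}")
```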
Learning rate warmup
Warmup linearly ramps the learning rate from 0 (or a tiny value) to lr₀ over the first W steps. It became standard for transformer pretraining because gradients at initialization are large and change quickly, and adaptive optimizers have not yet accumulated reliable moment estimates; a cold start with full lr₀ can destabilize attention layers and layer-norm statistics.
Typical warmup lengths: 1–10% of total steps for BERT-scale models, sometimes thousands of steps for billion-parameter LLMs. Warmup is almost always paired with cosine or linear decay afterward. The Hugging Face Trainer, for example, defaults to a linear decay schedule, with warmup length set separately (zero warmup steps unless you configure them).
Rule of thumb: if you increase batch size by factor k and apply
linear scaling to lr₀, extend warmup proportionally or watch for
loss spikes in the first few hundred steps.
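A minimal sketch of linear warmup and of the rule of thumb above. Both helpers are hypothetical; scaling warmup by exactly the same factor k is the heuristic from the text, not a guarantee of stability.

```python
def linear_warmup(lr0: float, t: int, warmup_steps: int) -> float:
    """Ramp linearly from 0 to lr0 over warmup_steps, then hold at lr0."""
    if warmup_steps <= 0 or t >= warmup_steps:
        return lr0
    return lr0 * t / warmup_steps


def scale_lr_and_warmup(lr0: float, warmup_steps: int, k: float):
    """Batch size grew by factor k: scale lr0 linearly and stretch warmup to match."""
    return lr0 * k, round(warmup_steps * k)


print(linear_warmup(2e-5, 250, 500))         # halfway through warmup: 1e-05
print(scale_lr_and_warmup(1e-4, 1000, k=4))  # (0.0004, 4000)
```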
ReduceLROnPlateau
Unlike time-based schedules, ReduceLROnPlateau monitors a
validation metric (usually loss) and multiplies lr by a factor
(e.g. 0.5) when improvement stalls for patience epochs. It
adapts to actual training dynamics — helpful when you do not know how long
convergence will take or when data drift changes difficulty mid-run.
Caveats: plateau detection can fire on noisy validation curves (use a moving average or larger patience), and it interacts poorly with strong data augmentation that keeps validation loss artificially flat. Always log the schedule events to confirm reductions happen at sensible times.
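The plateau logic is easy to see in a small standalone sketch. This is not the PyTorch or Keras implementation; it is a simplified class written for illustration that cuts the rate once more than `patience` consecutive epochs pass without improvement, and it records each cut so the events can be logged.

```python
class PlateauReducer:
    """Toy plateau-based scheduler: cut lr when val loss stops improving."""

    def __init__(self, lr, factor=0.5, patience=3, min_delta=0.0, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0
        self.events = []  # (epoch, old_lr, new_lr) for logging

    def step(self, val_loss, epoch):
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs > self.patience:
                new_lr = max(self.lr * self.factor, self.min_lr)
                if new_lr < self.lr:
                    self.events.append((epoch, self.lr, new_lr))
                self.lr = new_lr
                self.bad_epochs = 0
        return self.lr


reducer = PlateauReducer(lr=0.1, factor=0.5, patience=2)
val_losses = [1.0, 0.8, 0.8, 0.8, 0.8, 0.7]  # made-up validation curve
for epoch, loss in enumerate(val_losses):
    reducer.step(loss, epoch)
print(reducer.events)  # [(4, 0.1, 0.05)]
```

Raising `min_delta` or `patience` is the simplest way to make the trigger less sensitive to a noisy validation curve.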
One-cycle policy and cyclical learning rates
Leslie Smith's one-cycle policy ramps lr from
a low value up to a maximum (found via a range test), then anneals down
below the starting value in a single cycle spanning total training. Momentum
is often inverted (high when lr is low). One-cycle can cut training time for
small and medium models on fixed budgets.
Cyclical learning rates (CLR) oscillate between bounds across shorter cycles. They encourage periodic exploration and can improve generalization on some vision tasks, though transformers and LLM fine-tuning rarely use CLR in production — cosine + warmup dominates there.
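Both shapes can be sketched in a few lines. The code below assumes linear ramps in both directions and illustrative default ratios; real implementations often anneal with a cosine curve instead, and the function and parameter names here are hypothetical.

```python
def one_cycle(t, total_steps, max_lr, pct_up=0.3, div_factor=25.0, final_div_factor=1e4):
    """Single cycle: ramp from max_lr/div_factor up to max_lr, then anneal below the start."""
    initial_lr = max_lr / div_factor
    final_lr = initial_lr / final_div_factor
    up_steps = max(1, round(pct_up * total_steps))
    if t < up_steps:
        return initial_lr + (max_lr - initial_lr) * t / up_steps
    down_steps = max(1, total_steps - up_steps)
    progress = min(1.0, (t - up_steps) / down_steps)
    return max_lr + (final_lr - max_lr) * progress


def triangular_clr(t, base_lr, max_lr, half_cycle):
    """Cyclical rate: linear up for half_cycle steps, linear down for half_cycle steps."""
    pos = t % (2 * half_cycle)
    frac = pos / half_cycle if pos < half_cycle else 2 - pos / half_cycle
    return base_lr + (max_lr - base_lr) * frac


for step in (0, 300, 1000):
    print(step, f"{one_cycle(step, total_steps=1000, max_lr=0.1):.2e}")
```

The final rate ends far below the starting rate, matching the description of annealing below the starting value.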
Schedules and adaptive optimizers
Adam, AdamW, and RMSprop maintain per-parameter adaptive rates. A global schedule still multiplies those effective steps — it is not redundant. AdamW decouples weight decay from the gradient update; pairing AdamW with cosine decay and warmup is the de facto recipe for fine-tuning language models.
SGD with momentum + step decay remains competitive for large-batch ImageNet training when tuned carefully. Do not assume Adam's adaptivity removes the need for scheduling; papers that report SOTA results almost always include both an optimizer choice and a schedule.
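To see why a schedule is not redundant with an adaptive optimizer, it helps to look at where the global rate enters an Adam-style update. The sketch below is a scalar version of the standard Adam update written for illustration (no weight decay); the helper `first_update` is hypothetical.

```python
import math


def adam_update(grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return the amount to subtract from the parameter for one Adam step."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
    m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias-corrected first moment
    v_hat = state["v"] / (1 - beta2 ** state["t"])  # bias-corrected second moment
    adaptive_direction = m_hat / (math.sqrt(v_hat) + eps)
    return lr * adaptive_direction  # the scheduled lr scales the adaptive step


def first_update(lr, grad=0.3):
    return adam_update(grad, {"t": 0, "m": 0.0, "v": 0.0}, lr)


full = first_update(lr=1e-3)
halved = first_update(lr=5e-4)
print(full / halved)  # halving the scheduled lr halves the update
```

The per-parameter normalization decides the direction and relative size of each coordinate's step; the schedule still sets the overall scale.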
Schedule comparison table
| Schedule | Best for | Needs total steps known? | Main risk |
|---|---|---|---|
| Step decay | CNNs, classical vision | Yes (milestones) | Abrupt drops; mistimed steps |
| Cosine annealing | Transformers, fine-tuning | Yes | Rate reaches near-zero too early if T is set too short |
| Warmup + cosine | LLM pretrain / fine-tune | Yes | Too-short warmup on large batches |
| ReduceLROnPlateau | Unknown budget, small tabular | No | Premature cuts on noisy val loss |
| One-cycle | Fast vision experiments | Yes | Max lr mis-estimated from range test |
| Constant | Baselines only | N/A | Slow convergence, poor final loss |
Decision guide: which schedule should you use?
| Scenario | Recommended schedule |
|---|---|
| Fine-tuning a pretrained transformer | Linear warmup (5–10% steps) + cosine or linear decay to 0 |
| Training ResNet from scratch on ImageNet | SGD + step decay at epochs 30/60/90, or cosine with warmup |
| Small dataset, unsure how long to train | AdamW + ReduceLROnPlateau on validation loss |
| Fixed GPU budget, need fastest convergence | One-cycle after lr range test |
| Large batch distributed training | Linear lr scaling + extended warmup + cosine decay |
| Reproducing a published paper | Match their schedule exactly — results are schedule-sensitive |
Worked example: cosine with warmup in practice
Suppose you fine-tune a 7B-parameter model for 10,000 steps with peak lr₀ = 2e-5. Set warmup to 500 steps (5%): lr rises linearly from 0 to 2e-5. For steps 500–10,000, apply cosine decay down to lr_min = 0. At step 5,250 (the midpoint of the decay phase), lr is exactly halfway between peak and minimum, or 1e-5. Log lr each step alongside train loss: if loss spikes during warmup, reduce lr₀ or lengthen warmup; if loss flatlines early while lr is still high, the peak may be too conservative.
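The numbers in this example can be checked with a short sketch. The function below hard-codes the example's values as defaults; its name is hypothetical, and it is a plain-Python illustration, not any framework's scheduler.

```python
import math


def warmup_cosine(step, peak_lr=2e-5, warmup_steps=500, total_steps=10_000, lr_min=0.0):
    """Linear warmup to peak_lr, then cosine decay to lr_min."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return lr_min + 0.5 * (peak_lr - lr_min) * (1 + math.cos(math.pi * progress))


for step in (0, 250, 500, 5_250, 10_000):
    print(step, f"{warmup_cosine(step):.2e}")
# 0 0.00e+00
# 250 1.00e-05
# 500 2.00e-05
# 5250 1.00e-05
# 10000 0.00e+00
```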
For a smaller CNN trained 90 epochs with batch 256: start
lr₀ = 0.1 with SGD momentum 0.9, multiply by 0.1 at epochs 30
and 60. Validation accuracy often jumps right after each drop — a sign the
schedule is doing its job.
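The CNN recipe above uses explicit milestones instead of a fixed interval, which can be sketched with a sorted-list lookup. The function name is illustrative, and the defaults are the values from the example.

```python
from bisect import bisect_right


def multistep_lr(epoch, lr0=0.1, milestones=(30, 60), gamma=0.1):
    """Multiply lr0 by gamma once for each milestone already reached."""
    drops = bisect_right(milestones, epoch)  # milestones <= epoch
    return lr0 * gamma ** drops


for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, f"{multistep_lr(epoch):.4f}")
# 0 0.1000
# 29 0.1000
# 30 0.0100
# 59 0.0100
# 60 0.0010
# 89 0.0010
```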
Common mistakes
- Applying step decay in epochs while the framework counts steps — milestones arrive too early or never.
- Resuming training from a checkpoint without restoring scheduler state: lr jumps back to lr₀.
- Using ReduceLROnPlateau on training loss instead of validation loss: the schedule never triggers on overfitting runs.
- Skipping warmup when scaling batch size on transformers — divergence in the first 100 steps.
- Tuning only lr₀ while ignoring schedule shape: a bad decay curve wastes a good peak rate.
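The checkpoint mistake in the list above is easy to reproduce with a toy scheduler. The class below is a hypothetical stand-in written for illustration; its `state_dict` and `load_state_dict` method names mirror the convention PyTorch schedulers use, and the JSON string stands in for a real checkpoint file.

```python
import json


class StepDecayScheduler:
    """Toy scheduler whose only state is the number of steps taken."""

    def __init__(self, lr0, gamma, step_size):
        self.lr0, self.gamma, self.step_size = lr0, gamma, step_size
        self.last_step = 0

    def step(self):
        self.last_step += 1
        return self.current_lr()

    def current_lr(self):
        return self.lr0 * self.gamma ** (self.last_step // self.step_size)

    def state_dict(self):
        return {"last_step": self.last_step}

    def load_state_dict(self, state):
        self.last_step = state["last_step"]


sched = StepDecayScheduler(lr0=0.1, gamma=0.1, step_size=100)
for _ in range(250):
    sched.step()
checkpoint = json.dumps({"scheduler": sched.state_dict()})

resumed = StepDecayScheduler(lr0=0.1, gamma=0.1, step_size=100)
lr_without_restore = resumed.current_lr()  # back at lr0: the bug
resumed.load_state_dict(json.loads(checkpoint)["scheduler"])
lr_with_restore = resumed.current_lr()     # matches the interrupted run
print(lr_without_restore, lr_with_restore)
```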
Practitioner checklist
- Log learning rate alongside loss every step or epoch.
- Confirm whether your scheduler counts steps or epochs.
- For transformers, default to warmup + cosine unless ablations say otherwise.
- Run a short lr range test before committing to one-cycle or a peak rate.
- Save and restore scheduler state in checkpoints for long runs.
- When increasing batch size, consider linear lr scaling and longer warmup.
- Compare schedules with the same total compute budget, not the same epoch count.
- Watch validation metrics after each manual step decay — timing matters.
- Document schedule parameters in experiment tracking (WandB, MLflow).
- Re-tune schedule when changing optimizer (SGD to AdamW is not a drop-in swap).
Key takeaways
- Learning rate schedules change lr over time, usually decaying it, so early training explores and late training refines.
- Warmup stabilizes large-batch and transformer training before full lr₀ kicks in.
- Cosine annealing is the default for modern NLP; step decay remains common in vision.
- ReduceLROnPlateau adapts to validation stagnation when total training length is unknown.
- Schedules interact with optimizer choice — tune them together, not in isolation.