
Learning rate scheduling explained

The learning rate controls how far each gradient step moves the weights during deep learning training. Too high and the loss oscillates or diverges; too low and training crawls toward a mediocre minimum. A learning rate schedule changes that step size over time, typically starting larger to explore the loss landscape and then shrinking to settle into a good minimum. Modern optimizers like AdamW adapt per-parameter steps, but a global schedule still matters: transformer pretraining commonly relies on warmup plus cosine decay, many computer vision recipes use step decay, and practitioners lean on ReduceLROnPlateau when validation loss stalls. This guide explains why fixed rates fail, the major schedule families, how warmup interacts with large batches and transformers, how schedules interact with adaptive optimizers, and a decision table for picking the right strategy.

Why a constant learning rate is rarely enough

Neural network loss surfaces are high-dimensional and non-convex. Early in training, large gradients point toward broad basins — a higher learning rate helps escape poor initializations and saddle regions quickly. Late in training, the same step size overshoots fine structure near the minimum, causing validation loss to bounce or plateau. Schedules encode the intuition that exploration should dominate early and refinement late.

Mini-batch noise adds another dimension: a rate that works at batch size 32 may diverge at batch size 4,096 unless you scale the rate (linear scaling rule) and add warmup. Without scheduling, you often end up with a compromise rate that is suboptimal for both phases — slower convergence and worse final accuracy than a well-chosen decay curve.
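The linear scaling rule mentioned above can be sketched in a few lines. This is a minimal illustration, not a tuning recipe: the function name and the example numbers are hypothetical, and the rule itself is a heuristic that tends to break down at very large batch sizes.

```python
def scale_learning_rate(base_lr: float, base_batch_size: int, new_batch_size: int) -> float:
    """Linear scaling rule: grow the learning rate in proportion to the batch size."""
    if base_batch_size <= 0 or new_batch_size <= 0:
        raise ValueError("batch sizes must be positive")
    return base_lr * new_batch_size / base_batch_size


# Hypothetical numbers: a rate tuned at batch size 32, moved to batch size 4,096.
scaled_lr = scale_learning_rate(base_lr=0.0125, base_batch_size=32, new_batch_size=4096)
print(scaled_lr)
```

The scaled value is a starting point for the peak rate; it is normally combined with warmup rather than applied from step 0.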

Core concepts: steps, epochs, and base learning rate

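Schedules are defined over optimizer steps (one per mini-batch) or epochs (one full pass through the training set). The unit depends on how the scheduler is driven: a PyTorch scheduler advances once per call to scheduler.step(), so it counts epochs or steps depending on where you place that call, while Keras offers both step-based LearningRateSchedule objects and an epoch-based LearningRateScheduler callback. Always confirm which unit your setup uses before copying numbers from a paper.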

The base learning rate (lr₀) is the peak or initial value the schedule modulates. Finding lr₀ is itself a hyperparameter — learning rate range tests and one-cycle policies automate part of that search. The schedule outputs a multiplier or absolute rate lr(t) at step t.

Step decay and exponential decay

Step decay multiplies the learning rate by a factor γ (often 0.1) every N epochs or steps. ResNet and many CNN recipes use step decay at fixed milestones (e.g. epochs 30, 60, 90). It is simple, predictable, and easy to reproduce from papers.
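A milestone-based step decay can be written as a pure function of the epoch. The sketch below is an illustration with a hypothetical function name; it assumes the rate drops at the start of each milestone epoch, and the milestones 30 and 60 are example values.

```python
from bisect import bisect_right


def step_decay_lr(epoch, base_lr, milestones, gamma=0.1):
    """Return base_lr multiplied by gamma once for every milestone already reached."""
    drops = bisect_right(sorted(milestones), epoch)  # number of milestones <= epoch
    return base_lr * gamma ** drops


# Print the rate just before and just after each milestone.
for epoch in (0, 29, 30, 59, 60, 89):
    print(epoch, step_decay_lr(epoch, base_lr=0.1, milestones=[30, 60]))
```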

Exponential decay applies a continuous shrink: lr(t) = lr₀ × γ^t. Polynomial decay (lr(t) = lr₀ × (1 - t/T)^p) is a smoother variant common in segmentation and detection pipelines. Polynomial decay needs the total length T explicitly, and γ in exponential decay is usually tuned for a planned run length, so if you extend training the tail rates may already be too small unless you retune.
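Both formulas translate directly into code. In this sketch the function names are hypothetical, and clamping t at T in the polynomial version is an assumption made so the rate stays at zero instead of going negative if training runs past the planned length.

```python
def exponential_decay_lr(t, base_lr, gamma):
    """lr(t) = lr0 * gamma^t"""
    return base_lr * gamma ** t


def polynomial_decay_lr(t, base_lr, total_steps, power=1.0):
    """lr(t) = lr0 * (1 - t/T)^p, with t clamped to T (an assumption of this sketch)."""
    t = min(t, total_steps)
    return base_lr * (1 - t / total_steps) ** power


# Compare the two curves at a few points of a hypothetical 100-step run.
for t in (0, 50, 100):
    print(t, exponential_decay_lr(t, 0.1, 0.95), polynomial_decay_lr(t, 0.1, 100, power=2.0))
```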

Cosine annealing

Cosine annealing decays the rate along a cosine curve from lr₀ down to a minimum lr_min (often 0) over T steps:

lr(t) = lr_min + ½(lr₀ - lr_min)(1 + cos(πt/T))

The curve spends more time at moderate rates than step decay, which some practitioners find yields better generalization. Cosine annealing with warm restarts (SGDR) periodically resets t to 0, which jumps the rate back up to lr₀ (later cycles are often made longer), letting the optimizer escape shallow minima. This is useful in vision transfer learning when the total budget is uncertain.
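The cosine formula and a restart variant can be sketched as follows. The function names are hypothetical, and the restart version restarts at the same peak lr₀ with fixed-length cycles; both are simplifying assumptions of this sketch.

```python
import math


def cosine_annealing_lr(t, base_lr, total_steps, min_lr=0.0):
    """lr(t) = lr_min + 0.5 * (lr0 - lr_min) * (1 + cos(pi * t / T)), with t clamped to T."""
    t = min(t, total_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t / total_steps))


def cosine_warm_restarts_lr(t, base_lr, cycle_steps, min_lr=0.0):
    """Same curve, but t wraps to 0 at the start of every fixed-length cycle."""
    t_in_cycle = t % cycle_steps
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t_in_cycle / cycle_steps))


# The restart variant returns to the peak at the start of each 100-step cycle.
for t in (0, 50, 99, 100, 150):
    print(t, cosine_warm_restarts_lr(t, base_lr=1e-3, cycle_steps=100))
```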

Learning rate warmup

Warmup linearly ramps the learning rate from 0 (or a tiny value) to lr₀ over the first W steps. It became standard for transformer pretraining because the first updates are the least trustworthy: gradients at initialization are large and erratic, and adaptive optimizers such as Adam have not yet accumulated reliable moment estimates. A cold start at full lr₀ can destabilize attention layers and layer-norm statistics.

Typical warmup lengths: 1–10% of total steps for BERT-scale models, sometimes thousands of steps for billion-parameter LLMs. Warmup is almost always paired with cosine or linear decay afterward. The Hugging Face Trainer, for example, defaults to a linear decay schedule and adds linear warmup once you set a nonzero warmup length.

Rule of thumb: if you increase batch size by factor k and apply linear scaling to lr₀, extend warmup proportionally or watch for loss spikes in the first few hundred steps.
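A linear ramp followed by linear decay can be sketched as a single function of the step. The function name and the numbers are hypothetical; the last lines apply the rule of thumb by stretching a 500-step warmup by an assumed batch-size factor k = 4.

```python
def warmup_linear_decay_lr(step, peak_lr, warmup_steps, total_steps):
    """Ramp linearly from 0 to peak_lr, then decay linearly to 0 at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Rule of thumb: batch size grows by k, so peak rate and warmup length both grow by k.
k = 4
base_peak_lr, base_warmup = 1e-4, 500
scaled_peak_lr, scaled_warmup = base_peak_lr * k, base_warmup * k
print(warmup_linear_decay_lr(1000, scaled_peak_lr, scaled_warmup, total_steps=20_000))
```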

ReduceLROnPlateau

Unlike time-based schedules, ReduceLROnPlateau monitors a validation metric (usually loss) and multiplies lr by a factor (e.g. 0.5) when the metric fails to improve for more than a set number of epochs (the patience). It adapts to actual training dynamics, which helps when you do not know how long convergence will take or when data drift changes difficulty mid-run.

Caveats: plateau detection can fire on noisy validation curves (use a moving average or larger patience), and it interacts poorly with strong data augmentation that keeps validation loss artificially flat. Always log the schedule events to confirm reductions happen at sensible times.
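The plateau logic is small enough to sketch from scratch. The class below is a hypothetical, simplified stand-in rather than any framework's implementation; it assumes a lower-is-better metric and cuts the rate once the number of consecutive non-improving epochs exceeds the patience. The validation losses are made-up values.

```python
class PlateauReducer:
    """Hypothetical minimal plateau scheduler for a metric where lower is better."""

    def __init__(self, lr, factor=0.5, patience=2, min_delta=0.0, min_lr=0.0):
        self.lr = lr
        self.factor = factor
        self.patience = patience
        self.min_delta = min_delta
        self.min_lr = min_lr
        self.best = float("inf")
        self.bad_epochs = 0

    def step(self, val_loss):
        """Record one epoch's validation loss and return the (possibly reduced) rate."""
        if val_loss < self.best - self.min_delta:
            self.best = val_loss
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
        if self.bad_epochs > self.patience:
            self.lr = max(self.lr * self.factor, self.min_lr)
            self.bad_epochs = 0
        return self.lr


# Made-up validation losses: improvement stalls after the third epoch.
reducer = PlateauReducer(lr=1e-3, factor=0.5, patience=2)
lr_history = [reducer.step(loss) for loss in (1.0, 0.8, 0.79, 0.81, 0.80, 0.82, 0.78)]
print(lr_history)
```

Logging lr_history next to the validation curve is one way to confirm that reductions fire at sensible times.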

One-cycle policy and cyclical learning rates

Leslie Smith's one-cycle policy ramps lr from a low value up to a maximum (found via a range test), then anneals down below the starting value in a single cycle spanning total training. Momentum is often inverted (high when lr is low). One-cycle can cut training time for small and medium models on fixed budgets.
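The shape of a one-cycle schedule can be sketched with two linear ramps. The function is a hypothetical standalone; the parameter names pct_start, div_factor, and final_div_factor mirror those of PyTorch's OneCycleLR, but the default values and the linear (rather than cosine) interpolation are assumptions of this sketch.

```python
def one_cycle_lr(step, total_steps, max_lr, pct_start=0.3, div_factor=25.0, final_div_factor=1e4):
    """Ramp from max_lr/div_factor up to max_lr, then anneal below the starting value."""
    if not 0.0 < pct_start < 1.0:
        raise ValueError("pct_start must be strictly between 0 and 1")
    initial_lr = max_lr / div_factor
    final_lr = initial_lr / final_div_factor  # ends below the starting rate
    peak_step = pct_start * total_steps
    if step <= peak_step:
        frac = step / peak_step
        return initial_lr + frac * (max_lr - initial_lr)
    frac = min(1.0, (step - peak_step) / (total_steps - peak_step))
    return max_lr + frac * (final_lr - max_lr)


# Hypothetical 1,000-step run with a peak of 0.1 taken from a range test.
for step in (0, 150, 300, 650, 1000):
    print(step, one_cycle_lr(step, total_steps=1000, max_lr=0.1))
```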

Cyclical learning rates (CLR) oscillate between bounds across shorter cycles. They encourage periodic exploration and can improve generalization on some vision tasks, though transformers and LLM fine-tuning rarely use CLR in production — cosine + warmup dominates there.
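The simplest cyclical policy is a triangular wave between the two bounds. The sketch below follows the commonly cited triangular formula; the function name is hypothetical and step_size (the number of steps in half a cycle) and the bounds are example values.

```python
import math


def triangular_clr(step, base_lr, max_lr, step_size):
    """Triangular cyclical rate: base_lr -> max_lr -> base_lr every 2 * step_size steps."""
    cycle = math.floor(1 + step / (2 * step_size))
    x = abs(step / step_size - 2 * cycle + 1)  # 1 at the cycle edges, 0 at the peak
    return base_lr + (max_lr - base_lr) * max(0.0, 1 - x)


# One full cycle of a hypothetical schedule with a 200-step half-cycle.
for step in (0, 100, 200, 300, 400):
    print(step, triangular_clr(step, base_lr=1e-4, max_lr=1e-3, step_size=200))
```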

Schedules and adaptive optimizers

Adam, AdamW, and RMSprop maintain per-parameter adaptive rates. A global schedule still multiplies those effective steps — it is not redundant. AdamW decouples weight decay from the gradient update; pairing AdamW with cosine decay and warmup is the de facto recipe for fine-tuning language models.
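A single-parameter Adam step makes the point concrete. This is a textbook-style sketch with a hypothetical function name, not a framework implementation: on the first step the update has magnitude close to lr whether the gradient is 0.01 or 100, so the adaptivity normalizes gradient scale while the global rate still sets the step size.

```python
import math


def adam_update(grad, m, v, step, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step for a scalar parameter; step counts from 1. Returns (delta, m, v)."""
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad * grad
    m_hat = m / (1 - beta1 ** step)  # bias-corrected first moment
    v_hat = v / (1 - beta2 ** step)  # bias-corrected second moment
    delta = -lr * m_hat / (math.sqrt(v_hat) + eps)
    return delta, m, v


# Very different gradient scales give nearly the same first update.
delta_small, _, _ = adam_update(grad=0.01, m=0.0, v=0.0, step=1, lr=1e-3)
delta_large, _, _ = adam_update(grad=100.0, m=0.0, v=0.0, step=1, lr=1e-3)
# Halving the scheduled rate halves the update.
delta_halved, _, _ = adam_update(grad=100.0, m=0.0, v=0.0, step=1, lr=5e-4)
print(delta_small, delta_large, delta_halved)
```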

SGD with momentum + step decay remains competitive for large-batch ImageNet training when tuned carefully. Do not assume Adam's adaptivity removes the need for scheduling; papers that report SOTA results almost always include both an optimizer choice and a schedule.

Schedule comparison table

Schedule	Best for	Needs total steps known?	Main risk
Step decay	CNNs, classical vision	Yes (milestones)	Abrupt drops; mistimed steps
Cosine annealing	Transformers, fine-tuning	Yes	Reaches near-zero lr too early if T is underestimated
Warmup + cosine	LLM pretrain / fine-tune	Yes	Too-short warmup on large batches
ReduceLROnPlateau	Unknown budget, small tabular	No	Premature cuts on noisy val loss
One-cycle	Fast vision experiments	Yes	Max lr mis-estimated from range test
Constant	Baselines only	N/A	Slow convergence, poor final loss

Decision guide: which schedule should you use?

Scenario	Recommended schedule
Fine-tuning a pretrained transformer	Linear warmup (5–10% steps) + cosine or linear decay to 0
Training ResNet from scratch on ImageNet	SGD + step decay at epochs 30/60/90, or cosine with warmup
Small dataset, unsure how long to train	AdamW + ReduceLROnPlateau on validation loss
Fixed GPU budget, need fastest convergence	One-cycle after lr range test
Large batch distributed training	Linear lr scaling + extended warmup + cosine decay
Reproducing a published paper	Match their schedule exactly — results are schedule-sensitive

Worked example: cosine with warmup in practice

Suppose you fine-tune a 7B-parameter model for 10,000 steps with peak lr₀ = 2e-5. Set warmup to 500 steps (5%): lr rises linearly from 0 to 2e-5. For steps 500–10,000, apply cosine decay down to lr_min = 0. At step 5,250 (the midpoint of the decay phase), the cosine term is zero, so lr is exactly half the peak: 1e-5. Log lr each step alongside train loss. If loss spikes during warmup, reduce lr₀ or lengthen warmup; if loss flatlines early while lr is still high, the peak may be too conservative.
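The fine-tuning schedule in this example can be checked numerically with a short sketch. The function name is hypothetical; the defaults are the numbers from the paragraph above, and measuring decay progress from the end of warmup is an assumption consistent with that description.

```python
import math


def warmup_cosine_lr(step, peak_lr=2e-5, warmup_steps=500, total_steps=10_000, min_lr=0.0):
    """Linear warmup to peak_lr, then cosine decay to min_lr at total_steps."""
    if step < warmup_steps:
        return peak_lr * step / warmup_steps
    progress = min(1.0, (step - warmup_steps) / (total_steps - warmup_steps))
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))


# Rate at the start, mid-warmup, the peak, the decay midpoint, and the end.
for step in (0, 250, 500, 5250, 10_000):
    print(step, warmup_cosine_lr(step))
```

Printing or logging these values before launching a long run is a cheap way to catch a schedule defined in the wrong unit.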

For a smaller CNN trained 90 epochs with batch 256: start lr₀ = 0.1 with SGD momentum 0.9, multiply by 0.1 at epochs 30 and 60. Validation accuracy often jumps right after each drop — a sign the schedule is doing its job.

Common mistakes

Applying step decay in epochs while the framework counts steps — milestones arrive too early or never.
Resuming training from a checkpoint without restoring scheduler state — lr jumps back to lr₀.
Using ReduceLROnPlateau on training loss instead of validation loss — the schedule never triggers on overfitting runs.
Skipping warmup when scaling batch size on transformers — divergence in the first 100 steps.
Tuning only lr₀ while ignoring schedule shape — a bad decay curve wastes a good peak rate.
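The checkpoint mistake in this list is easy to reproduce with a toy scheduler. The class below is hypothetical; its state_dict() and load_state_dict() methods follow the naming that PyTorch schedulers also expose, and the checkpoint is kept as an in-memory dictionary for brevity.

```python
import math


class CosineScheduler:
    """Hypothetical minimal scheduler that tracks its own step counter."""

    def __init__(self, base_lr, total_steps):
        self.base_lr = base_lr
        self.total_steps = total_steps
        self.step_count = 0

    def current_lr(self):
        t = min(self.step_count, self.total_steps)
        return 0.5 * self.base_lr * (1 + math.cos(math.pi * t / self.total_steps))

    def step(self):
        self.step_count += 1

    def state_dict(self):
        return {"step_count": self.step_count}

    def load_state_dict(self, state):
        self.step_count = state["step_count"]


# Train for 600 of 1,000 steps, then checkpoint the scheduler state.
original = CosineScheduler(base_lr=1e-3, total_steps=1000)
for _ in range(600):
    original.step()
checkpoint = {"scheduler": original.state_dict()}

# Resuming without the state restarts at the full base rate.
forgotten = CosineScheduler(base_lr=1e-3, total_steps=1000)
# Resuming with the state continues from where the run stopped.
restored = CosineScheduler(base_lr=1e-3, total_steps=1000)
restored.load_state_dict(checkpoint["scheduler"])
print(forgotten.current_lr(), restored.current_lr())
```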

Practitioner checklist

Log learning rate alongside loss every step or epoch.
Confirm whether your scheduler counts steps or epochs.
For transformers, default to warmup + cosine unless ablations say otherwise.
Run a short lr range test before committing to one-cycle or a peak rate.
Save and restore scheduler state in checkpoints for long runs.
When increasing batch size, consider linear lr scaling and longer warmup.
Compare schedules with the same total compute budget, not the same epoch count.
Watch validation metrics after each manual step decay — timing matters.
Document schedule parameters in experiment tracking (WandB, MLflow).
Re-tune schedule when changing optimizer (SGD to AdamW is not a drop-in swap).

Key takeaways

Learning rate schedules vary lr over time, usually decaying it, so early training explores and late training refines.
Warmup stabilizes large-batch and transformer training before full lr₀ kicks in.
Cosine annealing is the default for modern NLP; step decay remains common in vision.
ReduceLROnPlateau adapts to validation stagnation when total training length is unknown.
Schedules interact with optimizer choice — tune them together, not in isolation.