Gradient descent explained
When a spam filter learns to spot phishing or a language model predicts the next token, the underlying optimization algorithm is almost always some form of gradient descent: repeatedly measure how wrong the model is with a loss function, compute the direction of steepest increase (the gradient), and take a small step in the opposite direction. That single idea — walk downhill on a high-dimensional surface — powers linear regression, deep neural networks, and fine-tuned LLMs alike. This guide explains loss landscapes, the learning rate knob, batch versus stochastic versus mini-batch updates, momentum intuition, a Harbor Payments fraud-scorer worked example, a variant decision table, common pitfalls, and a practical training checklist. For how gradients flow through layered networks, see backpropagation explained; for Adam, RMSprop, and beyond, see neural network optimizers.
What gradient descent is optimizing
Machine learning models are parameterized functions: weights and biases w map inputs to predictions. Training finds w that minimize a scalar loss L(w) — mean squared error for regression, cross-entropy for classification, or more exotic objectives for ranking and generative models. Think of L as height on a landscape: each coordinate is one parameter, and the goal is to reach a valley.
The gradient ∇L(w) is a vector of partial derivatives. Each entry answers: if I nudge this one weight slightly, does loss go up or down? The gradient points toward steepest ascent. Gradient descent flips that direction:
w ← w - η ∇L(w)
The scalar η is the learning rate — step size. Too small and training crawls; too large and you overshoot valleys or diverge entirely. In deep learning, backpropagation efficiently computes ∇L for millions of parameters; gradient descent (or a fancier optimizer) consumes those gradients to update weights.
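As a minimal sketch of the update rule, assume a toy two-parameter quadratic loss chosen only for illustration. The names `loss`, `grad`, and `gradient_descent` are hypothetical and do not come from any library.

```python
def loss(w):
    """Toy convex loss with its minimum at w = (3, -1)."""
    return (w[0] - 3.0) ** 2 + 2.0 * (w[1] + 1.0) ** 2


def grad(w):
    """Analytic gradient of the toy loss: one partial derivative per weight."""
    return [2.0 * (w[0] - 3.0), 4.0 * (w[1] + 1.0)]


def gradient_descent(w, lr=0.1, steps=100):
    """Apply w <- w - lr * grad(w) repeatedly and return the final weights."""
    for _ in range(steps):
        g = grad(w)
        w = [wi - lr * gi for wi, gi in zip(w, g)]
    return w


w_final = gradient_descent([0.0, 0.0])
print(w_final, loss(w_final))
```

With these settings the weights should end up very close to (3, -1). In a real model the hand-written `grad` would be replaced by gradients from backpropagation.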
Local minima, saddle points, and non-convexity
For linear models with convex losses, every local minimum is a global minimum, so any downhill path with a suitably small step size eventually arrives. Neural networks are non-convex: countless local minima and saddle points (flat in some directions, steep in others) litter the landscape. In practice, wide minima with low loss often generalize well even if they are not globally optimal; sharp minima can overfit. Stochastic noise from mini-batches helps escape shallow saddles, which is one reason pure full-batch descent on huge nets can feel stuck while mini-batch SGD progresses.
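The saddle-escape effect can be illustrated on a toy surface. The sketch below assumes the function f(x, y) = x² + y⁴/4 − y²/2, which has a saddle at (0, 0) and minima at (0, ±1), and uses injected Gaussian noise as a stand-in for mini-batch noise. Both choices are illustrative assumptions, not a model of any real network.

```python
import random


def grad(x, y):
    """Gradient of f(x, y) = x**2 + y**4 / 4 - y**2 / 2.

    f has a saddle at (0, 0) and minima at (0, 1) and (0, -1).
    """
    return 2.0 * x, y ** 3 - y


def descend(noise_std, steps=500, lr=0.1, seed=0):
    """Run gradient descent from (1, 0), adding Gaussian noise to each gradient."""
    rng = random.Random(seed)
    x, y = 1.0, 0.0
    for _ in range(steps):
        gx, gy = grad(x, y)
        gx += rng.gauss(0.0, noise_std)
        gy += rng.gauss(0.0, noise_std)
        x -= lr * gx
        y -= lr * gy
    return x, y


print("noise-free:", descend(noise_std=0.0))   # slides along y = 0 into the saddle
print("noisy:     ", descend(noise_std=0.01))  # noise pushes y off the ridge
```

Starting exactly on the ridge y = 0, the noise-free run has no gradient component in y and settles at the saddle, while the noisy run drifts off the ridge and ends near one of the two minima.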
The learning rate: the most important hyperparameter
Learning rate controls how far each update moves. Picture walking down a foggy hillside with a blindfold, feeling the slope under your feet:
- Too small — thousands of epochs to converge; risk of stopping in a flat plateau before reaching a good minimum.
- Too large — loss oscillates or explodes to NaN; you bounce across valleys without settling.
- Just right — steady loss decrease on a held-out validation set without wild spikes.
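The three regimes above can be reproduced on the one-dimensional loss L(w) = w², whose gradient is 2w. This is a toy sketch; the specific rates 0.001, 0.3, and 1.1 are illustrative assumptions that only make sense for this particular loss.

```python
def final_weight(lr, steps=50, w=1.0):
    """Minimize L(w) = w**2 (gradient 2 * w) with a fixed learning rate."""
    for _ in range(steps):
        w = w - lr * 2.0 * w
    return w


for lr in (0.001, 0.3, 1.1):
    print(f"lr={lr}: w after 50 steps = {final_weight(lr):.3e}")
```

Each step multiplies w by (1 − 2·lr), so the smallest rate barely moves, the middle rate shrinks w quickly, and the largest rate flips the sign and grows the magnitude every step.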
Common tactics: start with a moderate rate (e.g. 1e-3, a common default for Adam, though large transformers often train with a lower peak rate plus warmup; 0.01–0.1 for logistic regression), watch validation loss, and reduce when progress stalls. Learning rate scheduling (step decay, cosine annealing, warmup) often beats a fixed rate for deep models. Learning rate finders (increase η until loss spikes, then back off) give a data-driven starting point.
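One way to combine warmup with cosine annealing is sketched below. The function name `lr_at_step` and its default values are hypothetical; deep learning frameworks ship their own scheduler classes, which differ in details such as how the warmup steps are counted.

```python
import math


def lr_at_step(step, total_steps, base_lr=1e-3, warmup_steps=100, min_lr=0.0):
    """Linear warmup to base_lr, then cosine annealing down to min_lr."""
    if step < warmup_steps:
        # Ramp up linearly so the last warmup step reaches base_lr.
        return base_lr * (step + 1) / warmup_steps
    # Fraction of the post-warmup schedule that has elapsed, from 0 to 1.
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


schedule = [lr_at_step(s, total_steps=1000) for s in range(1000)]
print(schedule[0], schedule[99], schedule[549], schedule[-1])
```

The rate rises during the first 100 steps, peaks at `base_lr`, and then decays smoothly toward `min_lr` by the end of training.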
Note: adaptive optimizers like Adam maintain per-parameter step sizes. You still set a base learning rate, but effective steps vary by weight — see our optimizers guide for when Adam beats plain SGD.
Batch, stochastic, and mini-batch gradient descent
The gradient can be computed over different subsets of data. The variant names describe how many examples contribute to each update.
Batch (full) gradient descent
Use the entire training set to compute ∇L each step. Updates are stable because each one follows the exact gradient of the training loss, but each step is expensive on large datasets: one pass over millions of rows per update is impractical. Useful for small tabular problems and as a conceptual baseline.
Stochastic gradient descent (SGD)
Update after one random example. Cheap per step and noisy — noise can help generalization and escape saddles — but loss curves look jagged. Rare in modern deep learning at true batch-size-one; the name survives in "SGD with momentum" even when batch size is 32 or 256.
Mini-batch gradient descent
The practical default: sample a mini-batch of B examples (32, 64, 128, 256…), compute average loss and gradient over the batch, update, repeat. GPU parallelism favors batch sizes that fill tensor cores; too small underutilizes hardware, too large requires retuning the learning rate (usually upward) and can generalize worse. Epoch = one full pass through the training set; steps per epoch = ⌈N / B⌉ for N training examples, since the last batch may be smaller than B.
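A mini-batch training loop might look like the following sketch. It assumes NumPy is available, uses synthetic linear-regression data, and picks N = 1000, B = 64, and a learning rate of 0.1 purely for illustration.

```python
import math

import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data (an assumption for illustration): y = X @ true_w + noise.
N, D, B = 1000, 3, 64
true_w = np.array([2.0, -1.0, 0.5])
X = rng.normal(size=(N, D))
y = X @ true_w + 0.01 * rng.normal(size=N)

w = np.zeros(D)
lr = 0.1
steps_per_epoch = math.ceil(N / B)  # the last batch may be smaller than B

for epoch in range(20):
    order = rng.permutation(N)  # reshuffle the examples every epoch
    for step in range(steps_per_epoch):
        idx = order[step * B:(step + 1) * B]
        residual = X[idx] @ w - y[idx]
        # Gradient of the mean squared error averaged over this batch.
        grad = 2.0 * X[idx].T @ residual / len(idx)
        w -= lr * grad

print(steps_per_epoch, w)
```

Setting B = N turns this into full-batch descent, and B = 1 turns it into true stochastic gradient descent; only the slicing of `order` changes.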
Momentum and beyond plain descent
Plain gradient descent can zigzag in narrow valleys — the gradient oscillates across walls while making slow progress along the floor. Momentum accumulates a velocity vector: updates remember past directions and smooth out oscillations, accelerating along consistent slopes. Nesterov momentum looks one step ahead before computing the gradient — a small refinement that often helps.
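The effect of momentum on a narrow valley can be sketched on a toy quadratic. The update form below (v ← βv + g, then w ← w − ηv) is one common convention and is an assumption here; some references scale the gradient term differently. The loss, learning rate, and β = 0.9 are also illustrative choices.

```python
def grad(w):
    """Gradient of L(w) = 0.5 * (w0**2 + 50 * w1**2), a narrow valley along w0."""
    return [w[0], 50.0 * w[1]]


def loss(w):
    return 0.5 * (w[0] ** 2 + 50.0 * w[1] ** 2)


def run(lr=0.03, beta=0.0, steps=200):
    """Gradient descent with a velocity term; beta=0 recovers plain descent."""
    w = [1.0, 1.0]
    v = [0.0, 0.0]
    for _ in range(steps):
        g = grad(w)
        # Velocity is a decaying sum of past gradients.
        v = [beta * vi + gi for vi, gi in zip(v, g)]
        w = [wi - lr * vi for wi, vi in zip(w, v)]
    return loss(w)


loss_plain = run(beta=0.0)
loss_momentum = run(beta=0.9)
print(loss_plain, loss_momentum)
```

On this surface the plain run is held back by slow progress along the shallow w0 direction, while the momentum run should finish at a noticeably lower loss for the same learning rate and step count.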
Modern frameworks default to Adam or AdamW for deep nets: adaptive per-parameter rates plus momentum-like terms. Classical SGD + momentum still wins some vision benchmarks when tuned carefully. Gradient descent is the conceptual core; production training stacks layer schedulers, weight decay, gradient clipping, and mixed precision on top.
Worked example: Harbor Payments fraud scorer
Harbor Payments routes card transactions in real time. The risk team trains a logistic regression model — twelve engineered features (amount z-score, merchant category risk, velocity counts, device fingerprint mismatch, etc.) — to output fraud probability. Training uses mini-batch gradient descent:
- Initialize weights to zero (or small random values) and bias to log-odds of the base fraud rate (~0.3%).
- Each step, sample a batch of 512 transactions (stratified so roughly 1% positives, about 5 per batch, appear despite class imbalance).
- Forward pass: compute logits, apply sigmoid, get probabilities.
- Loss: binary cross-entropy averaged over the batch, with class weights so false negatives cost more than false positives.
- Backward pass: gradients of loss w.r.t. weights — closed form for logistic regression, no deep backprop needed.
- Update: w ← w - 0.05 ∇L (learning rate 0.05, chosen by grid search on a validation week).
- Repeat for 20 epochs over 8 million historical transactions; early stop if validation AUC plateaus for three epochs.
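The steps above can be sketched in NumPy as follows. This is an illustration under stated assumptions, not Harbor Payments' actual code: the data is synthetic, the dataset is far smaller than 8 million rows, stratified sampling and early stopping are omitted, and the class weight `pos_weight = 10.0` is a hypothetical value.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for the transaction data (an assumption; real features differ).
N, D, B = 20_000, 12, 512
X = rng.normal(size=(N, D))
hidden_w = rng.normal(size=D)
y = (rng.random(N) < 1.0 / (1.0 + np.exp(-(X @ hidden_w - 4.0)))).astype(float)


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


def weighted_bce(p, y, pos_weight):
    """Class-weighted binary cross-entropy averaged over the examples."""
    eps = 1e-12
    per_example = -(pos_weight * y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    return per_example.mean()


base_rate = 0.003
w = np.zeros(D)
b0 = np.log(base_rate / (1 - base_rate))  # log-odds of the base fraud rate
b = b0
lr, pos_weight = 0.05, 10.0  # pos_weight is a hypothetical value

initial_loss = weighted_bce(sigmoid(X @ w + b), y, pos_weight)
for epoch in range(20):
    order = rng.permutation(N)
    for start in range(0, N, B):
        idx = order[start:start + B]
        p = sigmoid(X[idx] @ w + b)
        # Closed-form gradient of the weighted loss with respect to each logit.
        dlogit = (pos_weight * y[idx] * (p - 1) + (1 - y[idx]) * p) / len(idx)
        w -= lr * (X[idx].T @ dlogit)
        b -= lr * dlogit.sum()
final_loss = weighted_bce(sigmoid(X @ w + b), y, pos_weight)
print(b0, initial_loss, final_loss)
```

At the scale described in the text, 8 million transactions in batches of 512 works out to 15,625 updates per epoch, or 312,500 updates over 20 epochs if early stopping never triggers.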
Before deploy, engineers verify: training and validation loss curves descend smoothly (no divergence), gradient norms are bounded, calibration holds on a holdout month, and latency at inference is under 2 ms. They log learning rate and batch size in the experiment tracker so future retrains reproduce the run.
Variant decision table
| Situation | Preferred variant | Why |
|---|---|---|
| Small tabular dataset (<10k rows) | Full-batch or large mini-batch | Cheap per epoch; stable gradients |
| Deep neural network on GPU | Mini-batch (32–512) + AdamW | Hardware utilization; adaptive steps |
| Online / streaming data | True SGD (batch size 1) or small batches | Model updates before full data scan |
| Convex linear model | Full-batch or L-BFGS | Converges to the global minimum with a suitable step size |
| Noisy labels, need regularization | Smaller batches + weight decay | Gradient noise acts as implicit regularizer |
| Training loss flat but validation loss high | Lower LR, early stopping, smaller batch | Reduce overfitting; noisier updates tend to settle in wider minima |
Common pitfalls
- Not shuffling training data — ordered data (sorted by time or label) biases batch gradients and slows convergence.
- Tuning batch size without retuning learning rate — larger batches often need higher LR (linear scaling rule is a starting heuristic, not gospel).
- Judging convergence on training loss alone — validation loss and task metrics (AUC, F1) reveal overfitting.
- Ignoring gradient clipping on RNNs and LLMs — exploding gradients cause NaN weights mid-training.
- Using test set for learning rate search — hyperparameter tuning belongs on validation data; test set is for final report only.
- Assuming zero loss is the goal — perfect training fit usually means memorization; aim for best generalization.
- Forgetting to scale features — unnormalized inputs make loss landscapes ill-conditioned; gradient descent struggles on skewed scales.
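For the exploding-gradient pitfall, clipping by global norm is one common remedy. The helper below is a hypothetical pure-Python sketch over a flat list of gradient values; deep learning frameworks provide their own clipping utilities that operate on parameter tensors.

```python
import math


def clip_by_global_norm(grads, max_norm):
    """Rescale gradient values so their combined L2 norm is at most max_norm.

    Returns the (possibly rescaled) gradients and the norm before clipping.
    """
    norm = math.sqrt(sum(g * g for g in grads))
    if norm <= max_norm:
        return list(grads), norm
    scale = max_norm / norm
    return [g * scale for g in grads], norm


clipped, norm_before = clip_by_global_norm([3.0, 4.0], max_norm=1.0)
print(clipped, norm_before)
```

Clipping keeps the gradient direction and only shrinks its length, so a single bad batch cannot throw the weights far off course. Logging `norm_before` each step also gives the gradient-norm history that helps diagnose instability.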
Practical checklist
- Plot training and validation loss per epoch: both should trend down early; a loss that blows up or turns NaN points to a learning rate that is too high or a bug.
- Normalize or standardize continuous features before gradient-based training.
- Shuffle training batches each epoch; preserve temporal order only when doing time-series validation splits.
- Log learning rate, batch size, optimizer, and seed in every experiment run.
- Run a learning rate range test or grid search on validation data before long training jobs.
- Apply LR scheduling for multi-epoch deep learning runs.
- Monitor gradient norm histograms: sudden spikes often precede NaN failures.
- Use early stopping on validation metric with a patience window.
- Compare against a simple baseline (logistic regression, majority class) before trusting a deep model.
- Reproducibility: fix random seeds for data shuffling and weight init when debugging.
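The early-stopping item in the checklist can be sketched as a small helper. The function name `early_stopping_epoch` and the sample AUC values are hypothetical; the logic assumes a metric where higher is better and a patience window counted in epochs.

```python
def early_stopping_epoch(val_metrics, patience=3, min_delta=0.0):
    """Return the index of the epoch at which training would stop.

    val_metrics holds one validation score per epoch, where higher is better
    (for example AUC). Training stops once the score has failed to improve on
    the best value by more than min_delta for `patience` consecutive epochs.
    """
    best = float("-inf")
    epochs_without_improvement = 0
    for epoch, metric in enumerate(val_metrics):
        if metric > best + min_delta:
            best = metric
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch
    return len(val_metrics) - 1  # never triggered: ran all epochs


aucs = [0.80, 0.85, 0.86, 0.86, 0.855, 0.86, 0.87]
print(early_stopping_epoch(aucs, patience=3))
```

In a real run the check would sit inside the training loop, and the weights from the best epoch would usually be restored rather than the weights from the stopping epoch.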
Key takeaways
- Gradient descent minimizes loss by stepping opposite the gradient, controlled by learning rate η.
- Mini-batch updates balance noise, speed, and GPU efficiency — the default for modern ML.
- Learning rate is the first hyperparameter to tune; schedulers and adaptive optimizers build on the same idea.
- Non-convex landscapes mean you seek good minima, not provably global ones — validation metrics arbitrate.
- Gradient descent connects classical machine learning to deep nets via backpropagation and optimizer stacks.
Related reading
- Backpropagation explained — how gradients are computed through neural networks
- Loss functions explained — what gradient descent is minimizing
- Neural network optimizers explained — Adam, RMSprop, and SGD with momentum
- Learning rate scheduling explained — decay, warmup, and cosine schedules