Vanishing and exploding gradients explained
Training a deep neural network means repeatedly adjusting millions of weights using error signals that flow backward from the output layer. In shallow models this works reliably; in deep stacks the same mechanism can fail in two opposite ways. Vanishing gradients shrink the update signal until early layers barely move — the network looks stuck no matter how long you train. Exploding gradients do the opposite: updates grow until weights become NaN and training crashes. Both problems come from the same root — multiplying many partial derivatives through backpropagation — and both blocked practical deep learning until architects changed activations, initialization, and topology. This guide explains the chain-rule mechanics, how to recognize each failure mode in logs and learning curves, a Harbor Payments fraud-scoring MLP worked example, a fix decision table, common pitfalls, and a practitioner checklist. For the optimizer side, see gradient descent explained; for the nonlinearities that gate signal flow, see activation functions explained.
How gradients propagate through depth
During backpropagation, the gradient a layer passes back to its predecessor equals the gradient it receives from above multiplied by its local Jacobian, which measures how much the layer's output changes when its inputs change; the layer's weight gradient is computed from that same incoming signal. In a network with L layers, the gradient reaching layer 1 is therefore roughly the product of L such local terms. If most factors are smaller than 1 in magnitude, the product shrinks exponentially with depth. If most are larger than 1, it grows exponentially. That multiplicative structure is why depth can turn small per-layer scaling errors into catastrophic training dynamics.
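To make the exponential effect concrete, here is a minimal sketch that treats every layer's local factor as one identical scalar. That is a simplifying assumption (real layers contribute Jacobian matrices whose effect varies by direction), and the function name `gradient_scale` is illustrative only.

```python
def gradient_scale(per_layer_factor: float, depth: int) -> float:
    """Scale applied to a gradient after it passes back through `depth` layers,
    assuming every layer multiplies it by the same scalar factor."""
    scale = 1.0
    for _ in range(depth):
        scale *= per_layer_factor
    return scale


if __name__ == "__main__":
    # Factors only 10% away from 1 already diverge sharply at depth 50.
    for depth in (5, 20, 50):
        shrink = gradient_scale(0.9, depth)
        grow = gradient_scale(1.1, depth)
        print(f"depth={depth:2d}  factor 0.9 -> {shrink:.2e}  factor 1.1 -> {grow:.2e}")
```

Even a per-layer factor close to 1 compounds quickly, which is why the remedies later in this guide aim to keep each factor near 1 rather than merely "not too small".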
The local factor for a fully connected layer with activation function σ includes σ′(z), the derivative of the activation at the pre-activation z. Saturating activations such as sigmoid and tanh have derivatives that approach zero in their flat tails, and sigmoid's derivative peaks at only 0.25 (tanh's peaks at 1, at z = 0). Even in the best case, ten consecutive sigmoid layers scale the gradient by at most 0.25¹⁰ ≈ 10⁻⁶ through the activation derivatives alone, and saturated units can push that to 10⁻¹⁰ or smaller. ReLU and GELU behave differently: ReLU passes gradient 1 through active units and 0 through inactive ones, which avoids the universal shrinkage of sigmoid but introduces its own failure mode (dead ReLUs).
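The activation derivatives discussed above are easy to evaluate directly. The following sketch uses only the standard library; the helper names are illustrative, and the sample points z = 0, 2, 5 are arbitrary choices.

```python
import math


def sigmoid(z: float) -> float:
    """Logistic sigmoid."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_prime(z: float) -> float:
    """Derivative of the sigmoid, s(z) * (1 - s(z)); its maximum is 0.25 at z = 0."""
    s = sigmoid(z)
    return s * (1.0 - s)


def relu_prime(z: float) -> float:
    """Derivative of ReLU: 1 for active units (z > 0), 0 otherwise."""
    return 1.0 if z > 0.0 else 0.0


if __name__ == "__main__":
    for z in (0.0, 2.0, 5.0):
        print(f"z={z:3.1f}  sigmoid'={sigmoid_prime(z):.4f}  relu'={relu_prime(z):.1f}")
    # Best case for ten sigmoid layers: every unit sits exactly at z = 0.
    print(f"sigmoid'(0) ** 10 = {sigmoid_prime(0.0) ** 10:.2e}")
```

The sigmoid derivative falls quickly as z moves away from zero, while the ReLU derivative stays at exactly 1 for any positive input.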
Vanishing vs exploding at a glance
Vanishing gradients produce flat learning curves, especially
for metrics tied to early-layer features; validation loss stalls while later
layers may still learn slowly. Exploding gradients show up as
sudden loss spikes, weight norms climbing each epoch, or literal
NaN / Inf in the loss. Recurrent networks add a
time dimension — the same multiplication happens across unrolled timesteps, so
RNNs were especially prone to both problems before LSTM gates and GRU designs
addressed long-range credit assignment.
Vanishing gradients: causes and symptoms
The classic vanishing scenario stacks sigmoid or tanh layers with weights initialized too small or too large. Small weights shrink the signal directly, because each backward step multiplies the gradient by the weight matrix as well as by σ′(z); large weights drive z into the flat tails, where σ′(z) ≈ 0. Either way, early layers receive near-zero gradients and stop learning, so the network effectively behaves like a shallow model sitting on top of a frozen feature extractor that never learned useful features.
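A scalar toy model shows how weight scale interacts with a saturating activation. The sketch below assumes a chain of one-unit layers, h ← sigmoid(w · h), so each layer's backward factor is w · σ′(z); the weight values 0.1, 1.0 and 20.0 are arbitrary illustrations, not recommended settings.

```python
import math


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def chain_gradient(weight: float, depth: int, x: float = 1.0) -> float:
    """d(output)/d(input) for a scalar chain of `depth` layers h <- sigmoid(weight * h)."""
    h = x
    grad = 1.0
    for _ in range(depth):
        z = weight * h
        s = sigmoid(z)
        grad *= weight * s * (1.0 - s)  # local factor: w * sigmoid'(z)
        h = s
    return grad


if __name__ == "__main__":
    for w in (0.1, 1.0, 20.0):
        print(f"w={w:5.1f}  gradient through 6 layers = {chain_gradient(w, 6):.2e}")
```

In this toy setting the tiny weight shrinks every factor through w itself, the huge weight shrinks it through σ′(z), and the moderate weight still loses signal, only far more slowly.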
Symptoms in practice: training loss decreases for the first few epochs, then plateaus well above the loss a shallow baseline reaches; gradient histograms show early-layer gradients orders of magnitude smaller than output-layer gradients; and an ablation that removes early layers barely hurts accuracy, because those layers never trained. In transformers, vanishing is less common in feed-forward blocks thanks to residual connections and layer normalization, but it still appears in very deep unnormalized MLP stacks or poorly initialized embedding layers.
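One way to turn the gradient-histogram symptom into a number is to compare each layer's gradient L2 norm with the output-side layer's norm. The sketch below is framework-independent and assumes gradients have already been flattened into plain lists; the function names, the `min_ratio` threshold of 1e-3, and the sample values are all hypothetical.

```python
import math
from typing import Dict, List


def layer_grad_norms(grads: Dict[str, List[float]]) -> Dict[str, float]:
    """L2 norm of each layer's flattened gradient, keeping insertion order."""
    return {name: math.sqrt(sum(g * g for g in values)) for name, values in grads.items()}


def starved_layers(norms: Dict[str, float], min_ratio: float = 1e-3) -> List[str]:
    """Layers whose gradient norm is below `min_ratio` times the last layer's norm."""
    reference = list(norms.values())[-1]  # output-side layer, assumed to be listed last
    if reference == 0.0:
        raise ValueError("reference layer has a zero gradient norm")
    return [name for name, norm in norms.items() if norm / reference < min_ratio]


if __name__ == "__main__":
    # Made-up gradients for illustration only, not measurements.
    grads = {
        "layer1": [3e-8, 4e-8],
        "layer3": [3e-4, 4e-4],
        "layer6": [3e-2, 4e-2],
    }
    norms = layer_grad_norms(grads)
    print(norms)
    print("starved:", starved_layers(norms))
```

A ratio is more robust than an absolute threshold because overall gradient scale depends on the loss, batch size, and learning setup.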
Exploding gradients: causes and symptoms
Exploding gradients occur when the product of Jacobians consistently exceeds 1. Common triggers: learning rate too high for the current loss landscape, weights initialized with large variance, recurrent networks processing long sequences without gating, and attention logits growing unbounded before softmax. A single bad minibatch with an outlier label can spike gradients enough to corrupt weights permanently if no clipping is applied.
Watch for loss jumping from 0.4 to 400 in one step, optimizer state norms diverging, or CUDA overflow warnings. Unlike vanishing, exploding is often acute — training was fine until it suddenly wasn't. Mitigation is usually immediate: lower learning rate, clip global gradient norm, switch to mixed-precision loss scaling only after confirming the root cause is not a data bug, and inspect batches for mislabeled extremes.
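Clipping the global gradient norm, mentioned above, can be sketched in a few lines. This version works on plain nested lists rather than any framework's tensors, and the name `clip_global_norm` is illustrative; deep learning libraries ship their own equivalents.

```python
import math
from typing import List, Tuple


def clip_global_norm(
    grads: List[List[float]], max_norm: float
) -> Tuple[List[List[float]], float]:
    """Rescale all gradients together so their joint L2 norm is at most `max_norm`.

    Returns the (possibly rescaled) gradients and the norm measured before clipping.
    """
    total = math.sqrt(sum(g * g for layer in grads for g in layer))
    if not math.isfinite(total):
        # NaN or Inf cannot be repaired by rescaling; the step should be skipped.
        raise ValueError("non-finite gradient norm: skip the step and inspect the batch")
    if total <= max_norm:
        return grads, total
    scale = max_norm / total
    return [[g * scale for g in layer] for layer in grads], total


if __name__ == "__main__":
    clipped, norm_before = clip_global_norm([[3.0], [4.0]], max_norm=1.0)
    print(norm_before, clipped)
```

Because every gradient is multiplied by the same scale, clipping changes the step length but not its direction. Logging the pre-clip norm is what reveals how often the ceiling is actually hit.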
Worked example: Harbor Payments deep fraud MLP
Harbor Payments trains a fully connected classifier on 42 transaction features — amount, merchant category, device fingerprint hashes, velocity counters — to flag card-not-present fraud. The first production model used six hidden layers (256→128→64→32→16→8 units) with sigmoid activations and default Xavier initialization. Training loss dropped from 0.69 to 0.55 in epoch 1, then flatlined for 40 epochs. Validation AUC stuck at 0.71 while a two-layer ReLU baseline reached 0.89.
Gradient logging told the story: layer-1 weight gradients averaged 10⁻⁸ while layer-6 gradients averaged 10⁻², six orders of magnitude apart. Swapping sigmoid for ReLU on hidden layers, applying He initialization, and adding batch normalization after each dense block brought layer-1 gradients into the 10⁻⁴ range within two epochs. AUC climbed to 0.90 by epoch 15. The team capped the final model at four hidden layers, enough capacity for interaction features without unnecessary multiplication depth, and added global gradient clipping at norm 1.0 as insurance during high-variance holiday traffic weeks. The fix was architectural, not hyperparameter tuning alone: no amount of learning-rate search rescues a sigmoid stack whose early layers receive no signal.
Fix decision table
| Symptom / context | Likely cause | First-line fix |
|---|---|---|
| Deep MLP plateaus; early-layer gradients near zero | Saturating activations + depth | ReLU/GELU hidden layers; He initialization; batch or layer norm |
| Loss spikes to NaN mid-training | Exploding gradients or bad batch | Gradient clipping; lower LR; inspect labels and loss scaling |
| RNN forgets long-range dependencies | Vanishing through time | LSTM/GRU; truncated BPTT; attention mechanisms |
| Many ReLU units output exactly zero forever | Dead ReLUs from a large negative bias or a too-high LR | Leaky ReLU/ELU; lower LR; better initialization |
| Very deep CNN or transformer unstable | Jacobian product across blocks | Residual skip connections; pre-norm; warmup LR schedule |
| Gradients OK but loss noisy | Minibatch variance, not vanishing/exploding | Larger batch, gradient accumulation, or AdamW — do not over-clip |
Architectural and training remedies
Modern deep learning stacks several defenses. Better activations — ReLU family and GELU — keep derivatives near 1 in their active regions. Careful initialization (Xavier for tanh/sigmoid, He for ReLU) sets initial activations in the linear-ish part of each nonlinearity. Normalization layers re-center and rescale activations each forward pass, preventing internal covariate shift that pushes units into saturation.
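The two initialization schemes named above differ only in the standard deviation they assign to the weights. The sketch below uses the commonly cited normal-distribution variants (Xavier: sqrt(2 / (fan_in + fan_out)); He: sqrt(2 / fan_in)); the helper names are illustrative, and the 42 → 256 layer shape is borrowed from the worked example purely for illustration.

```python
import math
import random
from typing import List


def xavier_std(fan_in: int, fan_out: int) -> float:
    """Standard deviation for Xavier (Glorot) normal initialization."""
    return math.sqrt(2.0 / (fan_in + fan_out))


def he_std(fan_in: int) -> float:
    """Standard deviation for He normal initialization, intended for ReLU layers."""
    return math.sqrt(2.0 / fan_in)


def init_weights(fan_in: int, fan_out: int, std: float, seed: int = 0) -> List[List[float]]:
    """Sample a fan_out x fan_in weight matrix from a zero-mean normal distribution."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]


if __name__ == "__main__":
    fan_in, fan_out = 42, 256
    print(f"Xavier std: {xavier_std(fan_in, fan_out):.4f}")
    print(f"He std:     {he_std(fan_in):.4f}")
    weights = init_weights(fan_in, fan_out, he_std(fan_in))
    print(len(weights), "rows x", len(weights[0]), "columns")
```

He initialization uses a larger variance because roughly half of the ReLU units output zero, which would otherwise halve the signal variance at every layer.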
Residual skip connections add a direct path from input to output of each block, so gradients can flow around deep sub-stacks even if the residual branch's Jacobian is small — the identity shortcut carries at least part of the signal. Gradient clipping caps the global L2 norm of all parameter gradients before the optimizer step; it does not fix vanishing but prevents explosions. Learning rate warmup starts small and ramps up, avoiding large destructive steps while weights are still poorly scaled. Together these techniques are why 100-layer networks train at all; without them, depth is a liability.
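The effect of the identity shortcut can be illustrated with scalars. The sketch below assumes each block computes y = x + f(x), so its local derivative is 1 + f′(x), and compares that with a plain stack whose local derivative is just f′(x). The branch derivative of 0.01 and the depth of 20 are arbitrary illustrative values.

```python
from typing import Sequence


def plain_stack_gradient(branch_derivs: Sequence[float]) -> float:
    """Gradient through stacked blocks y = f(x): the product of the f'(x) terms."""
    grad = 1.0
    for d in branch_derivs:
        grad *= d
    return grad


def residual_stack_gradient(branch_derivs: Sequence[float]) -> float:
    """Gradient through stacked blocks y = x + f(x): the product of (1 + f'(x))."""
    grad = 1.0
    for d in branch_derivs:
        grad *= 1.0 + d  # the identity path contributes the constant 1
    return grad


if __name__ == "__main__":
    derivs = [0.01] * 20
    print(f"plain:    {plain_stack_gradient(derivs):.2e}")
    print(f"residual: {residual_stack_gradient(derivs):.2e}")
```

The same arithmetic shows why residual networks still benefit from normalization and clipping: if the branch derivatives are consistently positive and not small, the product of (1 + f′) terms can grow instead of shrink.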
Common pitfalls
- Diagnosing vanishing when the real problem is data — a plateau near random-guess AUC may mean no signal in features, not dead gradients; always compare against a shallow baseline.
- Clipping too aggressively — norm 0.01 clipping on a healthy network slows convergence without fixing vanishing; clip only when norms spike or use a reasonable ceiling (0.5–5.0).
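- Applying batch norm wrong at inference: the model must run in eval mode so that batch norm uses the running statistics accumulated during training rather than per-batch statistics; otherwise the resulting train-serve skew can mimic vanishing symptoms in production.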
- Stacking depth without residual paths — adding layers to an MLP rarely helps past 3–4 hidden blocks unless normalization and modern activations are in place.
- Ignoring mixed-precision overflow: float16 values that overflow under automatic mixed precision (AMP), for example when the loss scale is too high, can look like exploding gradients; check whether NaNs appear only under AMP.
- Treating LSTM as a vanishing cure for bad features — gating helps temporal credit assignment but cannot invent predictive signal from noise.
- Logging only loss, not gradient norms per layer — layer-wise histograms or norm ratios catch problems in hours instead of weeks of blind tuning.
Practitioner checklist
- Start with a shallow ReLU baseline; confirm depth is justified by a metric gain.
- Log per-layer gradient L2 norms or histograms for the first 1–3 epochs.
- Use He init for ReLU/GELU stacks, Xavier for tanh/sigmoid if you must use them.
- Add batch norm (CNNs/MLPs) or layer norm (transformers/RNNs) when depth exceeds 2–3 blocks.
- Enable global gradient clipping (start at norm 1.0) for RNNs and large transformers.
- Use learning rate warmup for deep models and fine-tuning pretrained backbones.
- Prefer residual connections when stacking identical blocks beyond depth 4.
- Investigate sudden NaNs immediately — checkpoint last good weights before restarting.
- Separate minibatch noise from structural vanishing by comparing gradient ratios across layers.
- Document which fixes were applied; depth without the accompanying stack is a regression risk.
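The warmup item in the checklist can be sketched as a linear ramp. This is one simple schedule among several in common use; the function name and the example values for `base_lr` and `warmup_steps` are hypothetical.

```python
def warmup_lr(step: int, base_lr: float, warmup_steps: int) -> float:
    """Linear warmup: ramp from base_lr / warmup_steps up to base_lr, then hold.

    `step` is the zero-based index of the optimizer step about to be taken.
    """
    if warmup_steps <= 0 or step >= warmup_steps:
        return base_lr
    return base_lr * (step + 1) / warmup_steps


if __name__ == "__main__":
    for step in (0, 49, 99, 500):
        print(step, warmup_lr(step, base_lr=1e-3, warmup_steps=100))
```

In practice a decay schedule usually follows the warmup phase; this sketch simply holds the base rate to keep the example short.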
Key takeaways
- Vanishing and exploding gradients arise from multiplying many partial derivatives through deep networks during backprop.
- Saturating activations and poor initialization cause vanishing; large learning rates and unbounded recurrence cause exploding.
- Symptoms differ: vanishing plateaus loss with tiny early-layer gradients; exploding spikes loss to NaN.
- ReLU-family activations, proper init, normalization, residuals, and clipping are the standard fix stack — not just a lower learning rate.
- Log gradient norms per layer early; a shallow baseline tells you whether depth is helping or hurting.
Related reading
- Backpropagation explained — chain rule and computational graphs
- Activation functions explained — ReLU, sigmoid, GELU and derivative behavior
- Batch normalization explained — stabilizing internal activations
- Learning rate scheduling explained — warmup and decay strategies