Backpropagation explained
Every trained neural network — from a tiny spam filter to a billion-parameter language model — relies on the same learning engine: backpropagation (short for backward propagation of errors). The idea is deceptively simple: run inputs forward through the network, measure how wrong the output is with a loss function, then walk backward layer by layer applying the chain rule from calculus to compute how each weight contributed to that error. Those gradients feed optimizers like SGD and Adam that nudge weights in the direction that reduces loss. This guide explains the forward and backward passes, computational graphs, gradients through common layers, vanishing and exploding gradients, how modern frameworks automate the math with autodiff, and how to debug training when gradients go wrong.
The training loop in one picture
Training is a repeating cycle. A mini-batch of labeled examples enters the network. The forward pass computes predictions. The loss quantifies error (MSE for regression, cross-entropy for classification). Backpropagation computes partial derivatives of loss with respect to every trainable parameter. The optimizer applies an update rule — typically weight minus learning rate times gradient. Repeat for thousands of batches until validation loss plateaus.
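As a minimal sketch of this cycle, assume a one-feature linear model trained with MSE and plain SGD on synthetic data. The data, the learning rate `lr`, the batch size, and the epoch count are illustrative choices, not values from any particular framework or from this guide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 3x + 1 plus a little noise (illustrative only).
X = rng.normal(size=(256, 1))
y = 3.0 * X[:, 0] + 1.0 + 0.01 * rng.normal(size=256)

w, b = 0.0, 0.0      # trainable parameters
lr = 0.1             # learning rate (assumed value)
batch_size = 32

for epoch in range(50):
    order = rng.permutation(len(X))          # shuffle each epoch
    for start in range(0, len(X), batch_size):
        idx = order[start:start + batch_size]
        xb, yb = X[idx, 0], y[idx]

        pred = w * xb + b                    # forward pass
        err = pred - yb
        loss = 0.5 * np.mean(err ** 2)       # MSE loss

        grad_w = np.mean(err * xb)           # backward pass: dL/dw
        grad_b = np.mean(err)                # backward pass: dL/db

        w -= lr * grad_w                     # optimizer step (plain SGD)
        b -= lr * grad_b

print(f"w={w:.3f}, b={b:.3f}, last batch loss={loss:.5f}")
```

For a model this small the gradients can be written by hand; backpropagation is what produces the same quantities automatically for deep networks.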
Backpropagation is not an alternative to gradient descent; it is the efficient procedure for computing the gradients that gradient descent consumes in layered, differentiable models. Without it, you would need finite-difference approximation, which costs at least one extra forward pass per weight and is impossibly slow for millions of weights. With it, the gradients for all parameters are obtained together at a cost of a small constant multiple of one forward pass.
Forward pass: building the computational graph
A neural network is a composition of differentiable functions. Each layer transforms its input:
z = W · x + b (linear transform) followed by a = f(z) (nonlinear activation). Stacking L layers gives a deep function: output = f_L(f_{L-1}(... f_1(x))).
During the forward pass, frameworks record every operation on a computational graph: a directed acyclic graph where nodes are tensors (values) and edges are operations (add, matmul, ReLU, etc.). This graph is the roadmap for the backward pass. PyTorch (and TensorFlow inside a GradientTape) records it dynamically as code executes; JAX obtains it by tracing functions; older static-graph frameworks compiled it ahead of time.
The forward pass also caches intermediate activations. Backprop needs them: the gradient through a ReLU layer depends on whether the pre-activation was positive. Memory therefore scales with batch size, layer width, and sequence length — a key constraint when training large transformers on limited GPU RAM.
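The caching behavior can be sketched in NumPy, assuming a plain stack of Linear then ReLU layers. The `forward` helper, the `cache` list, and the layer widths are hypothetical names and values chosen for illustration, not part of any framework's API.

```python
import numpy as np

def forward(x, params):
    """Run a stack of Linear -> ReLU layers and cache what backward will need."""
    cache = []
    a = x
    for W, b in params:
        z = W @ a + b            # linear transform
        cache.append((a, z))     # layer input and pre-activation, saved for backward
        a = np.maximum(z, 0.0)   # ReLU activation
    return a, cache

rng = np.random.default_rng(0)
sizes = [4, 8, 8, 3]             # illustrative layer widths
params = [(rng.normal(size=(n_out, n_in)) * 0.1, np.zeros(n_out))
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]

out, cache = forward(rng.normal(size=4), params)
cached_floats = sum(a.size + z.size for a, z in cache)
print(out.shape, len(cache), cached_floats)
```

Even in this toy case the cache holds more values than the output itself, which is the effect that grows into a memory bottleneck at scale.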
The chain rule: why backward works
Suppose loss L depends on weight w through intermediate variables. If L = g(h(w)), the chain rule says:
dL/dw = (dL/dh) · (dh/dw)
In a 50-layer network, the loss depends on a first-layer weight w through a chain of 50 functions, so its gradient is a product of 50 local Jacobians. Backpropagation applies the chain rule once per node and reuses partial results, which is dynamic programming on the graph. Starting from the loss node, it propagates "upstream gradients" to each parent.
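A small worked instance of the chain rule, using assumed functions h(w) = 3w + 1 and g(h) = h² (chosen only because their derivatives are easy to verify by hand), compared against a central finite difference:

```python
def h(w):
    return 3.0 * w + 1.0

def g(h_val):
    return h_val ** 2

w = 2.0
h_val = h(w)                 # forward: h = 7
dL_dh = 2.0 * h_val          # local derivative of g at h
dh_dw = 3.0                  # local derivative of h at w
dL_dw = dL_dh * dh_dw        # chain rule: 14 * 3 = 42

eps = 1e-6
numeric = (g(h(w + eps)) - g(h(w - eps))) / (2 * eps)  # central difference
print(dL_dw, numeric)
```

The analytic value is exact, while the numeric estimate agrees only up to floating-point error and needs two extra function evaluations per weight.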
Consider a two-layer network: input x, hidden h = ReLU(z_1) with z_1 = W_1 x + b_1, output y = W_2 h + b_2, loss L = ½(y - target)^2. Backward pass order:
- dL/dy = y - target (MSE derivative).
- dL/dW_2 = dL/dy · h^T (outer product).
- dL/dh = W_2^T · dL/dy (chain through linear layer).
- dL/dz_1 = dL/dh ⊙ ReLU'(z_1) (element-wise mask).
- dL/dW_1 = dL/dz_1 · x^T.
Each step uses only local information plus the gradient arriving from above. That locality is what makes backprop scalable to billions of parameters.
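The backward pass listed above can be sketched in NumPy and spot-checked numerically. The shapes (3 inputs, 4 hidden units, 1 output), the random values, and the `loss_fn` helper are assumptions made for the example; the bias gradients `db1` and `db2` are added for completeness.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=3)
target = 0.5
W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)
W2, b2 = rng.normal(size=(1, 4)), rng.normal(size=1)

def loss_fn(W1, b1, W2, b2):
    h = np.maximum(W1 @ x + b1, 0.0)
    y = W2 @ h + b2
    return 0.5 * float((y[0] - target) ** 2)

# Forward pass, keeping intermediates.
z1 = W1 @ x + b1
h = np.maximum(z1, 0.0)
y = W2 @ h + b2

# Backward pass, in the order listed above.
dy = y - target                  # dL/dy, shape (1,)
dW2 = np.outer(dy, h)            # dL/dW2, shape (1, 4)
db2 = dy
dh = W2.T @ dy                   # dL/dh, shape (4,)
dz1 = dh * (z1 > 0)              # ReLU mask
dW1 = np.outer(dz1, x)           # dL/dW1, shape (4, 3)
db1 = dz1

# Spot-check one entry of dW1 against a central finite difference.
eps = 1e-6
i, j = int(np.argmax(z1)), 0     # row of the largest pre-activation
Wp, Wm = W1.copy(), W1.copy()
Wp[i, j] += eps
Wm[i, j] -= eps
numeric = (loss_fn(Wp, b1, W2, b2) - loss_fn(Wm, b1, W2, b2)) / (2 * eps)
print(dW1[i, j], numeric)
```

Note that each gradient has the same shape as the parameter it belongs to, which is a quick sanity check when writing backward code by hand.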
Gradients through common building blocks
Linear (fully connected) layer
y = Wx + b. Given upstream gradient dL/dy, the weight gradient is dL/dW = (dL/dy) x^T (batched as a matrix multiply). The bias gradient is dL/dy itself, summed over the batch dimension when inputs are batched. The input gradient dL/dx = W^T (dL/dy) is passed to the previous layer.
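For the batched case, a sketch under one common layout assumption: examples are stored as rows, so X has shape (B, n_in) and Y = X @ W.T + b. The `linear_backward` helper is a hypothetical name, and `dY` is a random stand-in for the upstream gradient.

```python
import numpy as np

def linear_backward(dY, X, W):
    """Gradients for Y = X @ W.T + b with X of shape (B, n_in), W of shape (n_out, n_in)."""
    dW = dY.T @ X            # (n_out, n_in): sum over the batch of outer products
    db = dY.sum(axis=0)      # (n_out,): sum over the batch dimension
    dX = dY @ W              # (B, n_in): passed to the previous layer
    return dW, db, dX

rng = np.random.default_rng(0)
B, n_in, n_out = 5, 3, 2
X = rng.normal(size=(B, n_in))
W = rng.normal(size=(n_out, n_in))
dY = rng.normal(size=(B, n_out))     # stand-in for the upstream gradient

dW, db, dX = linear_backward(dY, X, W)
print(dW.shape, db.shape, dX.shape)
```

With the opposite layout (examples as columns) the transposes move, but the rule is the same: sum per-example contributions for parameters and keep per-example gradients for inputs.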
ReLU activation
ReLU(z) = max(0, z). Derivative is 1 where z > 0, else 0. Backward pass multiplies upstream gradient by this binary mask. Dead ReLUs (permanently zero gradient) motivated leaky ReLU and GELU alternatives in modern architectures.
Sigmoid and tanh
Sigmoid squashes to (0, 1); derivative peaks at 0.25. Tanh derivative peaks at 1 but still saturates at extremes. Stacking many sigmoid layers multiplies small derivatives — the classic vanishing gradient problem that stalled deep networks before ReLU and residual connections.
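To see how quickly the product shrinks, the sketch below multiplies the sigmoid's best-case derivative across a few assumed depths. It deliberately ignores the weight matrices, which also scale the gradient in a real network.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

peak = sigmoid_grad(0.0)               # 0.25, the largest value the derivative takes
for depth in (5, 10, 20):
    print(depth, peak ** depth)        # best-case product of activation derivatives
```

At 20 layers the activation derivatives alone scale the gradient by less than one part in a trillion, even when every unit sits at its most favorable operating point.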
Softmax + cross-entropy
For classification, softmax and cross-entropy are often fused. The combined gradient simplifies beautifully to (prediction - one_hot_target) / batch_size — numerically stable and fast. This is why you should use the framework's combined loss, not softmax output followed by manual log.
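A NumPy sketch of the fused computation, assuming integer class labels and a loss averaged over the batch. The `softmax_cross_entropy` function is written for this example and is not a framework API; the finite-difference check at the end verifies a single logit's gradient.

```python
import numpy as np

def softmax_cross_entropy(logits, labels):
    """Mean cross-entropy over a batch and its gradient with respect to the logits."""
    shifted = logits - logits.max(axis=1, keepdims=True)      # avoid overflow in exp
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    batch = logits.shape[0]
    loss = -log_probs[np.arange(batch), labels].mean()

    grad = np.exp(log_probs)                  # softmax probabilities
    grad[np.arange(batch), labels] -= 1.0     # subtract the one-hot target
    return loss, grad / batch

rng = np.random.default_rng(0)
logits = rng.normal(size=(4, 3))
labels = np.array([0, 2, 1, 2])
loss, grad = softmax_cross_entropy(logits, labels)

# Finite-difference check on one logit.
eps = 1e-6
bumped = logits.copy()
bumped[1, 0] += eps
numeric = (softmax_cross_entropy(bumped, labels)[0] - loss) / eps
print(grad[1, 0], numeric)
```

Working in log space with the max subtracted is what makes the fused version stable; computing softmax first and taking its log afterward can hit log(0) for confident predictions.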
Convolution and attention
Convolutions are structured linear ops; backprop is convolution with flipped kernels (or equivalently, im2col + matmul). Self-attention backward pass involves gradients through query-key products and softmax — memory-heavy, which is why gradient checkpointing trades extra forward recomputation for lower activation memory.
Vanishing and exploding gradients
When gradients are multiplied across many layers, they can shrink toward zero (vanishing) or grow without bound (exploding). Symptoms:
- Vanishing: early layers learn glacially; loss flatlines; RNNs forget long-range dependencies — the problem LSTMs were designed to fix.
- Exploding: loss spikes to NaN; weights become Inf; training diverges in a few steps.
Mitigations that work in practice:
- ReLU-family activations instead of sigmoid in hidden layers.
- Residual (skip) connections — gradients can flow directly around blocks; essential in ResNet and transformers.
- Layer normalization and batch normalization — stabilize activation scale entering each layer.
- Gradient clipping — cap global norm before the optimizer step.
- Careful weight initialization (Xavier, He) so activations start in a reasonable range.
- Lower learning rate when loss is unstable.
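Of the mitigations above, global-norm gradient clipping is the simplest to sketch. The `clip_by_global_norm` helper below is a hypothetical NumPy version written for illustration; frameworks ship their own equivalents, and the `max_norm` value is an arbitrary example.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale a list of gradient arrays so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

grads = [np.array([3.0, 4.0]), np.array([[12.0]])]   # global norm = 13
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(norm_before, norm_after)
```

Because every array is scaled by the same factor, the update direction is preserved and only its length changes.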
Automatic differentiation vs manual gradients
You rarely implement backprop by hand except for education or custom CUDA kernels. Frameworks provide automatic differentiation (autodiff):
- Forward-mode autodiff: efficient when inputs are few and outputs many (uncommon in ML).
- Reverse-mode autodiff: exactly backpropagation; efficient when outputs are few and inputs many, as in typical training, where a single scalar loss depends on millions of parameters.
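PyTorch's loss.backward() triggers reverse-mode autodiff on the graph built during the forward pass. JAX's grad transforms a function into one that returns its gradient. TensorFlow's GradientTape records operations similarly. Layers composed of built-in differentiable operations get their backward pass for free; only custom operations (for example a hand-written kernel) must define both forward and backward rules. Get the Jacobian wrong there and training silently fails.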
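To make reverse-mode autodiff concrete, here is a toy scalar version in pure Python. The `Value` class and its attribute names are invented for this sketch and support only addition, multiplication, and ReLU; real frameworks work on tensors and are far more elaborate.

```python
class Value:
    """A scalar that records how it was computed so gradients can flow backward."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent) for each parent

    def __add__(self, other):
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def relu(self):
        return Value(max(self.data, 0.0), (self,), (1.0 if self.data > 0 else 0.0,))

    def backward(self):
        # Topologically order the graph so each node is processed after its consumers.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for p in node._parents:
                    visit(p)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad   # chain rule, accumulated


w, x, b = Value(2.0), Value(3.0), Value(-1.0)
out = (w * x + b).relu()     # forward pass builds the graph
out.backward()               # reverse pass fills in .grad
print(out.data, w.grad, x.grad, b.grad)
```

Each operation stores only its local derivatives, and `backward` combines them in reverse order, which is the same locality the hand-derived backward pass relies on.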
Zeroth-order methods (evolution strategies, finite differences) do not need analytic gradients but scale poorly with parameter count, so they are useful for black-box tuning, not LLM training.
Memory, compute, and checkpointing
The backward pass costs roughly 2x the FLOPs of a forward pass, so a full training step is about 3x a forward pass, but activation memory can dominate. For a batch of B sequences of length T through L layers of width H, storing all activations is O(B · T · H · L).
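A back-of-the-envelope version of that estimate is sketched below. The `activation_bytes` helper, the 2-byte value size, and the single cached tensor per layer are simplifying assumptions; real transformer layers cache several tensors each, so treat the result as a lower bound.

```python
def activation_bytes(batch, seq_len, hidden, layers, bytes_per_value=2, tensors_per_layer=1):
    """Rough O(B*T*H*L) activation footprint; tensors_per_layer is an assumed constant."""
    return batch * seq_len * hidden * layers * bytes_per_value * tensors_per_layer

# Illustrative configuration, not taken from any specific model.
estimate = activation_bytes(batch=8, seq_len=2048, hidden=4096, layers=32)
print(f"{estimate / 2**30:.1f} GiB")
```

Doubling the batch size or the sequence length doubles this figure, which is why both are the first knobs to turn when training runs out of memory.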
Gradient checkpointing (activation checkpointing) discards some intermediate activations during forward and recomputes them during backward. It typically trades roughly 33% more compute for 50-80% less memory and is standard for large-model training.
Mixed-precision training (FP16/BF16 forward and backward, FP32 master weights) speeds up matmuls and cuts memory. With FP16, loss scaling keeps small gradients from underflowing; BF16 has the same exponent range as FP32 and usually does not need it.
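The reason loss scaling matters for FP16 can be shown with a single number. In this sketch the gradient value and the scale factor of 1024 are arbitrary illustrative choices; frameworks usually pick and adjust the scale dynamically.

```python
import numpy as np

true_grad = 1e-8                        # a small gradient value (illustrative)
scale = 1024.0                          # assumed loss-scale factor

unscaled = np.float16(true_grad)        # rounds to zero: far below FP16's smallest subnormal (about 6e-8)
scaled = np.float16(true_grad * scale)  # representable once the loss is scaled up
recovered = np.float32(scaled) / np.float32(scale)   # unscale in FP32 before the update

print(float(unscaled), float(recovered))
```

Scaling the loss scales every gradient by the same factor, so dividing by that factor in FP32 recovers a close approximation of the value that FP16 alone would have flushed to zero.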
Debugging gradient flow
When training misbehaves, inspect gradients before blaming the architecture:
- Gradient norm histogram — log per-layer L2 norms; dead layers show near-zero norms.
- Check for NaN/Inf — often from log(0), division by zero, or learning rate too high.
- Verify loss decreases on a single batch — overfit one batch; if loss cannot reach ~zero, bug in labels, loss, or frozen layers.
- Disable augmentation temporarily — isolate data pipeline bugs.
- Compare autograd to finite differences on a tiny network — catches custom backward errors.
- Watch for detached tensors: an accidental .detach() or an inference-mode context stops gradient flow.
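The first two checks in the list above can be combined into one small report. The `grad_report` helper, the `tiny` threshold, and the layer names below are hypothetical; in practice the gradients would come from your framework after a backward pass.

```python
import numpy as np

def grad_report(named_grads, tiny=1e-7):
    """Return per-layer L2 norms plus flags for non-finite or near-zero gradients."""
    report = {}
    for name, g in named_grads.items():
        finite = bool(np.all(np.isfinite(g)))
        norm = float(np.linalg.norm(g)) if finite else float("nan")
        report[name] = {"norm": norm, "finite": finite, "dead": finite and norm < tiny}
    return report

# Hypothetical gradients for three layers.
named_grads = {
    "layer1.weight": np.zeros((4, 3)),                 # no signal reaching this layer
    "layer2.weight": np.array([[0.3, -0.4]]),          # healthy, norm 0.5
    "layer3.weight": np.array([1.0, np.nan]),          # contains NaN
}
for name, stats in grad_report(named_grads).items():
    print(name, stats)
```

Logging a report like this every few steps early in training usually localizes a problem to a specific layer before the loss curve makes it obvious.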
Backprop vs other learning paradigms
| Paradigm | Needs gradients? | Typical use |
|---|---|---|
| Supervised backprop | Yes | Classification, regression, fine-tuning LLMs |
| Reinforcement learning (policy gradient) | Yes, through policy | Games, robotics, RLHF fine-tuning |
| Evolution strategies / genetic algorithms | No | Small networks, hyperparameter search |
| Tree boosting (XGBoost) | Not via backprop (greedy splits fit to loss gradients) | Tabular data, Kaggle baselines |
| k-NN, random forests | No | Interpretable baselines, small data |
Backprop requires differentiable (or sub-differentiable) operations end-to-end. Discrete choices — hard argmax, non-differentiable sorting — need straight-through estimators, reinforcement signals, or relaxation tricks.
Common mistakes
- Forgetting optimizer.zero_grad(): gradients accumulate across steps.
- Training in eval mode: dropout is disabled and batch norm uses its running statistics, so the model is not trained under the conditions you intended.
- Wrong loss for the task: MSE on classification, or applying sigmoid before BCEWithLogitsLoss, which expects raw logits.
- Not shuffling training data — correlated batches bias gradient estimates.
- Learning rate too high — loss oscillates or diverges despite correct backprop.
- Freezing the wrong layers: requires_grad=False on layers that should learn.
- Assuming deeper always helps: without residuals and normalization, gradients vanish and extra layers hurt.
Production checklist
- Confirm loss function matches task type and output activation.
- Log training and validation loss every epoch; watch for divergence early.
- Monitor per-layer gradient norms during the first 100 steps.
- Use mixed precision with dynamic loss scaling on supported GPUs.
- Apply gradient clipping when training RNNs or very deep nets.
- Enable gradient checkpointing if activation memory OOMs before compute saturates.
- Unit-test custom layers with finite-difference gradient checks.
- Version control exact optimizer, LR schedule, and batch size with each experiment.
- Checkpoint model weights before LR warmup completes — early instability is common.
- Compare against a simpler baseline (logistic regression, small MLP) before scaling up.
Key takeaways
- Backpropagation efficiently computes gradients via the chain rule on a computational graph.
- Forward pass caches activations; backward pass reuses them to propagate error signals to every weight.
- Vanishing/exploding gradients are managed with ReLU, residuals, normalization, clipping, and careful initialization.
- Autodiff frameworks automate the math — focus on architecture, loss, and data; verify custom ops.
- Gradients are diagnostic — when training fails, inspect them before adding layers or data.