Activation functions explained
Stack ten linear layers in a neural network and you still have one big linear transformation: matrix multiplies compose, but they never bend the decision boundary. Activation functions are the small nonlinear gates applied after each weighted sum that let deep models approximate arbitrary functions. Pick the wrong one and training stalls (vanishing gradients with sigmoid in a 50-layer CNN) or units go permanently silent (ReLU dying when its inputs stay negative). This guide explains what activations do, walks through the functions you will actually see in production (ReLU, Leaky ReLU, sigmoid, tanh, softmax, GELU, and Swish), maps them to hidden vs output layers, connects them to gradient flow during backpropagation, and ends with a decision table and checklist for deep learning practitioners.
Why nonlinearity is non-negotiable
A single neuron computes z = w·x + b, then applies an activation
a = f(z). Without f, or with f linear
(identity), a network of arbitrary depth is equivalent to a single affine map
(one weight matrix plus one bias vector), so
it can only learn linear decision surfaces and linear regression fits.
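The collapse can be checked numerically. The sketch below uses small made-up weights (`W1`, `b1`, `W2`, `b2` are illustrative values, not from any real model) and shows that two stacked layers without an activation give the same output as one precomputed affine layer.

```python
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def affine(W, b, x):
    """Compute W x + b."""
    return [v + bi for v, bi in zip(matvec(W, x), b)]

# Illustrative weights only.
W1, b1 = [[1.0, 2.0], [0.5, -1.0]], [0.1, -0.2]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [0.0, 1.0]
x = [3.0, -4.0]

# Two layers with no activation in between.
two_layers = affine(W2, b2, affine(W1, b1, x))

# The equivalent single layer: W = W2 W1, b = W2 b1 + b2.
W = matmul(W2, W1)
b = affine(W2, b2, b1)
one_layer = affine(W, b, x)

print(two_layers, one_layer)
```

The two printed vectors agree up to floating-point rounding, which is the whole point: depth adds nothing until a nonlinear f sits between the layers.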
Nonlinear f breaks that collapse. The universal approximation
theorem guarantees that a feedforward network with one sufficiently wide hidden
layer and a non-polynomial activation can approximate any continuous function on
a compact domain to arbitrary accuracy. In practice we use many narrower layers because depth
composes features hierarchically (edges, then shapes, then objects in vision).
Activations also bound or reshape signal magnitude. Sigmoid squashes to
(0, 1); ReLU zeroes negatives; softmax turns a vector of logits
into a probability distribution. Those ranges matter for numerical stability
and for matching the loss function at the output layer.
Core activation functions
ReLU — Rectified Linear Unit
f(z) = max(0, z). The default hidden activation for CNNs and
many MLPs since the early 2010s. Cheap to compute (one comparison), sparse activations
(roughly half the units are zero on random init), and its gradient is either
0 or 1 — no saturation on the positive side, which helps deep networks train.
Dead ReLU problem: if a neuron's weights and bias drive its pre-activation negative for every training input, the unit outputs 0 and receives zero gradient, so it stops learning and cannot recover on its own. Mitigations: Leaky ReLU, PReLU (learnable slope), better initialization (He init for ReLU nets), or lower learning rates.
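A minimal sketch of ReLU, its gradient, and a dead unit. The weights, bias, and inputs are hypothetical, chosen so that the pre-activation is negative for every example; the gradient at exactly z = 0 is set to 0 by convention.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Common convention: treat the gradient at z == 0 as 0.
    return 1.0 if z > 0 else 0.0

# Hypothetical unit whose large negative bias keeps every pre-activation below zero.
w, b = [0.5, -0.3], -10.0
inputs = [[1.0, 2.0], [3.0, -1.0], [0.0, 4.0]]

pre_activations = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in inputs]
outputs = [relu(z) for z in pre_activations]
gradients = [relu_grad(z) for z in pre_activations]

print(outputs)    # every output is 0.0
print(gradients)  # every gradient is 0.0, so no update reaches w or b
```

Because the gradient through the unit is zero on every example, gradient descent has no signal to move `w` or `b` back into the active region.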
Leaky ReLU and variants
f(z) = z if z > 0, else αz with small
α (often 0.01). Negative inputs get a tiny gradient instead of
zero. Parametric ReLU (PReLU) learns α per channel. ELU and
SELU smooth the negative side further; SELU pairs with specific initialization
for self-normalizing nets — niche but useful when batch sizes are tiny.
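As a rough illustration, Leaky ReLU and its gradient fit in a few lines. The default `alpha=0.01` follows the value quoted above; the function names are this sketch's own.

```python
def leaky_relu(z, alpha=0.01):
    # Positive inputs pass through; negative inputs are scaled by alpha.
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    # The negative side keeps a small, non-zero slope.
    return 1.0 if z > 0 else alpha

print(leaky_relu(3.0), leaky_relu(-10.0))
print(leaky_relu_grad(3.0), leaky_relu_grad(-10.0))
```

The non-zero slope on the negative side is what lets a unit with a negative pre-activation keep receiving updates.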
Sigmoid
f(z) = 1 / (1 + e^(−z)). Outputs in (0, 1),
interpretable as probability. Historically the default; now mostly relegated
to binary output layers and gates (LSTM forget gates). Problems at depth:
the derivative is f'(z) = f(z)(1 − f(z)), which peaks at 0.25 (at z = 0)
and approaches zero when saturated. This is the vanishing gradient problem that
stalled deep nets before ReLU.
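The saturation behaviour is easy to see numerically. This is a sketch, not a framework implementation: it branches on the sign of z so that `math.exp` never receives a large positive argument.

```python
import math

def sigmoid(z):
    # Branch on sign so exp() is only ever called on a non-positive value.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```

The derivative is largest at z = 0 and shrinks quickly as |z| grows, which is the saturation described above.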
Hyperbolic tangent (tanh)
f(z) = (e^z − e^(−z)) / (e^z + e^(−z)).
Zero-centered output in (−1, 1) — often faster convergence than
sigmoid in shallow RNN hidden states. Still saturates at extremes; largely
replaced by ReLU in feedforward stacks and by gated mechanisms in modern
recurrent and transformer blocks.
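Two standard facts make tanh easier to reason about: its derivative is 1 − tanh²(z), and it is a rescaled sigmoid, tanh(z) = 2σ(2z) − 1. The sketch below checks both using only the standard library; the helper names are illustrative.

```python
import math

def tanh_grad(z):
    t = math.tanh(z)
    return 1.0 - t * t

def tanh_via_sigmoid(z):
    # tanh(z) = 2 * sigmoid(2z) - 1
    return 2.0 / (1.0 + math.exp(-2.0 * z)) - 1.0

for z in (0.0, 0.7, 3.0):
    print(z, math.tanh(z), tanh_via_sigmoid(z), tanh_grad(z))
```

The gradient starts at 1 for z = 0, four times the sigmoid's peak, but still decays toward zero once the unit saturates.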
Softmax
Applied to a vector z of logits:
f(z_i) = e^(z_i) / Σ_j e^(z_j).
Outputs sum to 1 — a categorical probability distribution. Standard final
layer for multi-class classification paired with categorical cross-entropy.
Not used between hidden layers (it couples all outputs). Temperature scaling
divides logits before softmax to sharpen or flatten distributions — common in
knowledge distillation and LLM sampling.
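A minimal softmax with temperature might look like the following. Subtracting the maximum before exponentiating is a common stability trick and does not change the result; the `temperature` parameter name is this sketch's choice.

```python
import math

def softmax(logits, temperature=1.0):
    # Divide logits by the temperature, then shift by the max for stability.
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]
probs = softmax(logits)
sharp = softmax(logits, temperature=0.5)  # temperature below 1 sharpens
flat = softmax(logits, temperature=5.0)   # temperature above 1 flattens

print(probs, sharp, flat)
```

Low temperatures concentrate mass on the largest logit; high temperatures push the distribution toward uniform.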
GELU — Gaussian Error Linear Unit
f(z) = z · Φ(z) where Φ is the standard normal CDF, often approximated
as 0.5z(1 + tanh(√(2/π)(z + 0.044715z³))); frameworks such as PyTorch expose both the exact and the tanh form.
Smooth, non-monotonic for small negative inputs, with a loosely probabilistic interpretation (dropout
connection). Default activation in BERT, GPT-family transformers, and many
vision transformers. Slightly more compute than ReLU but often better quality
at scale.
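To compare the two forms, the sketch below implements the exact definition through `math.erf` (using Φ(z) = 0.5(1 + erf(z/√2))) alongside the tanh approximation quoted above. The function names are illustrative.

```python
import math

def gelu_exact(z):
    # z * Phi(z), with Phi written in terms of the error function.
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # The tanh approximation from the text.
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

grid = [z / 10.0 for z in range(-50, 51)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)

print(max_gap)
print(gelu_exact(-3.0), gelu_exact(-0.75), gelu_exact(0.0))
```

The two forms agree closely over this range. The second print shows the dip on the negative side: the output at −0.75 is lower than at −3.0, so the function is not monotonic there.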
Swish / SiLU
f(z) = z · σ(z) where σ is sigmoid. Self-gated, smooth, unbounded
above, bounded below. Google's empirical search found it competitive with or
better than ReLU on deep nets; PyTorch calls it SiLU. EfficientNet and some
modern CNNs use it. Middle ground between ReLU sparsity and GELU smoothness.
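A short sketch of SiLU, with a coarse grid search to illustrate the "bounded below" property. The grid and helper names are this example's own choices.

```python
import math

def sigmoid(z):
    # Sign-branched form to avoid overflow in exp().
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def silu(z):
    # Self-gating: the input is scaled by its own sigmoid.
    return z * sigmoid(z)

grid = [z / 100.0 for z in range(-1000, 1001)]
lowest = min(silu(z) for z in grid)

print(lowest)                 # roughly -0.28, reached a little below z = -1
print(silu(20.0), silu(-20.0))
```

For large positive inputs SiLU behaves like the identity, and for large negative inputs it returns to zero from below, unlike ReLU's hard cutoff.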
Where each activation belongs
| Layer role | Typical activation | Why |
|---|---|---|
| Hidden (CNN/MLP) | ReLU, Leaky ReLU, GELU, Swish | Nonlinearity without heavy saturation; ReLU for speed, GELU/Swish for quality |
| Binary classification output | Sigmoid | Single probability in (0,1); pairs with binary cross-entropy |
| Multi-class output | Softmax | Mutually exclusive class probabilities; pairs with categorical cross-entropy |
| Multi-label output | Sigmoid per label | Independent probabilities; not softmax (labels are not exclusive) |
| Regression output | Linear (identity) | Unbounded real values; MSE or Huber loss on raw output |
| Transformer FFN | GELU or Swish | Framework and paper defaults; rarely ReLU in modern LLMs |
| RNN/LSTM gates | Sigmoid + tanh | Gates in (0,1), candidate and output values in (−1,1); the cell state itself is unbounded. Architecture-specific |
| Bounded regression (e.g. pixel [0,1]) | Sigmoid or tanh rescale | Hard output constraints when loss alone is insufficient |
A common beginner mistake: softmax hidden activations. Softmax is a normalization across a vector, not a per-neuron nonlinearity — using it inside hidden layers couples every unit in the layer and destroys the independence you want from width.
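The difference between a multi-class and a multi-label head shows up when the same logits go through softmax and through per-label sigmoid. The logits below are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, -1.0]  # illustrative scores for three classes or labels

multi_class = softmax(logits)               # coupled: probabilities sum to 1
multi_label = [sigmoid(z) for z in logits]  # independent: each in (0, 1)

print(multi_class, sum(multi_class))
print(multi_label, sum(multi_label))
```

With softmax, raising one class's probability must lower the others. With per-label sigmoid, the first two labels can both be likely at once, which is what a multi-label task needs.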
Activations and gradient flow
During backpropagation, gradients multiply through each layer's activation
derivative. The sigmoid derivative is at most 0.25; the tanh derivative is at most 1
but falls toward zero when the unit saturates. Multiply twenty
small factors and early layers receive near-zero updates: vanishing gradients.
ReLU's gradient is exactly 1 for positive inputs, so signal passes through unchanged (when the unit is active). That is one reason very deep CNNs became trainable, alongside residual connections and normalization. The tradeoff: dead ReLUs pass zero gradient. GELU and Swish have smooth, non-zero derivatives almost everywhere, which means fewer dead units, slightly denser activations, and more FLOPs.
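A back-of-the-envelope calculation makes the scale concrete. This sketch deliberately ignores the weight matrices and looks only at the activation-derivative factor, assuming the best case for sigmoid (every unit sitting at z = 0) and an always-active path for ReLU.

```python
def activation_factor(per_layer_derivative, depth):
    """Product of identical activation derivatives across `depth` layers."""
    return per_layer_derivative ** depth

# Best case for sigmoid: derivative 0.25 at every layer.
sigmoid_best_case = activation_factor(0.25, 20)

# An active ReLU path: derivative 1 at every layer.
relu_active_path = activation_factor(1.0, 20)

print(sigmoid_best_case)  # about 9.1e-13
print(relu_active_path)   # 1.0
```

Even in sigmoid's best case the factor after twenty layers is around one part in a trillion; saturated units make it smaller still.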
Exploding gradients are less about activation choice and more about weight scale and depth — but unbounded activations (ReLU, GELU) can let activations grow large if weights are poorly initialized. Pair activations with sensible init (He for ReLU, Xavier for tanh/sigmoid) and batch or layer normalization when training is unstable. Gradient clipping in the optimizer caps update magnitude regardless of activation.
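For reference, the normal-distribution variants of the two initializers differ only in the standard deviation they use: He uses sqrt(2 / fan_in), Xavier uses sqrt(2 / (fan_in + fan_out)). The sketch below is a plain-Python illustration with hypothetical helper names and layer sizes; real projects would normally call the framework's built-in initializers.

```python
import math
import random

def he_std(fan_in):
    # He (Kaiming) normal: variance 2 / fan_in, intended for ReLU-like units.
    return math.sqrt(2.0 / fan_in)

def xavier_std(fan_in, fan_out):
    # Xavier (Glorot) normal: variance 2 / (fan_in + fan_out), for tanh/sigmoid.
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, std, seed=0):
    """Return a fan_out x fan_in matrix of zero-mean Gaussian weights."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# Illustrative layer: 512 inputs, 256 outputs.
W_relu = init_weights(512, 256, he_std(512))
W_tanh = init_weights(512, 256, xavier_std(512, 256))

print(he_std(512), xavier_std(512, 256))
```

He init uses the larger scale because ReLU zeroes roughly half of its inputs, so the surviving half needs more variance to keep the signal magnitude steady across layers.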
Choosing an activation: decision guide
| Scenario | Recommended | Avoid |
|---|---|---|
| Standard CNN image classifier | ReLU or Swish | Sigmoid in hidden layers |
| Transformer / LLM feed-forward block | GELU (match pretrained) | Changing activation when fine-tuning without re-tuning |
| Small MLP on tabular data | ReLU or GELU | Overthinking — either works with good tuning |
| Many dead ReLU units in logs | Leaky ReLU, PReLU, or lower LR | Stacking more layers before fixing init |
| Binary classifier last layer | Sigmoid + BCE | Softmax with two outputs (works but redundant) |
| 1000-class ImageNet head | Linear logits + softmax + CE | Sigmoid per class (treats labels as independent) |
| Latency-critical edge inference | ReLU (fastest) | GELU if marginal accuracy gain does not justify cost |
| Porting a research paper | Match paper exactly | Swapping GELU for ReLU "because ReLU is standard" |
Activations in modern architectures
Convolutional networks: ReLU dominated the ResNet/VGG era; EfficientNet and ConvNeXt moved toward Swish/GELU for marginal accuracy gains. Depthwise separable conv stacks behave similarly. Activation choice is rarely the bottleneck; data and architecture matter more.
Transformers: the attention projections are linear and the only nonlinearity inside attention is the softmax over scores; the per-token nonlinearity lives in the position-wise feed-forward network (two linear layers with an activation between). GPT, BERT, LLaMA, and Mistral families overwhelmingly use GELU or SwiGLU (Swish-gated linear unit, a variant that splits the FFN's first projection into two branches and gates one branch with the SiLU of the other). Do not change these when loading pretrained weights.
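A SwiGLU feed-forward block can be sketched as follows, assuming the common bias-free form out = W_down · (SiLU(W_gate · x) ⊙ (W_up · x)). The matrix names and the tiny weights are hypothetical; real implementations use batched tensor ops.

```python
import math

def silu(z):
    # Sign-branched sigmoid to avoid overflow, then self-gating.
    if z >= 0:
        s = 1.0 / (1.0 + math.exp(-z))
    else:
        e = math.exp(z)
        s = e / (1.0 + e)
    return z * s

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    gate = [silu(g) for g in matvec(W_gate, x)]  # gated branch
    up = matvec(W_up, x)                         # linear branch
    hidden = [g * u for g, u in zip(gate, up)]   # elementwise product
    return matvec(W_down, hidden)                # project back to model width

# Illustrative sizes: model width 2, hidden width 3.
W_gate = [[0.5, -1.0], [1.0, 0.2], [-0.3, 0.8]]
W_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_down = [[1.0, -1.0, 0.5], [0.2, 0.3, -0.4]]

out = swiglu_ffn([1.0, -2.0], W_gate, W_up, W_down)
print(out)
```

Compared with a plain GELU FFN, this form uses three weight matrices instead of two, which is why swapping one for the other is not possible on pretrained weights.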
Residual connections: skip connections add the input to the block output before the next activation, which changes effective gradient paths — activations sit inside the residual branch, not on the skip itself. Pre-activation ResNet (norm → activation → conv) vs post-activation ordering affects where ReLU sits; follow the reference implementation.
Common mistakes
- Softmax in hidden layers — couples neurons; use ReLU/GELU instead.
- Sigmoid + MSE for classification — use cross-entropy with sigmoid/softmax; MSE gives weak gradients near saturation.
- Softmax on multi-label problems — independent labels need sigmoid per output, not mutually exclusive softmax.
- Ignoring dead ReLUs — histogram activations; if >30% units are permanently zero, switch activation or fix init.
- Changing activation when fine-tuning — pretrained weights assume a specific nonlinearity; mismatch hurts convergence.
- Double activation — applying ReLU after a layer that already includes ReLU (e.g. duplicate in custom modules).
- Applying softmax before loss in framework code — most libraries expect raw logits and apply softmax internally in cross-entropy for numerical stability (log-sum-exp trick).
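The last point is easier to see with the log-sum-exp trick written out. This is a plain-Python sketch of cross-entropy computed directly from logits, with illustrative function names; it is not any particular library's implementation.

```python
import math

def log_softmax(logits):
    # log-sum-exp with the max subtracted, so exp() never overflows.
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy_from_logits(logits, target_index):
    # Negative log-probability of the correct class.
    return -log_softmax(logits)[target_index]

# Extreme logits stay finite here; a naive softmax would overflow on exp(1000).
confident_right = cross_entropy_from_logits([1000.0, 0.0], 0)
confident_wrong = cross_entropy_from_logits([1000.0, 0.0], 1)
uniform = cross_entropy_from_logits([0.0, 0.0, 0.0], 0)

print(confident_right, confident_wrong, uniform)
```

Applying softmax first and then taking the log separately would turn the tiny probability of the wrong class into log(0), which is why losses that accept logits are the safer interface.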
Practitioner checklist
- Confirm every hidden block has a nonlinear activation — no accidental linear stacks.
- Match output activation to loss: sigmoid+BCE, softmax+CE, linear+MSE.
- Default hidden activation: ReLU for CNNs/MLPs; GELU for transformers.
- Use He initialization with ReLU; Xavier with tanh/sigmoid.
- Monitor fraction of zero activations (dead ReLU diagnostic).
- When porting pretrained models, copy activation exactly from the source.
- Profile inference latency if choosing between ReLU and GELU on edge devices.
- For multi-label tasks, verify sigmoid-per-label — not softmax.
- Pass raw logits to the loss function unless the API requires otherwise.
- Document activation choice in model cards — downstream quantisation depends on it.
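The dead-ReLU diagnostic from the checklist can be sketched as a small helper. It assumes post-ReLU activations collected as a list of rows (one row per example, one value per unit); the function name and the sample batch are hypothetical.

```python
def dead_fraction(activations):
    """Fraction of units that output exactly 0 on every example in the batch."""
    n_units = len(activations[0])
    dead = sum(
        1 for j in range(n_units)
        if all(row[j] == 0.0 for row in activations)
    )
    return dead / n_units

# Illustrative post-ReLU activations: 3 examples, 3 units.
batch = [
    [0.0, 1.2, 0.0],
    [0.0, 0.0, 0.3],
    [0.0, 2.0, 0.0],
]

print(dead_fraction(batch))  # unit 0 never fires, so one unit in three is dead
```

A unit that is zero on a single batch is not necessarily dead; the measurement is more trustworthy when taken over many batches or a full validation pass.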
Key takeaways
- Nonlinear activations are what make depth meaningful — without them, layers collapse to one linear map.
- ReLU is the fast default for CNNs; GELU/Swish dominate transformers and modern vision models.
- Sigmoid and softmax belong at output layers matched to your loss, not scattered through hidden depth.
- Gradient health depends on activation derivatives — saturation causes vanishing gradients; dead ReLUs cause silent unit death.
- When in doubt, match the reference architecture and tune learning rate before swapping activations.
Related reading
- Backpropagation explained — how activation derivatives flow through the computational graph
- Deep learning explained — stacked layers, training loops, and when neural nets beat classical ML
- Loss functions explained — pairing cross-entropy with softmax and BCE with sigmoid
- Neural network optimizers explained — SGD, Adam, learning rates, and gradient clipping