
Activation functions explained

Stack ten linear layers in a neural network and you still have one big linear transformation: matrix multiplies compose, but they never bend the decision boundary. Activation functions are the small nonlinear gates applied after each weighted sum that let deep models approximate arbitrary functions. Pick the wrong one and training stalls (vanishing gradients with sigmoid in a 50-layer CNN) or units stop learning (ReLU dying on a negative-only stream). This guide explains what activations do, walks through the functions you will actually see in production (ReLU, Leaky ReLU, sigmoid, tanh, softmax, GELU, and Swish), maps them to hidden vs output layers, connects them to gradient flow during backpropagation, and ends with a decision table and checklist for deep learning practitioners.

Why nonlinearity is non-negotiable

A single neuron computes z = w·x + b, then applies an activation a = f(z). Without f, or with f linear (identity), a network of arbitrary depth is equivalent to one weight matrix — it can only learn linear decision surfaces and linear regression fits.
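The collapse can be checked numerically. The sketch below uses plain Python lists, omits biases for brevity, and uses made-up weight values; the helper names are illustrative rather than taken from any library.

```python
# Two stacked linear layers with no activation, written with plain lists.
def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    """Multiply matrix A by matrix B (both lists of rows)."""
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def relu(v):
    return [max(0.0, t) for t in v]

W1 = [[1.0, 2.0], [0.5, -1.0]]  # first layer weights (illustrative values)
W2 = [[2.0, 0.0], [1.0, 3.0]]   # second layer weights (illustrative values)
x = [3.0, -2.0]

stacked = matvec(W2, matvec(W1, x))          # layer 2 applied to layer 1
collapsed = matvec(matmul(W2, W1), x)        # one equivalent matrix W2·W1
with_relu = matvec(W2, relu(matvec(W1, x)))  # nonlinearity between the layers

print(stacked, collapsed, with_relu)
```

The first two results agree for every input, because the product W2·W1 is itself a single matrix. Inserting ReLU between the layers is what stops the two layers from folding into one.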

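Nonlinear f breaks that collapse. The universal approximation theorem guarantees that a feedforward network with one sufficiently wide hidden layer and a suitable nonlinear activation (any continuous non-polynomial function qualifies) can approximate any continuous function on a compact domain to arbitrary accuracy. The theorem says nothing about how wide that layer must be or whether training will find the weights, so in practice we use many narrower layers because depth composes features hierarchically (edges, then shapes, then objects in vision).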

Activations also bound or reshape signal magnitude. Sigmoid squashes to (0, 1); ReLU zeroes negatives; softmax turns a vector of logits into a probability distribution. Those ranges matter for numerical stability and for matching the loss function at the output layer.

Core activation functions

ReLU — Rectified Linear Unit

f(z) = max(0, z). The default hidden activation for CNNs and many MLPs since the early 2010s. It is cheap to compute (one comparison), produces sparse activations (roughly half the units are zero on random init), and has a gradient of either 0 or 1, so there is no saturation on the positive side, which helps deep networks train.

Dead ReLU problem: if a neuron's weights push all inputs negative, the unit outputs 0 forever and its gradient is 0 — it stops learning. Mitigations: Leaky ReLU, PReLU (learnable slope), better initialization (He init for ReLU nets), or lower learning rates.
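A minimal sketch of a dead unit, assuming a single hypothetical neuron whose bias has drifted strongly negative. The weight, bias, and input values are invented for illustration.

```python
def relu(z):
    return max(0.0, z)

def relu_grad(z):
    # Derivative of ReLU; the value at exactly z = 0 is a convention (0 here).
    return 1.0 if z > 0 else 0.0

# Hypothetical unit: every pre-activation ends up negative for these inputs.
w, b = 0.5, -10.0
inputs = [0.1, 1.0, 2.5, 4.0]

pre_acts = [w * x + b for x in inputs]
outputs = [relu(z) for z in pre_acts]
# d(output)/dw for each sample is relu_grad(z) * x.
grads_w = [relu_grad(z) * x for z, x in zip(pre_acts, inputs)]

print(outputs)
print(grads_w)
```

Every output and every weight gradient is zero, so gradient descent has no signal with which to move w or b back into the active region.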

Leaky ReLU and variants

f(z) = z if z > 0, else αz with small α (often 0.01). Negative inputs get a tiny gradient instead of zero. Parametric ReLU (PReLU) learns α per channel. ELU and SELU smooth the negative side further; SELU pairs with specific initialization for self-normalizing nets — niche but useful when batch sizes are tiny.
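The definition above translates directly into code. This is a scalar sketch with α fixed at 0.01; a PReLU would treat alpha as a trainable parameter instead.

```python
def leaky_relu(z, alpha=0.01):
    """Identity for positive inputs, small slope alpha for the rest."""
    return z if z > 0 else alpha * z

def leaky_relu_grad(z, alpha=0.01):
    """Slope is 1 on the positive side and alpha elsewhere, never exactly 0."""
    return 1.0 if z > 0 else alpha

for z in (-2.0, -0.5, 0.5, 2.0):
    print(z, leaky_relu(z), leaky_relu_grad(z))
```

Because the negative-side slope is alpha rather than 0, a unit stuck in the negative region still receives a (small) gradient and can recover.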

Sigmoid

f(z) = 1 / (1 + e^(−z)). Outputs in (0, 1), interpretable as probability. Historically the default; now mostly relegated to binary output layers and gates (LSTM forget gates). Problems at depth: the derivative f'(z) = f(z)(1 − f(z)) peaks at 0.25 (at z = 0) and approaches zero when the unit saturates, which is the vanishing gradient problem that stalled deep nets before ReLU.
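The derivative's peak and its collapse under saturation are easy to check. The sketch below uses a branch on the sign of z to avoid overflow in the exponential, which is one common way to write a stable sigmoid and not the only one.

```python
import math

def sigmoid(z):
    # Branch on sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def sigmoid_grad(z):
    s = sigmoid(z)
    return s * (1.0 - s)

for z in (0.0, 2.0, 5.0, 10.0):
    print(z, sigmoid(z), sigmoid_grad(z))
```

The gradient is largest at z = 0 and shrinks rapidly as |z| grows, which is the saturation behaviour described above.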

Hyperbolic tangent (tanh)

f(z) = (e^z − e^(−z)) / (e^z + e^(−z)). Zero-centered output in (−1, 1), which often gives faster convergence than sigmoid in shallow RNN hidden states. Still saturates at extremes; largely replaced by ReLU in feedforward stacks and by gated mechanisms in modern recurrent and transformer blocks.

Softmax

Applied to a vector z of logits: f(z)_i = e^(z_i) / Σ_j e^(z_j). Outputs sum to 1, forming a categorical probability distribution. Standard final layer for multi-class classification paired with categorical cross-entropy. Not used as a hidden-layer activation (it couples all outputs). Temperature scaling divides logits by a temperature T before softmax: T < 1 sharpens the distribution and T > 1 flattens it. It is common in knowledge distillation and LLM sampling.
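A small sketch of softmax with a temperature parameter. Subtracting the maximum logit before exponentiating is a standard stability step that leaves the result unchanged; the logit values are made up.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax over a list of logits, with optional temperature scaling."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)                          # shift so the largest exponent is 0
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]                     # illustrative values
sharp = softmax(logits, temperature=0.5)
plain = softmax(logits)
flat = softmax(logits, temperature=2.0)

print(sharp)
print(plain)
print(flat)
```

The three outputs rank the classes identically; only how concentrated the probability mass is on the top class changes with temperature.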

GELU — Gaussian Error Linear Unit

f(z) = z · Φ(z) where Φ is the standard normal CDF. Frameworks either evaluate Φ exactly via the error function or use the approximation 0.5z(1 + tanh(√(2/π)(z + 0.044715z³))). Smooth and non-monotonic (it dips slightly below zero for small negative inputs), with a loosely probabilistic intuition (dropout connection). Default activation in BERT, GPT-family transformers, and many vision transformers. Slightly more compute than ReLU but often better quality at scale.
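Both forms of GELU can be written with the standard library, which makes it easy to see how close the tanh approximation is. The function names are illustrative.

```python
import math

def gelu_exact(z):
    # Standard normal CDF: Phi(z) = 0.5 * (1 + erf(z / sqrt(2)))
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))

def gelu_tanh(z):
    # Tanh-based approximation quoted in the text.
    inner = math.sqrt(2.0 / math.pi) * (z + 0.044715 * z ** 3)
    return 0.5 * z * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-400, 401)]
max_gap = max(abs(gelu_exact(z) - gelu_tanh(z)) for z in grid)

print(max_gap)
print(gelu_exact(-3.0), gelu_exact(-0.75), gelu_exact(0.0))
```

The second print line shows the non-monotonic dip: the value at z = −0.75 is lower than at both z = −3 and z = 0. If you port weights between frameworks, it is worth checking which of the two forms the source model used.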

Swish / SiLU

f(z) = z · σ(z) where σ is sigmoid. Self-gated, smooth, unbounded above, bounded below. Google's empirical search found it competitive with or better than ReLU on deep nets; PyTorch calls it SiLU. EfficientNet and some modern CNNs use it. Middle ground between ReLU sparsity and GELU smoothness.
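A scalar sketch of Swish/SiLU that also scans a grid to illustrate the "bounded below, unbounded above" shape. The grid range and step are arbitrary choices.

```python
import math

def sigmoid(z):
    # Branch on sign so math.exp never receives a large positive argument.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def silu(z):
    """Swish / SiLU: the input gated by its own sigmoid."""
    return z * sigmoid(z)

grid = [i / 100.0 for i in range(-1000, 1001)]
lowest = min(silu(z) for z in grid)

print(lowest)
print(silu(-10.0), silu(0.0), silu(10.0))
```

For large positive z the gate is close to 1 and the function tracks the identity; for large negative z the gate closes and the output returns toward 0 from below.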

Where each activation belongs

Layer role	Typical activation	Why
Hidden (CNN/MLP)	ReLU, Leaky ReLU, GELU, Swish	Nonlinearity without heavy saturation; ReLU for speed, GELU/Swish for quality
Binary classification output	Sigmoid	Single probability in (0,1); pairs with binary cross-entropy
Multi-class output	Softmax	Mutually exclusive class probabilities; pairs with categorical cross-entropy
Multi-label output	Sigmoid per label	Independent probabilities; not softmax (labels are not exclusive)
Regression output	Linear (identity)	Unbounded real values; MSE or Huber loss on raw output
Transformer FFN	GELU or Swish	Framework and paper defaults; rarely ReLU in modern LLMs
RNN/LSTM gates	Sigmoid + tanh	Gates in (0,1); candidate and output values squashed to (−1,1) by tanh (architecture-specific)
Bounded regression (e.g. pixel [0,1])	Sigmoid or tanh rescale	Hard output constraints when loss alone is insufficient
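The multi-class and multi-label rows differ in whether outputs compete. A short sketch with hypothetical logits for three labels makes the difference concrete.

```python
import math

def sigmoid(z):
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.5, -3.0]                  # hypothetical scores for three labels
per_label = [sigmoid(z) for z in logits]   # independent probabilities
exclusive = softmax(logits)                # probabilities that compete for mass

print(per_label, sum(per_label))
print(exclusive, sum(exclusive))
```

With sigmoid, the first two labels can both exceed 0.5 because each is scored independently. With softmax the same logits must share a total of 1, so a strong second label is pushed down by the first.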

A common beginner mistake: softmax hidden activations. Softmax is a normalization across a vector, not a per-neuron nonlinearity — using it inside hidden layers couples every unit in the layer and destroys the independence you want from width.

Activations and gradient flow

During backpropagation, gradients multiply through each layer's activation derivative. For sigmoid/tanh, |f'(z)| ≤ 0.25 (sigmoid) or ≤ 1 (tanh, but still small when saturated). Multiply twenty small factors and early layers receive near-zero updates — vanishing gradients.
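The arithmetic behind "multiply twenty small factors" can be sketched as follows. This is a deliberate simplification: it assumes every layer contributes the same activation derivative and ignores the weight matrices, which also multiply into the real gradient.

```python
import math

def gradient_scale(per_layer_derivative, depth):
    """Product of identical per-layer activation derivatives (weights ignored)."""
    return per_layer_derivative ** depth

sigmoid_best_case = gradient_scale(0.25, 20)  # sigmoid at its maximum slope
relu_active_path = gradient_scale(1.0, 20)    # ReLU with every unit active

print(sigmoid_best_case)
print(relu_active_path)
```

Even in sigmoid's best case the factor after twenty layers is 0.25^20 = 2^−40, on the order of 10^−12, while an all-active ReLU path leaves the scale at 1.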

ReLU's gradient is exactly 1 for positive inputs, so signal passes through unchanged (when the unit is active). That is one reason very deep CNNs became trainable; ResNets additionally rely on skip connections and normalization. The tradeoff: dead ReLUs pass zero gradient. GELU and Swish have smooth, non-zero derivatives almost everywhere, which means fewer dead units, slightly denser activations, and more FLOPs.

Exploding gradients are less about activation choice and more about weight scale and depth, but unbounded activations (ReLU, GELU) can let activations grow large if weights are poorly initialized. Pair activations with sensible init (He for ReLU, Xavier for tanh/sigmoid) and batch or layer normalization when training is unstable. Gradient clipping caps gradient magnitude before the optimizer step, regardless of activation.
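The two initialization schemes named above differ only in the variance they assign to the weights. A sketch of the normal-distribution variants, with illustrative helper names and a hypothetical layer size:

```python
import math
import random

def he_normal_std(fan_in):
    """He (Kaiming) normal init: variance 2 / fan_in, intended for ReLU-style units."""
    return math.sqrt(2.0 / fan_in)

def xavier_normal_std(fan_in, fan_out):
    """Xavier (Glorot) normal init: variance 2 / (fan_in + fan_out), for tanh/sigmoid."""
    return math.sqrt(2.0 / (fan_in + fan_out))

def init_weights(fan_in, fan_out, std, seed=0):
    """Sample a fan_out x fan_in weight matrix from N(0, std^2)."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std) for _ in range(fan_in)] for _ in range(fan_out)]

# Hypothetical layer: 512 inputs, 256 outputs.
he_std = he_normal_std(512)
xavier_std = xavier_normal_std(512, 256)
W = init_weights(512, 256, he_std)

print(he_std, xavier_std, len(W), len(W[0]))
```

He init uses a larger variance than Xavier for the same layer shape when fan_out is at least fan_in, compensating for ReLU zeroing roughly half of the pre-activations.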

Choosing an activation: decision guide

Scenario	Recommended	Avoid
Standard CNN image classifier	ReLU or Swish	Sigmoid in hidden layers
Transformer / LLM feed-forward block	GELU (match pretrained)	Changing activation when fine-tuning without re-tuning
Small MLP on tabular data	ReLU or GELU	Overthinking — either works with good tuning
Many dead ReLU units in logs	Leaky ReLU, PReLU, or lower LR	Stacking more layers before fixing init
Binary classifier last layer	Sigmoid + BCE	Softmax with two outputs (works but redundant)
1000-class ImageNet head	Linear logits + softmax + CE	Sigmoid per class (treats labels as independent)
Latency-critical edge inference	ReLU (fastest)	GELU if marginal accuracy gain does not justify cost
Porting a research paper	Match paper exactly	Swapping GELU for ReLU "because ReLU is standard"

Activations in modern architectures

Convolutional networks: ReLU dominated ResNet/VGG era; EfficientNet and ConvNeXt moved toward Swish/GELU for marginal accuracy gains. Depthwise separable conv stacks behave similarly — activation choice is rarely the bottleneck; data and architecture matter more.

Transformers: the attention projections are linear, and the only nonlinearity inside attention is the softmax over attention scores; the per-token nonlinearity lives in the position-wise feed-forward network (two linear layers with an activation between). GPT, BERT, LLaMA, and Mistral families overwhelmingly use GELU or SwiGLU (Swish-gated linear unit, a variant that splits the FFN input projection into two branches and gates one with the SiLU of the other). Do not change these when loading pretrained weights.
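A minimal sketch of a SwiGLU feed-forward block for a single token vector, using plain lists and no biases. The weight names (W_gate, W_up, W_down) and the tiny dimensions are assumptions for illustration; real implementations differ in naming and layout.

```python
import math

def silu(z):
    # z * sigmoid(z), with a sign branch to keep math.exp from overflowing.
    if z >= 0:
        return z / (1.0 + math.exp(-z))
    e = math.exp(z)
    return z * e / (1.0 + e)

def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def swiglu_ffn(x, W_gate, W_up, W_down):
    """down( silu(gate(x)) * up(x) ), applied to one token vector."""
    gate = [silu(g) for g in matvec(W_gate, x)]   # gating branch
    up = matvec(W_up, x)                          # value branch
    hidden = [g * u for g, u in zip(gate, up)]    # elementwise gating
    return matvec(W_down, hidden)

# Illustrative shapes: model dim 2, hidden dim 3.
x = [1.0, -2.0]
W_gate = [[0.5, 0.1], [-0.3, 0.8], [1.0, 1.0]]
W_up = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_down = [[1.0, -1.0, 0.5], [0.2, 0.3, 0.4]]

print(swiglu_ffn(x, W_gate, W_up, W_down))
```

Compared with a plain two-layer FFN, the block has three weight matrices instead of two, which is why a pretrained SwiGLU checkpoint cannot be loaded into a GELU block of the same nominal width.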

Residual connections: skip connections add the block input to the block output, which gives gradients a path that bypasses the activations in the residual branch. In the original post-activation ResNet a ReLU follows the addition; pre-activation ResNet (norm → activation → conv) keeps every activation inside the branch and leaves the skip path as a pure identity. The ordering affects where ReLU sits, so follow the reference implementation.

Common mistakes

Softmax in hidden layers — couples neurons; use ReLU/GELU instead.
Sigmoid + MSE for classification — use cross-entropy with sigmoid/softmax; MSE gives weak gradients near saturation.
Softmax on multi-label problems — independent labels need sigmoid per output, not mutually exclusive softmax.
Ignoring dead ReLUs — histogram activations; if >30% units are permanently zero, switch activation or fix init.
Changing activation when fine-tuning — pretrained weights assume a specific nonlinearity; mismatch hurts convergence.
Double activation — applying ReLU after a layer that already includes ReLU (e.g. duplicate in custom modules).
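Applying softmax before loss in framework code: many loss functions (for example PyTorch's CrossEntropyLoss) expect raw logits and apply log-softmax internally for numerical stability (log-sum-exp trick), so an explicit softmax beforehand gets applied twice. Check whether your framework's loss takes logits or probabilities.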
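The log-sum-exp trick mentioned in the last item can be sketched in a few lines. This is a standalone illustration of the idea, not the code any particular framework runs.

```python
import math

def log_softmax(logits):
    """log(softmax(logits)) computed via log-sum-exp, without forming probabilities."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy_from_logits(logits, target_index):
    """Negative log-probability of the target class, taken directly from logits."""
    return -log_softmax(logits)[target_index]

print(cross_entropy_from_logits([0.0, 0.0], 0))
print(cross_entropy_from_logits([1000.0, 0.0], 0))
print(cross_entropy_from_logits([1000.0, 0.0], 1))
```

Working in log space keeps the loss finite even for extreme logits. Computing softmax first and then taking the log would hit log(0) in the last case, because the probability of the second class underflows to zero.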

Practitioner checklist

Confirm every hidden block has a nonlinear activation — no accidental linear stacks.
Match output activation to loss: sigmoid+BCE, softmax+CE, linear+MSE.
Default hidden activation: ReLU for CNNs/MLPs; GELU for transformers.
Use He initialization with ReLU; Xavier with tanh/sigmoid.
Monitor fraction of zero activations (dead ReLU diagnostic).
When porting pretrained models, copy activation exactly from the source.
Profile inference latency if choosing between ReLU and GELU on edge devices.
For multi-label tasks, verify sigmoid-per-label — not softmax.
Pass raw logits to the loss function unless the API requires otherwise.
Document activation choice in model cards — downstream quantisation depends on it.
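The dead-ReLU diagnostic from the checklist can be approximated by counting units that output zero for every sample in a batch. The sketch assumes post-ReLU activations arrive as a list of per-sample lists; the function name and the toy batch are hypothetical.

```python
def dead_unit_fraction(activations):
    """Fraction of units that are exactly zero for every sample in the batch.

    activations: list of samples, each a list of post-ReLU unit outputs.
    """
    n_units = len(activations[0])
    dead = sum(
        1
        for j in range(n_units)
        if all(sample[j] == 0.0 for sample in activations)
    )
    return dead / n_units

# Toy batch: 3 samples, 3 units. Only the first unit never fires.
batch = [
    [0.0, 1.2, 0.0],
    [0.0, 0.0, 0.3],
    [0.0, 2.0, 0.0],
]

print(dead_unit_fraction(batch))
```

A unit that is zero on one batch is not necessarily dead for good, so it is safer to track this fraction over many batches before changing the activation or the initialization.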

Key takeaways

Nonlinear activations are what make depth meaningful — without them, layers collapse to one linear map.
ReLU is the fast default for CNNs; GELU/Swish dominate transformers and modern vision models.
Sigmoid and softmax belong at output layers matched to your loss, not scattered through hidden depth.
Gradient health depends on activation derivatives — saturation causes vanishing gradients; dead ReLUs cause silent unit death.
When in doubt, match the reference architecture and tune learning rate before swapping activations.