Neural network weight initialization explained

Harbor Analytics' warehouse defect classifier was a straightforward 18-layer CNN on 224×224 board images until training flatlined. Validation accuracy stuck at 52%, above the 33% of uniform three-class guessing but well below the 82% that always predicting the majority class would score, while training loss crept down slowly. Gradient histograms showed early convolution layers receiving updates 100× smaller than the final dense head. The bug was not the architecture or learning rate: weights were drawn from Uniform(-0.5, 0.5) with no regard for layer width or activation. Re-initializing with He (Kaiming) normal for the ReLU conv blocks and Xavier (Glorot) uniform for the softmax output head brought validation accuracy to 91% within eight epochs at the same optimizer settings. Weight initialization sets the starting scale of activations and gradients before the first backward pass; get it badly wrong and careful hyperparameter tuning is unlikely to rescue the run. This guide explains fan-in and fan-out variance, Xavier and He derivations, biases and output layers, transformer and embedding inits, a Harbor Analytics worked example, an initialization decision table, common pitfalls, and a production checklist.

Why initialization scale matters

A neural network is a composition of linear transforms and nonlinear activations. At initialization, each layer multiplies its input by a weight matrix and adds a bias. If weights are too small, activations shrink toward zero as depth increases; gradients flowing backward shrink with them — the vanishing gradient problem described in our vanishing and exploding gradients guide. If weights are too large, activations blow up and gradients explode; weights become NaN within a few steps.

The goal of principled initialization is to keep activation variance and gradient variance roughly stable across layers at step zero — before batch normalization or training dynamics take over. You cannot fix a 50-layer network with a learning rate alone if layer 3 receives negligible gradients on day one.

Initialization interacts with your activation function: sigmoid and tanh saturate, with gradients near zero, when inputs are large in magnitude; ReLU passes positive inputs through unchanged and zeroes negative ones, which is about half of them when pre-activations are symmetric around zero; GELU and SiLU used in transformers have their own variance profiles. The correct init scheme matches the nonlinearity, not just the layer type.

Fan-in, fan-out, and variance preservation

For a linear layer with n_in inputs and n_out outputs, each output is a weighted sum of n_in terms. If inputs have variance Var(x) and weights are independent with mean zero and variance Var(w), the output variance is approximately:

Var(y) ≈ n_in × Var(w) × Var(x)

Fan-in (n_in) is how many inputs feed each neuron. Fan-out (n_out) is how many neurons each input feeds, so on the backward pass each input's gradient is a sum of n_out terms. Forward-pass stability wants Var(y) ≈ Var(x), which suggests Var(w) ≈ 1 / n_in. Backward-pass stability wants Var(∂L/∂x) ≈ Var(∂L/∂y), which suggests Var(w) ≈ 1 / n_out. You cannot satisfy both exactly with one scalar variance unless n_in = n_out, so standard schemes compromise.
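The forward-pass approximation can be checked numerically. The following is a minimal sketch in plain Python, assuming zero-mean Gaussian inputs and weights; the function name output_variance and the sizes used are illustrative choices, not part of any library.

```python
import random
import statistics

def output_variance(n_in, weight_std, input_std, samples=2000, seed=0):
    """Empirical variance of y = sum_i w_i * x_i for zero-mean Gaussian w and x.

    Weights and inputs are redrawn for every sample, so the estimate averages
    over both sources of randomness, as the formula does.
    """
    rng = random.Random(seed)
    outputs = []
    for _ in range(samples):
        y = sum(rng.gauss(0.0, weight_std) * rng.gauss(0.0, input_std)
                for _ in range(n_in))
        outputs.append(y)
    return statistics.pvariance(outputs)

n_in = 256
weight_std, input_std = 0.05, 1.0
predicted = n_in * weight_std ** 2 * input_std ** 2   # n_in * Var(w) * Var(x)
measured = output_variance(n_in, weight_std, input_std)
print(f"predicted={predicted:.3f} measured={measured:.3f}")
```

With these numbers the prediction is 256 × 0.0025 = 0.64, and the measured value should land within a few percent of it. Setting weight_std to 1/√n_in makes the prediction equal Var(x), which is the forward-stability condition.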

Xavier / Glorot initialization

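Xavier initialization (Glorot and Bengio, 2010) targets linear activations or symmetric nonlinearities like tanh and sigmoid. It sets weight variance to the reciprocal of the average of fan-in and fan-out (equivalently, the harmonic mean of the forward target 1/n_in and the backward target 1/n_out):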

Var(w) = 2 / (n_in + n_out)

Implementations draw from a uniform distribution on [-√(6/(n_in+n_out)), +√(6/(n_in+n_out))] or a normal with standard deviation √(2/(n_in+n_out)); both have the variance above. Use Xavier for output layers with sigmoid (binary classification) or tanh, and for legacy architectures that still use those activations in hidden layers.
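As a quick check that the uniform bound and the normal standard deviation describe the same variance, here is a small sketch in plain Python. The helper names xavier_uniform_bound and xavier_normal_std are made up for this example, and the layer sizes are arbitrary.

```python
import math
import random
import statistics

def xavier_uniform_bound(n_in, n_out):
    """Half-width b of Uniform(-b, b) whose variance is 2 / (n_in + n_out)."""
    return math.sqrt(6.0 / (n_in + n_out))

def xavier_normal_std(n_in, n_out):
    """Standard deviation of the normal variant."""
    return math.sqrt(2.0 / (n_in + n_out))

n_in, n_out = 512, 256
target_var = 2.0 / (n_in + n_out)
bound = xavier_uniform_bound(n_in, n_out)
std = xavier_normal_std(n_in, n_out)

# Uniform(-b, b) has variance b^2 / 3, so the bound must be sqrt(3) * std.
rng = random.Random(0)
draws = [rng.uniform(-bound, bound) for _ in range(50_000)]
empirical_var = statistics.pvariance(draws)
print(f"target={target_var:.6f} uniform={empirical_var:.6f} normal={std ** 2:.6f}")
```

The factor of 6 rather than 2 under the square root comes from the b²/3 variance of a uniform distribution on [-b, b].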

He / Kaiming initialization

He initialization (Kaiming He et al., 2015) accounts for ReLU zeroing roughly half its inputs, doubling the needed variance to preserve activation scale:

Var(w) = 2 / n_in

Normal draws use standard deviation √(2 / n_in); uniform draws use bounds ±√(6 / n_in), which give the same variance. This is the default for modern CNNs with ReLU or leaky ReLU hidden layers. PyTorch's nn.init.kaiming_normal_ and TensorFlow's HeNormal implement this directly.
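For a convolution, fan-in is the number of input channels times the kernel area. The arithmetic for one layer can be sketched as follows; the 64-channel 3×3 layer is an illustrative assumption, and the helper names are hypothetical.

```python
import math

def he_normal_std(fan_in):
    """Standard deviation giving Var(w) = 2 / fan_in."""
    return math.sqrt(2.0 / fan_in)

def he_uniform_bound(fan_in):
    """Half-width b of Uniform(-b, b) with the same variance (b^2 / 3 = 2 / fan_in)."""
    return math.sqrt(6.0 / fan_in)

# Illustrative layer: 3x3 convolution reading 64 input channels.
fan_in = 64 * 3 * 3
std = he_normal_std(fan_in)
bound = he_uniform_bound(fan_in)
print(f"fan_in={fan_in} std={std:.4f} bound={bound:.4f}")
```

For this layer fan-in is 576, so the standard deviation comes out near 0.059, roughly five times smaller than the 0.29 standard deviation of Uniform(-0.5, 0.5).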

Leaky ReLU and other variants

Leaky ReLU with negative slope a (often 0.01) passes a fraction (1+a²)/2 of variance through. He init generalizes to Var(w) = 2 / ((1+a²) × n_in). For GELU and Swish in transformers, practitioners often use Xavier or truncated normal with small standard deviation (0.02) on linear projections; many pretrained checkpoints ship with their own init recipes that training has already validated.
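The (1+a²)/2 factor can be verified empirically. This sketch assumes standard normal pre-activations and uses a slope of 0.2 purely so the effect is visible; the function names are illustrative.

```python
import random

def leaky_relu(x, slope):
    return x if x > 0 else slope * x

def second_moment_ratio(slope, samples=200_000, seed=0):
    """Ratio E[f(x)^2] / E[x^2] for x ~ N(0, 1) passed through leaky ReLU."""
    rng = random.Random(seed)
    after = before = 0.0
    for _ in range(samples):
        x = rng.gauss(0.0, 1.0)
        after += leaky_relu(x, slope) ** 2
        before += x * x
    return after / before

def leaky_he_variance(n_in, slope):
    """Weight variance that compensates for the ratio above."""
    return 2.0 / ((1.0 + slope ** 2) * n_in)

slope = 0.2
measured = second_moment_ratio(slope)
expected = (1.0 + slope ** 2) / 2.0
print(f"measured={measured:.4f} expected={expected:.4f}")
```

With slope 0 the expression reduces to plain He init, 2 / n_in. With the common slope 0.01 the correction is about 0.01%, which is why many codebases use the ReLU gain for leaky ReLU without noticeable harm.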

Biases, output layers, and embeddings

Biases are usually initialized to zero. An exception: the final bias of a classifier can be set to the log-odds of the class prior so initial predictions match label frequency — especially helpful with imbalanced data (see class imbalance in ML).
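To see why the log-prior bias works, consider what the network predicts when the final weights contribute logits near zero: the softmax of the bias alone. The sketch below uses the 0.82 / 0.12 / 0.06 class frequencies from the worked example later in this guide; the helper names are illustrative.

```python
import math

def log_prior_bias(priors):
    """Multi-class case: one bias per class, equal to the log of its frequency."""
    return [math.log(p) for p in priors]

def binary_log_odds_bias(positive_rate):
    """Binary case with a sigmoid output: a single bias equal to the log-odds."""
    return math.log(positive_rate / (1.0 - positive_rate))

def softmax(logits):
    shifted = [z - max(logits) for z in logits]   # subtract max for stability
    exps = [math.exp(z) for z in shifted]
    total = sum(exps)
    return [e / total for e in exps]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

priors = [0.82, 0.12, 0.06]
bias = log_prior_bias(priors)
initial_probs = softmax(bias)                     # recovers the priors
rare_class_prob = sigmoid(binary_log_odds_bias(0.06))
print([round(p, 4) for p in initial_probs], round(rare_class_prob, 4))
```

Because the priors sum to one, the softmax of their logs returns them unchanged, so the model's first predictions match label frequency instead of spending early updates learning it.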

Output layer weights for classification are often smaller than hidden layers. Some teams use Xavier on the final linear layer even when hidden layers use He, or scale down by an extra factor of 0.1 to keep initial logits near zero and softmax probabilities near uniform.

Embedding tables (tokens, categorical IDs) typically use normal initialization with small standard deviation (0.01–0.02) or uniform in the same range. Too-large embedding init produces extreme logits early in transformer training. Positional encodings are often fixed (sinusoidal) or learned with the same small-scale init as token embeddings.

Batch normalization gamma and beta default to 1 and 0 so the layer starts as identity normalization. Do not apply He scaling to BN parameters — they are not ordinary linear weights.

What not to do

All zeros: every neuron in a layer computes the same output and receives the same gradient; symmetry never breaks. Hidden weights must be random.
Same constant for every weight: same symmetry problem as zeros for hidden layers.
Large uniform without scaling: Harbor's Uniform(-0.5, 0.5) has variance 1/12 regardless of layer width, so on a 1024-wide layer each output's variance is about 85× its input's (1024 / 12), and that factor compounds with depth.
He init with sigmoid hidden layers: variance is higher than sigmoid needs, so activations are pushed toward saturation from the start.
Xavier init with ReLU hidden layers: when fan-in and fan-out are similar, activation variance roughly halves each layer; deep nets stall.
Reusing pretrained weights with wrong head init: fine-tuning a frozen backbone still needs a sensible classifier head; random large head weights can dominate early gradients.
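The scaling failures in this list can be reproduced in a few lines. The following is a toy simulation in plain Python, not Harbor's network: it pushes random inputs through eight square, fully connected ReLU layers of width 64 and reports the mean squared activation at the end. All names and sizes are assumptions chosen to keep the run short.

```python
import math
import random

def relu_stack_moments(draw_weight, width=64, depth=8, batch=16, seed=0):
    """Mean squared activation after each layer of a random ReLU stack.

    draw_weight(rng, n_in, n_out) returns a single weight sample.
    """
    rng = random.Random(seed)
    acts = [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(batch)]
    moments = []
    for _ in range(depth):
        weights = [[draw_weight(rng, width, width) for _ in range(width)]
                   for _ in range(width)]
        # One dense layer followed by ReLU, applied to every sample.
        acts = [[max(0.0, sum(w * x for w, x in zip(row, sample)))
                 for row in weights]
                for sample in acts]
        moments.append(sum(a * a for sample in acts for a in sample)
                       / (batch * width))
    return moments

schemes = {
    "uniform(-0.5, 0.5)": lambda rng, n_in, n_out: rng.uniform(-0.5, 0.5),
    "xavier normal": lambda rng, n_in, n_out: rng.gauss(
        0.0, math.sqrt(2.0 / (n_in + n_out))),
    "he normal": lambda rng, n_in, n_out: rng.gauss(
        0.0, math.sqrt(2.0 / n_in)),
}
final_moment = {name: relu_stack_moments(draw)[-1]
                for name, draw in schemes.items()}
for name, value in final_moment.items():
    print(f"{name:>20}: mean squared activation after 8 layers = {value:.4g}")
```

Under these assumptions the expected per-layer factor is 64 / 24 ≈ 2.7 for the unscaled uniform, 0.5 for Xavier, and 1 for He, so after eight layers the three schemes sit orders of magnitude apart. Individual runs are noisy at this width, but the ordering is stable.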

Worked example: Harbor Analytics defect classifier

Harbor Analytics inspects printed circuit boards with a camera rig. The team trains a ResNet-style CNN to classify each crop as good, solder_bridge, or missing_component. Architecture: stem conv (7×7), four residual stages with ReLU, global average pool, linear head with softmax. Dataset: 240k labeled crops, stratified 80/10/10 split.

Baseline (failed): all conv and linear weights from Uniform(-0.5, 0.5), biases zero. Adam lr=1e-3, batch 64. After 15 epochs: train loss 0.89, val accuracy 52%, early-layer gradient norms < 1e-7.
Diagnosis: activation histograms after epoch 1 show stage-1 outputs clustered near zero; stage-4 outputs have variance 400× stage-1. Activation scale that drifts by orders of magnitude across depth is the classic signature of unscaled init.
Fix: apply kaiming_normal_ to all Conv2d and Linear layers before the head (mode='fan_in', nonlinearity='relu'); apply xavier_uniform_ to final classifier weights; set final bias to log([0.82, 0.12, 0.06]) matching training priors.
Result: val accuracy 78% at epoch 1 and 91% at epoch 8, with no change to the optimizer or learning rate schedule.
Production: export init recipe in training config YAML; unit test that first-batch gradient norms per stage fall within 0.5–2.0× of each other before long runs.
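The fix described above might look roughly like this in PyTorch. This is a sketch, not Harbor's actual code: the function name init_defect_classifier and the tiny stand-in model are hypothetical, and only the torch.nn.init calls named in the text are used.

```python
import math
import torch
from torch import nn

def init_defect_classifier(model, head, class_priors):
    """He normal for hidden conv/linear layers, Xavier uniform plus
    log-prior bias for the classifier head. BatchNorm layers are left alone."""
    for module in model.modules():
        if module is head:
            continue
        if isinstance(module, (nn.Conv2d, nn.Linear)):
            nn.init.kaiming_normal_(module.weight, mode="fan_in",
                                    nonlinearity="relu")
            if module.bias is not None:
                nn.init.zeros_(module.bias)
    nn.init.xavier_uniform_(head.weight)
    with torch.no_grad():
        head.bias.copy_(torch.log(torch.tensor(class_priors)))

# Hypothetical stand-in for the real ResNet-style network.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1), nn.ReLU(),
    nn.Conv2d(16, 32, kernel_size=3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(),
    nn.Linear(32, 3),
)
head = model[-1]
init_defect_classifier(model, head, [0.82, 0.12, 0.06])
print(torch.softmax(head.bias, dim=0))
```

Passing the head explicitly avoids guessing which Linear layer is the classifier. In a real model the same function would be called once after construction and before any pretrained weights are loaded.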

Initialization decision table

Layer / activation	Recommended init	Notes
Hidden dense / conv + ReLU	He normal or uniform	Default for ResNet, VGG-style stacks
Hidden + leaky ReLU (slope a)	He with adjusted variance	Var(w) = 2/((1+a²) × n_in)
Hidden + tanh / sigmoid	Xavier normal or uniform	Rare in modern hidden layers
Linear + GELU (transformer FFN)	Xavier or N(0, 0.02)	Match pretrained model recipe if fine-tuning
Binary output + sigmoid	Xavier; bias = log-odds prior	Helps imbalanced fraud / defect detection
Multi-class output + softmax	Xavier or small normal; bias = log priors	Keep initial logits near zero
Token / category embeddings	N(0, 0.02) or U(±0.04)	Do not use He on large vocab tables
Transfer learning backbone	Pretrained weights	Init head only; backbone already scaled

Common pitfalls

Assuming framework defaults are correct: PyTorch's nn.Linear and nn.Conv2d default to kaiming_uniform_ with a=√5, which works out to U(-1/√fan_in, +1/√fan_in), a variance of 1/(3 × fan_in) rather than the 2/fan_in that He init prescribes for ReLU; custom modules and ported code may use something else entirely.
Ignoring depth-wise variance: init preserves per-layer variance in expectation; residual connections and normalization change the effective depth signal.
Tuning learning rate before checking init: a 10× lr sweep cannot fix collapsed activations.
Mixing init schemes within the same block: one conv with manual uniform and neighbors with He can unbalance multi-branch architectures.
Orthogonal init everywhere: orthogonal matrices help RNNs; for CNNs, He/Xavier is simpler and usually sufficient.
Skipping init tests on new layer types: depthwise separable and grouped convolutions have a smaller effective fan-in than a dense convolution with the same channel count; use framework helpers that accept a fan_in mode explicitly, and confirm the fan-in they compute.
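The grouped-convolution point comes down to simple arithmetic: each output channel only sees the input channels in its own group. A minimal sketch, with a hypothetical helper name and illustrative channel counts:

```python
import math

def conv_fan_in(in_channels, kernel_size, groups=1):
    """Number of inputs feeding one output unit of a 2D convolution."""
    if in_channels % groups != 0:
        raise ValueError("in_channels must be divisible by groups")
    kernel_height, kernel_width = kernel_size
    return (in_channels // groups) * kernel_height * kernel_width

layers = {
    "dense 3x3": conv_fan_in(64, (3, 3)),
    "grouped 3x3 (groups=4)": conv_fan_in(64, (3, 3), groups=4),
    "depthwise 3x3 (groups=64)": conv_fan_in(64, (3, 3), groups=64),
}
for name, fan_in in layers.items():
    he_std = math.sqrt(2.0 / fan_in)
    print(f"{name:>26}: fan_in={fan_in:4d} he_std={he_std:.3f}")
```

A depthwise layer initialized with the dense layer's standard deviation would start with weights eight times too small (√(576 / 9) = 8), which is the kind of mismatch a first-batch activation check is meant to catch.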

Production checklist

Document init scheme per layer type in training config (not only in code).
After model construction, log activation mean/variance per stage on one batch before the first optimizer step.
Log gradient norm histograms per stage after the first backward pass.
Match init to activation: He for ReLU, Xavier for sigmoid/tanh outputs.
Set classifier bias to log class priors when labels are imbalanced.
For fine-tuning, init new head layers only; verify backbone weights loaded.
Add gradient clipping (see vanishing and exploding gradients) only if explosions persist after the init fix; clipping applied earlier masks bad init.
Re-run init checks when changing width, depth, or activation (e.g. ReLU to GELU).
Compare against a known-good baseline checkpoint on the same data slice.
Version init recipes alongside model architecture in the model registry.
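The two logging items in this checklist can be combined into one helper that runs a single forward and backward pass without taking an optimizer step. The sketch below is one possible shape for it in PyTorch; first_batch_report, the stage labels, and the toy model are all assumptions made for illustration.

```python
import torch
from torch import nn

def first_batch_report(model, stages, inputs, targets, loss_fn):
    """Collect per-stage activation (mean, variance) and gradient norms.

    stages maps a caller-chosen label to a submodule of model.
    No optimizer step is taken, so the weights are left as initialized.
    """
    activation_stats = {}
    handles = []
    for label, module in stages.items():
        def record(_module, _inputs, output, label=label):
            activation_stats[label] = (output.mean().item(),
                                       output.var().item())
        handles.append(module.register_forward_hook(record))

    model.zero_grad()
    loss = loss_fn(model(inputs), targets)
    loss.backward()
    for handle in handles:
        handle.remove()

    grad_norms = {}
    for label, module in stages.items():
        squared = sum(float(p.grad.pow(2).sum())
                      for p in module.parameters() if p.grad is not None)
        grad_norms[label] = squared ** 0.5
    return activation_stats, grad_norms

# Toy model and random batch standing in for real data.
torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(32, 64), nn.ReLU(),
    nn.Linear(64, 64), nn.ReLU(),
    nn.Linear(64, 3),
)
stages = {"stage1": model[0], "stage2": model[2], "head": model[4]}
inputs = torch.randn(16, 32)
targets = torch.randint(0, 3, (16,))
activation_stats, grad_norms = first_batch_report(
    model, stages, inputs, targets, nn.CrossEntropyLoss())
spread = max(grad_norms.values()) / min(grad_norms.values())
print(activation_stats, grad_norms, f"spread={spread:.2f}")
```

A unit test could then assert that spread stays inside whatever band the team has agreed on, such as the 0.5–2.0× range used in the worked example, and fail the run before any long training job starts.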

Key takeaways

Initialization sets the starting signal scale — before optimization, batch norm, or regularization kick in.
Variance must account for fan-in and activation — Xavier for symmetric activations, He for ReLU-family.
Bad init looks like a bad learning rate — flat accuracy and tiny early-layer gradients are red flags.
Output and embedding layers need their own rules — not the same as hidden conv stacks.
Measure once — first-batch activation and gradient stats save days of blind hyperparameter search.