Cross-entropy explained

Cross-entropy is the default training objective for classification in machine learning, from logistic regression through deep neural networks. It measures how much probability mass a model assigns to the correct class: confident wrong answers are punished sharply, while confident correct answers incur almost no loss. In production dashboards the same quantity is often called log loss. This guide builds intuition from information theory, walks through the binary and categorical formulas, explains why softmax pairs naturally with multiclass cross-entropy, and connects the loss to maximum likelihood. It then covers label smoothing and class weights, works through a Harbor Payments fraud-scorer example, and closes with a loss-selection decision table, common pitfalls, and a practitioner checklist. For the broader menu of objectives, see loss functions explained; for how gradients flow from this loss, see gradient descent explained.

Information-theory intuition

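Claude Shannon defined entropy H(P) as the average surprise when events are drawn from distribution P. The unit is bits when the logarithm is base 2 and nats when it is the natural logarithm, which is the convention used in the rest of this guide. For a discrete label y with true probability P(y):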

H(P) = − Σ P(y) log P(y)

Cross-entropy H(P, Q) measures surprise when reality follows P but you encode events using a model distribution Q:

H(P, Q) = − Σ P(y) log Q(y)

In supervised classification the true distribution is usually a one-hot vector — all mass on the correct class, zero elsewhere. Cross-entropy collapses to −log Q(y_true): the negative log probability the model assigned to the right answer. That is why a prediction of 0.01 on the true class costs −log(0.01) ≈ 4.6 nats, while 0.99 costs only −log(0.99) ≈ 0.01. The curve is steep near zero, which is exactly the gradient signal you want when the model is confidently wrong.
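As a minimal sketch of the two definitions above, the following Python computes entropy and cross-entropy with natural logarithms, so results are in nats. The function names and the three-class distributions are illustrative assumptions, not part of any library.

```python
import math

def entropy(p):
    """H(P) = -sum P(y) log P(y), in nats; zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(y) log Q(y), in nats; terms with P(y) = 0 are skipped."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical 3-class example: the true class is index 1 (one-hot target).
p_true = [0.0, 1.0, 0.0]
q_model = [0.2, 0.7, 0.1]

loss = cross_entropy(p_true, q_model)
# With a one-hot target the sum collapses to -log Q(y_true).
assert math.isclose(loss, -math.log(0.7))

print(round(-math.log(0.01), 2))  # 4.61
print(round(-math.log(0.99), 2))  # 0.01
```

The last two lines reproduce the costs quoted above for a true-class probability of 0.01 and 0.99.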

KL divergence connection

Cross-entropy decomposes as entropy plus Kullback-Leibler divergence: H(P, Q) = H(P) + D_KL(P || Q). Because H(P) depends only on the labels and not on the model (it is exactly zero for a one-hot label), minimizing cross-entropy is equivalent to minimizing KL divergence, which pushes Q toward P. That equivalence is why cross-entropy is the natural objective for maximum-likelihood estimation in classification.
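The decomposition can be checked numerically. This is a small sketch with hypothetical distributions P and Q; the helper names are assumptions made for illustration.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(y) log(P(y) / Q(y)), in nats."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical soft label distribution P and model distribution Q.
p = [0.1, 0.8, 0.1]
q = [0.3, 0.5, 0.2]

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
assert math.isclose(lhs, rhs)

# For a one-hot target H(P) = 0, so cross-entropy equals the KL divergence.
one_hot = [0.0, 1.0, 0.0]
assert math.isclose(cross_entropy(one_hot, q), kl_divergence(one_hot, q))
```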

Binary cross-entropy (log loss)

For a single example with true label y ∈ {0, 1} and predicted probability ŷ = P(y=1 | x):

L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]

This is the loss optimized by logistic regression when the link function is sigmoid. Properties that matter in practice:

Convex in ŷ for a fixed label — one global minimum per example.
Unbounded above — a predicted probability of 0 on a positive example yields infinite loss, which is why implementations clip probabilities (e.g. ε = 1e-7) for numerical stability.
Calibrated outputs — unlike hinge loss, cross-entropy encourages predicted probabilities to match empirical frequencies when the model is well-specified.

For a batch of N examples, the mean binary cross-entropy is the average log loss reported in scikit-learn's log_loss and many Kaggle leaderboards. Lower is better; a perfect classifier approaches 0.
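A plain-Python sketch of mean binary cross-entropy with probability clipping is shown here. The function name, the default ε, and the four-example batch are assumptions for illustration; library implementations may clip differently.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy (log loss) in nats over a batch."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # clip so log never sees 0 or 1
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]
probs = [0.9, 0.1, 0.6, 0.4]
print(round(binary_cross_entropy(labels, probs), 4))  # 0.3081

# Clipping keeps the loss finite even for a probability of 0 on a positive.
assert math.isfinite(binary_cross_entropy([1], [0.0]))
```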

From logits to probabilities

Neural networks usually output a raw logit z, which the sigmoid converts to a probability: ŷ = σ(z) = 1 / (1 + e^−z). Frameworks often fuse sigmoid and BCE into a numerically stable op such as binary_cross_entropy_with_logits. Prefer the fused form over applying sigmoid and then log by hand: when |z| is large the sigmoid saturates to exactly 0 or 1 in floating point, so the log returns −∞ and the gradients become infinite or NaN.
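To illustrate why the fused form is safer, here is a sketch of one common stable formulation, written in plain Python rather than any particular framework. It relies on the identity L = max(z, 0) − y·z + log(1 + e^−|z|); the function names are assumptions.

```python
import math

def bce_with_logits(y, z):
    """Binary cross-entropy computed directly from a logit z.

    Uses L = max(z, 0) - y*z + log(1 + exp(-|z|)), which never
    exponentiates a large positive number.
    """
    return max(z, 0.0) - y * z + math.log1p(math.exp(-abs(z)))

def bce_naive(y, z):
    """Sigmoid followed by log; breaks down once the sigmoid saturates."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

# Both forms agree for moderate logits.
assert math.isclose(bce_with_logits(1, 2.0), bce_naive(1, 2.0))

# For an extreme logit the fused form stays finite...
print(bce_with_logits(0, 40.0))  # 40.0

# ...while the naive form ends up evaluating log(0).
try:
    bce_naive(0, 40.0)
except ValueError:
    print("naive form hit log(0)")
```

In pure Python the failure shows up as an exception; in array frameworks the same situation typically surfaces as inf or NaN values instead.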

Categorical cross-entropy and softmax

When exactly one of K classes is correct (multiclass, mutually exclusive labels), let q_k be the model's predicted probability for class k and y be the one-hot true vector. Per-example loss:

L = − Σ_k y_k log q_k = −log q_c

where c is the correct class index. The standard way to produce q is softmax over logits z₁…z_K:

q_k = exp(z_k) / Σ_j exp(z_j)

Softmax guarantees probabilities sum to 1 and amplifies the largest logit, a natural match for single-label classification. As with the binary case, use a fused loss that takes logits directly (for example PyTorch's cross_entropy or TensorFlow's softmax_cross_entropy_with_logits) rather than a separate softmax followed by log, for stable backpropagation.
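The following sketch shows softmax and a logits-based categorical cross-entropy in plain Python. It assumes the usual max-subtraction and log-sum-exp tricks; the function names and example logits are hypothetical, not a framework API.

```python
import math

def softmax(logits):
    """Softmax with the max logit subtracted first to avoid overflow."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy_from_logits(logits, c):
    """-log q_c computed as logsumexp(z) - z_c, without forming q explicitly."""
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[c]

logits = [2.0, 1.0, 0.1]  # hypothetical logits for K = 3 classes
q = softmax(logits)
assert math.isclose(sum(q), 1.0)
assert math.isclose(cross_entropy_from_logits(logits, 0), -math.log(q[0]))

# Large logits would overflow a naive exp(z); the shifted form stays finite.
print(round(cross_entropy_from_logits([1000.0, 990.0], 1), 4))  # 10.0
```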

Multi-label vs multiclass

These are easy to confuse:

Multiclass (one-hot) — exactly one label per example (species, sentiment class). Use softmax + categorical cross-entropy.
Multi-label — multiple labels can be active (image tags, symptom checklist). Treat each class as an independent binary problem: sigmoid per class + binary cross-entropy summed or averaged. Do not use softmax here.

Sparse integer labels (class index 0…K−1) are equivalent to one-hot — frameworks accept an integer c and compute −log q_c directly.
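The contrast between the two setups, and the equivalence of sparse and one-hot labels, can be sketched as follows. The logits and targets are made up for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

logits = [2.0, 1.5, -3.0]  # hypothetical scores for three tags or classes

# Multi-label: independent sigmoids, so two tags can both be likely.
tag_probs = [sigmoid(z) for z in logits]
tag_targets = [1, 1, 0]
multi_label_loss = sum(
    -(t * math.log(p) + (1 - t) * math.log(1 - p))
    for t, p in zip(tag_targets, tag_probs)
) / len(logits)

# Multiclass: softmax forces the classes to compete for probability mass.
class_probs = softmax(logits)
one_hot = [1, 0, 0]
dense_loss = -sum(t * math.log(p) for t, p in zip(one_hot, class_probs))
sparse_loss = -math.log(class_probs[0])  # integer label c = 0

assert math.isclose(dense_loss, sparse_loss)
assert sum(tag_probs) > 1.0                 # sigmoids need not sum to 1
assert math.isclose(sum(class_probs), 1.0)  # softmax always does
```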

Maximum likelihood and calibration

Minimizing cross-entropy over a dataset is equivalent to maximizing the log-likelihood of labels under the model's parametric family. That connection explains two deployment facts:

Training loss ≠ business metric — you optimize log loss, but stakeholders care about precision, recall, or F1. A model can improve log loss while F1 flatlines if the scores become sharper or better calibrated without moving any example across the decision threshold. Always evaluate both.
Probabilities should be trusted only if calibrated — raw softmax outputs are often overconfident. Check reliability diagrams and consider calibration (temperature scaling, Platt scaling) before using scores for automated decisions.
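As a rough sketch of temperature scaling, the snippet below divides logits by a scalar T and picks T from a small grid by minimizing cross-entropy on held-out data. The validation logits, labels, and grid are invented for illustration, and practical implementations usually optimize T with a gradient-based routine rather than a grid.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    s = sum(exps)
    return [e / s for e in exps]

def mean_nll(batch_logits, labels, temperature):
    """Mean cross-entropy of integer labels at a given temperature."""
    losses = [-math.log(softmax(z, temperature)[c])
              for z, c in zip(batch_logits, labels)]
    return sum(losses) / len(losses)

# Hypothetical held-out logits: confident everywhere, wrong on the last row.
val_logits = [[4.0, 0.0], [5.0, 0.0], [0.0, 6.0], [6.0, 0.0]]
val_labels = [0, 0, 1, 1]

# Pick T from a small grid by minimizing held-out cross-entropy.
grid = [0.5, 1.0, 2.0, 4.0, 8.0]
best_t = min(grid, key=lambda t: mean_nll(val_logits, val_labels, t))

# An overconfident model is softened: the chosen T is above 1.
assert best_t > 1.0
assert mean_nll(val_logits, val_labels, best_t) < mean_nll(val_logits, val_labels, 1.0)
```

Dividing by T does not change which class has the highest score, so accuracy is unaffected; only the confidence of the probabilities moves.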

On imbalanced data, unweighted cross-entropy favors the majority class because reducing loss on frequent negatives dominates the gradient. Remedies — class weights, focal loss, resampling — are covered in class imbalance explained and the broader loss functions guide.

Label smoothing and regularization

Label smoothing replaces a hard one-hot target with a softened distribution: the true class gets probability 1 − α, and the remaining α is split evenly across the other classes, α / (K − 1) each. Cross-entropy against soft targets discourages the model from pushing logits toward extreme values, which can improve generalization and calibration in deep nets, especially vision classifiers trained with heavy augmentation.

Typical α is 0.05–0.1. Too much smoothing hurts accuracy on clean datasets because the model is penalized for being confident even when confidence is warranted. Smoothing is less common on highly imbalanced fraud or medical tasks where missing a rare positive is costly.
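A minimal sketch of the smoothing scheme described above (1 − α on the true class, α / (K − 1) on each other class) follows. The helper names and example predictions are assumptions; note that some libraries instead spread α uniformly over all K classes, so check the convention before comparing numbers.

```python
import math

def smooth_labels(c, num_classes, alpha=0.1):
    """Soft target: 1 - alpha on class c, alpha / (K - 1) on each other class."""
    off = alpha / (num_classes - 1)
    return [1.0 - alpha if k == c else off for k in range(num_classes)]

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

target = smooth_labels(c=0, num_classes=3, alpha=0.1)
assert math.isclose(sum(target), 1.0)

confident = [0.998, 0.001, 0.001]  # hypothetical near-one-hot prediction
moderate = [0.90, 0.05, 0.05]      # prediction that matches the soft target

# Against the smoothed target, extreme confidence is no longer the optimum.
assert cross_entropy(target, moderate) < cross_entropy(target, confident)
```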

Worked example: Harbor Payments fraud scorer

Harbor Payments trains a gradient-boosted baseline and a small neural net to score card transactions as legitimate (0) or fraudulent (1). The neural head outputs a single logit; training minimizes mean binary cross-entropy on 2.4M labeled rows (0.18% positive rate).

Step 1 — Class weights. Without weights, the model achieves 99.8% accuracy by predicting "legitimate" everywhere while log loss stays poor and fraud recall is zero. Harbor applies weight 550× on positives so cross-entropy gradients reflect the business cost of missed fraud.
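For context on the size of that weight, the sketch below computes the negative-to-positive ratio from the row count and positive rate quoted above; it comes out near 555, close to the 550 used here. Whether Harbor derived its weight this way is an assumption, as are the function and variable names.

```python
import math

positive_rate = 0.0018  # 0.18% fraud, from the example above
n_rows = 2_400_000

n_pos = round(n_rows * positive_rate)
n_neg = n_rows - n_pos
inverse_frequency_weight = n_neg / n_pos
print(n_pos, round(inverse_frequency_weight, 1))  # 4320 554.6

def weighted_bce_with_logits(y, z, pos_weight):
    """Binary cross-entropy from a logit, with positives scaled by pos_weight."""
    softplus_neg = max(-z, 0.0) + math.log1p(math.exp(-abs(z)))  # -log sigmoid(z)
    softplus_pos = max(z, 0.0) + math.log1p(math.exp(-abs(z)))   # -log(1 - sigmoid(z))
    return pos_weight * y * softplus_neg + (1 - y) * softplus_pos

# At z = 0 an unweighted example costs log(2); a positive costs 550 times that.
assert math.isclose(weighted_bce_with_logits(1, 0.0, 550.0), 550.0 * math.log(2))
assert math.isclose(weighted_bce_with_logits(0, 0.0, 550.0), math.log(2))
```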

Step 2 — Log loss monitoring. Validation log loss drops from 0.042 to 0.011 over 12 epochs while AUC rises from 0.91 to 0.96. The team tracks both — a sudden log-loss spike with stable AUC often signals data pipeline drift, not model regression.

Step 3 — Threshold vs loss. Production blocks transactions with ŷ > 0.35, a threshold chosen on a precision-recall curve rather than defaulting to 0.5. Cross-entropy training yields a good ranking of transactions; the operating point is a separate decision. After temperature scaling on a holdout month, a predicted 0.35 maps to roughly a 32% empirical fraud rate.
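Choosing an operating point can be sketched by tabulating precision and recall at a few candidate thresholds. The labels and scores below are invented and are not Harbor's data; the helper name is an assumption, and the comparison uses a strict ŷ > threshold rule as in the step above.

```python
def precision_recall_at(threshold, y_true, scores):
    """Precision and recall when blocking every score above the threshold."""
    tp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s > threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s <= threshold and y == 1)
    precision = tp / (tp + fp) if tp + fp else 1.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Hypothetical holdout labels and calibrated scores.
y_true = [0, 0, 0, 0, 1, 0, 1, 1, 0, 1]
scores = [0.02, 0.05, 0.10, 0.30, 0.33, 0.40, 0.45, 0.70, 0.20, 0.90]

for t in (0.25, 0.35, 0.50):
    p, r = precision_recall_at(t, y_true, scores)
    print(f"threshold={t:.2f} precision={p:.2f} recall={r:.2f}")
```

Raising the threshold trades recall for precision; which trade is acceptable depends on the relative cost of a blocked legitimate transaction and a missed fraud.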

Step 4 — Multiclass extension. A follow-on model classifies fraud type (stolen card, account takeover, merchant collusion) with softmax + categorical cross-entropy on confirmed fraud only — three mutually exclusive classes, separate from the binary gate.

Loss selection decision table

Problem shape	Output activation	Loss	Notes
Binary classification	Sigmoid (or logits + fused BCE)	Binary cross-entropy	Default; add class weights if imbalanced
Multiclass, one label	Softmax (or logits + fused CE)	Categorical cross-entropy	K mutually exclusive classes
Multi-label	Sigmoid per class	Sum/mean of BCE per label	Not softmax — labels overlap
Ordinal (ordered classes)	Softmax or cumulative link	CE or ordinal regression loss	Plain CE ignores order structure
Hard-class / long-tail	Softmax	Focal loss or class weights	See loss functions guide
Need margin, not probability	Linear scores	Hinge / squared hinge	SVM-style; scores less calibrated

Common pitfalls

Softmax on multi-label tasks — forces probabilities to sum to 1 across independent tags; use sigmoid + BCE instead.
Log of zero — always clip probabilities or use a fused logits loss; a raw log(0) evaluates to −∞, which makes the loss infinite and the gradients NaN.
Reporting accuracy on imbalanced data — 99% accuracy can mean zero fraud caught; pair cross-entropy training with ROC-AUC or PR-AUC evaluation.
Assuming 0.5 threshold — optimal threshold minimizes business cost, not cross-entropy.
Mixing train and eval reductions — some APIs default to sum; compare mean log loss across runs for consistency.
Ignoring label noise — mislabeled rows dominate CE gradients; audit labels when loss plateaus early.
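The reduction pitfall is easy to demonstrate. In this sketch, two hypothetical evaluation runs have identical per-example quality but different sizes; the mean is comparable across them while the sum is not.

```python
import math

def per_example_losses(y_true, y_prob):
    return [-(y * math.log(p) + (1 - y) * math.log(1 - p))
            for y, p in zip(y_true, y_prob)]

# Hypothetical runs with identical per-example quality but different sizes.
small = per_example_losses([1, 0], [0.8, 0.2])
large = per_example_losses([1, 0] * 4, [0.8, 0.2] * 4)

# Mean reduction is comparable across runs.
assert math.isclose(sum(small) / len(small), sum(large) / len(large))
# Sum reduction scales with the number of examples.
assert math.isclose(sum(large), 4 * sum(small))
```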

Practitioner checklist

Confirm label type: multiclass (softmax + CE) vs multi-label (sigmoid + BCE).
Use framework fused logits losses for numerical stability.
Apply class weights or focal loss when positives are rare; verify recall, not just log loss.
Track validation log loss every epoch alongside your primary evaluation metric (AUC, F1).
Calibrate probabilities before using scores for automated dollar decisions.
Choose deployment threshold on a holdout set — separate from loss minimization.
Consider label smoothing (α ≈ 0.1) for large multiclass vision models; skip for rare-event detection.
Document whether reported loss is mean or sum so experiments stay comparable.

Key takeaways

Cross-entropy is −log of predicted probability on the true class — steep punishment for confident mistakes.
Binary CE pairs with sigmoid; categorical CE pairs with softmax for mutually exclusive classes.
Minimizing cross-entropy is equivalent to maximum-likelihood estimation under the model's distribution.
Optimize log loss during training; choose thresholds and business metrics at deployment.
On imbalanced problems, raw cross-entropy is rarely enough — weight, resample, or change the loss.