Cross-entropy explained
Cross-entropy is the default training objective for classification in machine learning, from logistic regression through deep neural networks. It measures how much probability mass a model assigns to the correct class: confident wrong answers are punished sharply, while confident correct answers incur almost no loss. In production dashboards the same quantity often appears as log loss. This guide builds intuition from information theory, walks through the binary and categorical formulas, explains why softmax pairs naturally with multiclass cross-entropy, and connects the loss to maximum likelihood. It then covers label smoothing and class weights, a worked example (the Harbor Payments fraud scorer), a loss-selection decision table, common pitfalls, and a practitioner checklist. For the broader menu of objectives, see loss functions explained; for how gradients flow from this loss, see gradient descent explained.
Information-theory intuition
Claude Shannon defined entropy H(P) as the average surprise when events are drawn from distribution P, measured in bits with a base-2 logarithm or in nats with the natural logarithm (this guide uses nats). For a discrete label y with true probability P(y):
H(P) = − Σ P(y) log P(y)
Cross-entropy H(P, Q) measures surprise when reality follows P but you encode events using a model distribution Q:
H(P, Q) = − Σ P(y) log Q(y)
In supervised classification the true distribution is usually a one-hot vector —
all mass on the correct class, zero elsewhere. Cross-entropy collapses to
−log Q(y_true): the negative log probability the model
assigned to the right answer. That is why a prediction of 0.01 on the true
class costs −log(0.01) ≈ 4.6 nats, while 0.99 costs only
−log(0.99) ≈ 0.01. The curve is steep near zero, which is exactly
the gradient signal you want when the model is confidently wrong.
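The definitions above can be checked numerically. The following is a minimal sketch in plain Python using natural logs (nats); the function names and the example distributions are illustrative assumptions, not part of any library.

```python
import math

def entropy(p):
    """H(P) = -sum P(y) log P(y) in nats; zero-probability terms contribute 0."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) = -sum P(y) log Q(y) in nats."""
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

# One-hot truth: all mass on class 0 (illustrative values).
p_true = [1.0, 0.0, 0.0]
q_wrong = [0.01, 0.59, 0.40]    # model gives the true class only 1%
q_right = [0.99, 0.005, 0.005]  # model gives the true class 99%

loss_wrong = cross_entropy(p_true, q_wrong)  # reduces to -log(0.01)
loss_right = cross_entropy(p_true, q_right)  # reduces to -log(0.99)
print(round(loss_wrong, 2), round(loss_right, 2))  # prints: 4.61 0.01
```

With a one-hot target only the true-class term survives the sum, so the two printed values are exactly the −log figures quoted in the text.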
KL divergence connection
Cross-entropy decomposes as entropy plus Kullback-Leibler divergence:
H(P, Q) = H(P) + D_KL(P || Q). Because H(P) depends only on the labels
and not on the model (it is zero for one-hot targets), minimizing cross-entropy
is equivalent to minimizing KL divergence, which pushes Q toward P. That
equivalence is why cross-entropy is the natural objective for
maximum-likelihood estimation in classification.
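As a quick sanity check of the decomposition, the sketch below compares both sides for one pair of distributions. The values of p and q are made up for illustration, and the helper names are assumptions.

```python
import math

def entropy(p):
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q) if pi > 0)

def kl_divergence(p, q):
    """D_KL(P || Q) = sum P(y) log(P(y) / Q(y))."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

p = [0.7, 0.2, 0.1]  # illustrative "true" distribution
q = [0.5, 0.3, 0.2]  # illustrative model distribution

lhs = cross_entropy(p, q)
rhs = entropy(p) + kl_divergence(p, q)
print(math.isclose(lhs, rhs))  # the two sides agree up to rounding
```

Since entropy(p) does not change when q changes, any q that lowers the cross-entropy lowers the KL term by the same amount.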
Binary cross-entropy (log loss)
For a single example with true label y ∈ {0, 1} and predicted probability
ŷ = P(y=1 | x):
L = −[y log(ŷ) + (1 − y) log(1 − ŷ)]
This is the loss that logistic regression minimizes; the sigmoid is the inverse of its logit link function. Properties that matter in practice:
- Convex in ŷ for a fixed label: one global minimum per example.
- Unbounded above: a predicted probability of 0 on a positive example yields infinite loss, which is why implementations clip probabilities (e.g. ε = 1e-7) for numerical stability.
- Calibrated outputs: unlike hinge loss, cross-entropy encourages predicted probabilities to match empirical frequencies when the model is well-specified.
For a batch of N examples, the mean binary cross-entropy is the average log loss
reported in scikit-learn's log_loss and many Kaggle leaderboards.
Lower is better; a perfect classifier approaches 0.
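A minimal sketch of mean binary cross-entropy with clipping follows. It uses the ε = 1e-7 value mentioned above; the function name and sample batch are assumptions, and library implementations may clip or average differently.

```python
import math

EPS = 1e-7  # clipping constant taken from the text

def binary_cross_entropy(y_true, y_prob, eps=EPS):
    """Mean of -[y log(p) + (1 - y) log(1 - p)] over a batch, with clipping."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)  # keep p strictly inside (0, 1)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

labels = [1, 0, 1, 0]          # illustrative batch
probs = [0.9, 0.1, 0.6, 0.2]
batch_loss = binary_cross_entropy(labels, probs)

# A hard 0.0 on a positive example stays finite thanks to clipping:
# the loss is capped at -log(1e-7), roughly 16.1 nats.
worst_case = binary_cross_entropy([1], [0.0])
print(batch_loss, worst_case)
```

Without the clipping line, the second call would hit log(0), which is the unbounded case described in the list above.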
From logits to probabilities
Neural networks usually output a raw logit z; sigmoid converts it:
ŷ = σ(z) = 1 / (1 + e^(−z)). Frameworks often fuse sigmoid
and BCE into a numerically stable binary_cross_entropy_with_logits
op. Prefer that fused form over a manual sigmoid followed by log: when |z| is
large, the sigmoid saturates to exactly 0 or 1 in floating point, so the log
returns −∞ and the gradients become NaN or vanish.
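To illustrate why the fused form matters, here is a sketch of one common stable formulation, max(z, 0) − z·y + log(1 + e^(−|z|)), next to the naive sigmoid-then-log path. Both function names are hypothetical, and this is not a claim about how any particular framework implements its op.

```python
import math

def bce_naive(z, y):
    """Sigmoid, then log. Fails once the sigmoid saturates in floating point."""
    p = 1.0 / (1.0 + math.exp(-z))
    return -(y * math.log(p) + (1 - y) * math.log(1.0 - p))

def bce_with_logits(z, y):
    """Algebraically identical loss computed directly from the logit z."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

# Moderate logit: both paths agree.
print(math.isclose(bce_naive(0.3, 1), bce_with_logits(0.3, 1)))

# Large logit with the wrong label: sigmoid(40) rounds to exactly 1.0,
# so the naive path evaluates log(0) and Python raises ValueError.
try:
    bce_naive(40.0, 0)
except ValueError as err:
    print("naive path failed:", err)

stable_loss = bce_with_logits(40.0, 0)  # finite, very close to 40
print(stable_loss)
```

In array libraries the naive path typically yields inf or NaN instead of raising, but the underlying problem is the same.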
Categorical cross-entropy and softmax
When exactly one of K classes is correct (multiclass, mutually exclusive labels),
let q_k be the model's predicted probability for class k
and y be the one-hot true vector. Per-example loss:
L = − Σ_k y_k log q_k = −log q_c
where c is the correct class index. The standard way to produce q is softmax over logits z_1…z_K:
q_k = exp(z_k) / Σ_j exp(z_j)
Softmax guarantees that the probabilities sum to 1 and amplifies the largest
logit, a natural match for single-label classification. As with the binary
case, use a fused loss that takes logits directly (for example PyTorch's
cross_entropy or TensorFlow's softmax_cross_entropy_with_logits) rather than a
separate softmax followed by log, for stable backpropagation.
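The stability argument can be sketched with the log-sum-exp trick: subtract the largest logit before exponentiating, and compute log-probabilities directly. The helper names below are assumptions for illustration, not a framework API.

```python
import math

def log_softmax(logits):
    """log q_k = z_k - logsumexp(z), shifted by max(z) to avoid overflow."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def cross_entropy_from_logits(logits, class_index):
    """Per-example categorical cross-entropy: -log q_c."""
    return -log_softmax(logits)[class_index]

logits = [2.0, 1.0, 0.1]  # illustrative logits for K = 3 classes
probs = [math.exp(lp) for lp in log_softmax(logits)]
loss = cross_entropy_from_logits(logits, 0)
print(sum(probs), loss)

# Extreme logits would overflow a naive exp(1000); the shifted form is finite.
extreme_loss = cross_entropy_from_logits([1000.0, 0.0, -1000.0], 1)
print(extreme_loss)  # prints: 1000.0
```

The shift by max(z) leaves the softmax output unchanged because the same constant cancels in the numerator and denominator.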
Multi-label vs multiclass
These are easy to confuse:
- Multiclass (one-hot) — exactly one label per example (species, sentiment class). Use softmax + categorical cross-entropy.
- Multi-label — multiple labels can be active (image tags, symptom checklist). Treat each class as an independent binary problem: sigmoid per class + binary cross-entropy summed or averaged. Do not use softmax here.
Sparse integer labels (class index 0…K−1) are equivalent to one-hot — frameworks
accept an integer c and compute −log q_c directly.
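The distinction can be made concrete with a small sketch: per-label sigmoid plus BCE for multi-label targets, and softmax cross-entropy with an integer class index for multiclass targets. All names and numbers here are illustrative assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(z, y):
    """Stable binary cross-entropy computed from a single logit."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def multilabel_loss(logits, targets):
    """Mean of independent per-label BCE terms; targets is a 0/1 vector."""
    return sum(bce_with_logits(z, y) for z, y in zip(logits, targets)) / len(logits)

def multiclass_loss(logits, class_index):
    """Softmax cross-entropy with a sparse integer label: -log q_c."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return lse - logits[class_index]

# Multi-label: two tags are active at once, so probabilities need not sum to 1.
tag_logits = [3.0, 2.5, -4.0]
tag_targets = [1, 1, 0]
tag_probs = [sigmoid(z) for z in tag_logits]
print(sum(tag_probs), multilabel_loss(tag_logits, tag_targets))

# Multiclass: integer label 0 gives the same loss as one-hot [1, 0, 0].
one_hot = [1, 0, 0]
sparse = multiclass_loss(tag_logits, 0)
dense = sum(y * multiclass_loss(tag_logits, k) for k, y in enumerate(one_hot))
print(math.isclose(sparse, dense))
```

Softmax over the same tag logits would force the two active tags to compete for probability mass, which is why it is the wrong choice for overlapping labels.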
Maximum likelihood and calibration
Minimizing cross-entropy over a dataset is equivalent to maximizing the log-likelihood of labels under the model's parametric family. That connection explains two deployment facts:
- Training loss ≠ business metric: you optimize log loss, but stakeholders care about precision, recall, or F1. A model can improve log loss while F1 flatlines if scores shift without moving any example across the decision threshold. Always evaluate both.
- Probabilities should be trusted only if calibrated: raw softmax outputs are often overconfident. Check reliability diagrams and consider calibration (temperature scaling, Platt scaling) before using scores for automated decisions.
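Temperature scaling, mentioned in the list above, can be sketched for the binary case as dividing each logit by a scalar T fitted on held-out data. The version below uses a coarse grid search instead of a gradient-based fit, and the validation logits and labels are made-up toy values, so treat it as an illustration under those assumptions.

```python
import math

def bce_with_logits(z, y):
    """Stable binary cross-entropy computed from a single logit."""
    return max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))

def mean_nll(logits, labels, temperature):
    """Mean log loss after dividing every logit by the temperature."""
    return sum(bce_with_logits(z / temperature, y)
               for z, y in zip(logits, labels)) / len(labels)

def fit_temperature(logits, labels, grid=None):
    """Pick the temperature on a grid that minimizes held-out log loss."""
    if grid is None:
        grid = [0.5 + 0.05 * i for i in range(91)]  # 0.5 to 5.0
    return min(grid, key=lambda t: mean_nll(logits, labels, t))

# Toy overconfident model: large logits, yet two of eight predictions are wrong.
val_logits = [6.0, 5.0, -6.0, -5.0, 4.0, -4.0, 5.5, -5.5]
val_labels = [1, 1, 0, 0, 0, 1, 1, 0]

best_t = fit_temperature(val_logits, val_labels)
print(best_t, mean_nll(val_logits, val_labels, 1.0),
      mean_nll(val_logits, val_labels, best_t))
```

A fitted temperature above 1 softens the scores without changing their ranking, so AUC is unaffected while log loss on the held-out set drops.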
On imbalanced data, unweighted cross-entropy favors the majority class because reducing loss on frequent negatives dominates the gradient. Remedies — class weights, focal loss, resampling — are covered in class imbalance explained and the broader loss functions guide.
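One of the remedies named above, class weighting, can be sketched as multiplying the positive-class term by a weight. The function name, the pos_weight parameter, and the choice to divide by the batch size (rather than by the sum of weights) are assumptions; libraries differ on the normalization.

```python
import math

def weighted_bce(y_true, y_prob, pos_weight=1.0, eps=1e-7):
    """Mean BCE where the positive-class term is scaled by pos_weight."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(pos_weight * y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Illustrative batch: 1 positive among 10, and a model that scores everything 0.05.
labels = [1] + [0] * 9
probs = [0.05] * 10

unweighted = weighted_bce(labels, probs)
weighted = weighted_bce(labels, probs, pos_weight=9.0)
print(unweighted, weighted)
```

In the unweighted case the single missed positive contributes most of a small loss; with the weight applied, that miss dominates and the optimizer has a much stronger reason to raise scores on positives.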
Label smoothing and regularization
Label smoothing replaces a hard one-hot target with a softened
distribution: true class gets probability 1 − α, other classes share
α / (K − 1). Cross-entropy against soft targets discourages the model
from pushing logits toward extreme values, which can improve generalization and
calibration in deep nets — especially vision classifiers trained with heavy
augmentation.
Typical α is 0.05–0.1. Too much smoothing hurts accuracy on clean datasets because the model is penalized for being confident even when confidence is warranted. Smoothing is less common on highly imbalanced fraud or medical tasks where missing a rare positive is costly.
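The smoothing scheme described above (1 − α on the true class, α / (K − 1) on each other class) can be sketched as follows. Function names and logits are illustrative assumptions; note that some libraries use a slightly different convention that spreads α over all K classes.

```python
import math

def smooth_targets(class_index, num_classes, alpha):
    """True class gets 1 - alpha; the others share alpha / (K - 1)."""
    off_value = alpha / (num_classes - 1)
    return [1.0 - alpha if k == class_index else off_value
            for k in range(num_classes)]

def log_softmax(logits):
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def soft_cross_entropy(logits, targets):
    """Cross-entropy against a soft target distribution."""
    return -sum(t * lp for t, lp in zip(targets, log_softmax(logits)))

targets = smooth_targets(class_index=0, num_classes=4, alpha=0.1)

moderate = soft_cross_entropy([3.0, 0.0, 0.0, 0.0], targets)
extreme = soft_cross_entropy([30.0, 0.0, 0.0, 0.0], targets)
print(targets, moderate, extreme)
```

Against a hard one-hot target the extreme logits would score a lower loss than the moderate ones; with smoothed targets they score a higher loss, which is the pressure away from extreme logits that the text describes.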
Worked example: Harbor Payments fraud scorer
Harbor Payments trains a gradient-boosted baseline and a small neural net to score card transactions as legitimate (0) or fraudulent (1). The neural head outputs a single logit; training minimizes mean binary cross-entropy on 2.4M labeled rows (0.18% positive rate).
Step 1: Class weights. Without weights, the model can reach 99.8% accuracy by predicting "legitimate" everywhere, yet fraud recall is zero and log loss stays poor. Harbor applies weight 550× on positives so cross-entropy gradients reflect the business cost of missed fraud.
Step 2: Log loss monitoring. Validation log loss drops from 0.042 to 0.011 over 12 epochs while AUC rises from 0.91 to 0.96. The team tracks both: a sudden log-loss spike with stable AUC often signals data pipeline drift, not model regression.
Step 3: Threshold vs loss. Production blocks transactions with ŷ > 0.35, chosen on a precision-recall curve, not at 0.5. Cross-entropy training produces the scores and their ranking; the operating point is a separate decision. After temperature scaling on a holdout month, a predicted 0.35 maps to a roughly 32% empirical fraud rate.
Step 4: Multiclass extension. A follow-on model classifies fraud type (stolen card, account takeover, merchant collusion) with softmax + categorical cross-entropy on confirmed fraud only. These are three mutually exclusive classes, separate from the binary gate.
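The arithmetic behind the worked example can be reproduced from the stated row count and positive rate. The sketch below derives the all-legitimate accuracy and the inverse-frequency ratio; the guide does not say how the 550× weight was chosen, so its closeness to that ratio is an observation rather than a documented rule. The block helper is hypothetical.

```python
ROWS = 2_400_000        # labeled rows, from the example
POSITIVE_RATE = 0.0018  # 0.18% fraud
THRESHOLD = 0.35        # production blocking threshold, from the example

positives = round(ROWS * POSITIVE_RATE)
negatives = ROWS - positives

# Accuracy of a model that predicts "legitimate" for every row.
all_legit_accuracy = negatives / ROWS

# Negatives per positive: one common starting point for a positive-class weight.
inverse_frequency_weight = negatives / positives

def block(score, threshold=THRESHOLD):
    """Hypothetical gate: block when the calibrated score exceeds the threshold."""
    return score > threshold

print(positives, all_legit_accuracy, round(inverse_frequency_weight, 1))
print(block(0.40), block(0.20))
```

The roughly 4,300 positives out of 2.4M rows explain both the misleading 99.8% accuracy and why a weight in the hundreds is needed before positives influence the gradient.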
Loss selection decision table
| Problem shape | Output activation | Loss | Notes |
|---|---|---|---|
| Binary classification | Sigmoid (or logits + fused BCE) | Binary cross-entropy | Default; add class weights if imbalanced |
| Multiclass, one label | Softmax (or logits + fused CE) | Categorical cross-entropy | K mutually exclusive classes |
| Multi-label | Sigmoid per class | Sum/mean of BCE per label | Not softmax — labels overlap |
| Ordinal (ordered classes) | Softmax or cumulative link | CE or ordinal regression loss | Plain CE ignores order structure |
| Hard-class / long-tail | Softmax | Focal loss or class weights | See loss functions guide |
| Need margin, not probability | Linear scores | Hinge / squared hinge | SVM-style; scores less calibrated |
Common pitfalls
- Softmax on multi-label tasks: forces probabilities to sum to 1 across independent tags; use sigmoid + BCE instead.
- Log of zero: always clip probabilities or use a fused logits loss; a raw log(0) evaluates to −∞ and produces NaN gradients.
- Reporting accuracy on imbalanced data: 99% accuracy can mean zero fraud caught; pair cross-entropy training with ROC-AUC or PR-AUC evaluation.
- Assuming a 0.5 threshold: the optimal threshold minimizes business cost, not cross-entropy.
- Mixing train and eval reductions: some APIs default to sum; compare mean log loss across runs for consistency.
- Ignoring label noise: confidently mislabeled rows can dominate CE gradients; audit labels when loss plateaus early.
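The reduction pitfall in the list above is easy to demonstrate: a summed loss grows with batch size while a mean loss does not. The helper names and data below are illustrative assumptions.

```python
import math

def bce_terms(y_true, y_prob, eps=1e-7):
    """Per-example binary cross-entropy values, before any reduction."""
    terms = []
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        terms.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return terms

def reduce_loss(terms, reduction="mean"):
    if reduction == "mean":
        return sum(terms) / len(terms)
    if reduction == "sum":
        return sum(terms)
    raise ValueError(f"unknown reduction: {reduction}")

# Same per-example quality, different batch sizes.
small = bce_terms([1, 0], [0.8, 0.2])
large = bce_terms([1, 0] * 50, [0.8, 0.2] * 50)

print(reduce_loss(small, "mean"), reduce_loss(large, "mean"))  # equal
print(reduce_loss(small, "sum"), reduce_loss(large, "sum"))    # 50x apart
```

Comparing a summed training loss with a mean validation loss, or runs with different batch sizes under sum reduction, would suggest a difference in model quality that is not there.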
Practitioner checklist
- Confirm label type: multiclass (softmax + CE) vs multi-label (sigmoid + BCE).
- Use framework fused logits losses for numerical stability.
- Apply class weights or focal loss when positives are rare; verify recall, not just log loss.
- Track validation log loss every epoch alongside your primary ranking metric (AUC, F1).
- Calibrate probabilities before using scores for automated dollar decisions.
- Choose deployment threshold on a holdout set — separate from loss minimization.
- Consider label smoothing (α ≈ 0.1) for large multiclass vision models; skip for rare-event detection.
- Document whether reported loss is mean or sum so experiments stay comparable.
Key takeaways
- Cross-entropy is −log of the predicted probability on the true class: steep punishment for confident mistakes.
- Binary CE pairs with sigmoid; categorical CE pairs with softmax for mutually exclusive classes.
- Minimizing cross-entropy equals maximum likelihood under the model's distribution.
- Optimize log loss during training; choose thresholds and business metrics at deployment.
- On imbalanced problems, raw cross-entropy is rarely enough — weight, resample, or change the loss.
Related reading
- Loss functions explained — MSE, focal loss, and objective selection
- Logistic regression explained — sigmoid and odds ratios
- Activation functions explained — softmax and sigmoid behavior
- Class imbalance explained — weights and sampling with CE