Guide
Information theory explained
Your fraud model reports 94% accuracy but still misses the expensive chargebacks. Your language model's perplexity dropped, yet users say answers feel repetitive. Behind both symptoms sits the same toolkit: information theory — a precise language for surprise, compression, and how much one random variable tells you about another. Claude Shannon formalized it in 1948; today it underpins classification losses, decision-tree splits, active-learning query strategies, variational autoencoders, and LLM evaluation. This guide explains entropy, cross-entropy, Kullback-Leibler (KL) divergence, and mutual information with plain math, a Harbor Payments fraud-scorer worked example, a metric decision table, common pitfalls, and a practitioner checklist. It complements our cross-entropy guide, loss functions overview, and active learning guide.
What information theory measures
Information theory answers one question in many disguises: how surprising is an outcome? A fair coin flip carries one bit of surprise per flip — you cannot predict heads or tails. A loaded coin that lands heads 99% of the time carries almost zero surprise when it lands heads, but roughly 6.6 bits when it lands tails (because tails is rare).
Shannon expressed surprise as self-information for an event x with probability p(x):
I(x) = −log₂ p(x) bits
Rare events are informative; common events carry little news. The base matters: natural log (nats) appears in PyTorch losses; base-2 (bits) appears in communication engineering; base-10 (hartleys) appears in some older textbooks. The formulas differ only by a constant factor, so always check which log your library uses.
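As a minimal sketch of the self-information formula, the snippet below evaluates the fair-coin and loaded-coin cases and shows the constant-factor conversion between bits and nats. The function name `self_information` is illustrative and does not come from any library.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Surprise of an event with probability p, in units set by the log base."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return -math.log(p) / math.log(base)

bits_fair = self_information(0.5)             # fair coin flip
bits_tails = self_information(0.01)           # rare tails on the loaded coin
bits_heads = self_information(0.99)           # common heads on the loaded coin
nats_tails = self_information(0.01, math.e)   # same rare event, measured in nats

# Changing the base only rescales the result: nats = bits * ln(2).
assert math.isclose(nats_tails, bits_tails * math.log(2))
print(f"{bits_fair:.3f} {bits_tails:.3f} {bits_heads:.4f} {nats_tails:.3f}")
```

The rare outcome dominates the surprise even though it almost never happens, which is the intuition behind every quantity that follows.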
Core quantities
- Entropy H(X) — average surprise of random variable X; lower bound on the average bits per symbol of any lossless code.
- Cross-entropy H(P, Q) — average surprise when reality follows P but you encode with model Q.
- KL divergence D_KL(P || Q) — extra bits wasted by using Q instead of the true P; always ≥ 0.
- Mutual information I(X; Y) — reduction in uncertainty about Y after observing X; measures dependency.
- Perplexity — 2^H (with H in bits per token) for language models; "effective vocabulary size" per token step.
- Channel capacity — maximum reliable throughput through a noisy channel (less common in day-to-day ML, foundational for coding theory).
Shannon entropy
For discrete distribution P over outcomes x:
H(P) = − Σ_x P(x) log P(x)
Entropy is maximized when all outcomes are equally likely. A fair eight-sided die has entropy log₂(8) = 3 bits per roll. A die that always shows 4 has entropy 0 — there is nothing new to learn.
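The die examples can be checked numerically. This is a small sketch using only the standard library; the helper name `entropy` and the three example distributions are assumptions made for illustration.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution; zero-probability terms contribute 0."""
    if not math.isclose(sum(probs), 1.0, abs_tol=1e-9):
        raise ValueError("probabilities must sum to 1")
    return -sum(p * math.log(p) for p in probs if p > 0) / math.log(base)

fair_die = [1 / 8] * 8                  # all eight faces equally likely
stuck_die = [0, 0, 0, 1, 0, 0, 0, 0]    # always shows 4
loaded_coin = [0.99, 0.01]              # heads 99% of the time

h_fair = entropy(fair_die)      # equals log2(8)
h_stuck = entropy(stuck_die)    # nothing left to learn
h_coin = entropy(loaded_coin)   # far below the 1 bit of a fair coin
```

Skipping zero-probability outcomes follows the usual convention that 0 · log 0 is treated as 0.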
Interpretation as compression: you cannot losslessly encode samples from P using fewer than H(P) bits per symbol on average (Shannon's source coding theorem). Huffman and arithmetic codes approach this bound. In ML, low-entropy label distributions (99% negatives in fraud detection) mean most rows are predictable — and classifiers can look accurate while contributing little information per prediction.
Joint and conditional entropy: H(X, Y) measures surprise of pairs; H(Y | X) measures remaining surprise in Y after seeing X. The chain rule ties them: H(X, Y) = H(X) + H(Y | X). Conditional entropy is what decision trees try to shrink at each split.
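To make the chain rule concrete, the sketch below verifies H(X, Y) = H(X) + H(Y | X) on a small joint table. The joint probabilities are hypothetical values chosen for illustration, and `H` is a local helper.

```python
import math

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical joint distribution P(X, Y) over a binary feature X and a binary label Y.
joint = {(0, 0): 0.5, (0, 1): 0.1, (1, 0): 0.1, (1, 1): 0.3}

# Marginal P(X), obtained by summing the joint over Y.
p_x = {}
for (x, _), p in joint.items():
    p_x[x] = p_x.get(x, 0.0) + p

h_xy = H(joint.values())
h_x = H(p_x.values())

# H(Y | X) is the P(x)-weighted average of the entropy of Y within each value of X.
h_y_given_x = sum(
    px * H([joint[(x, y)] / px for y in (0, 1)])
    for x, px in p_x.items()
)

assert math.isclose(h_xy, h_x + h_y_given_x)
```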
Cross-entropy and KL divergence
Cross-entropy compares a true distribution P to a model Q:
H(P, Q) = − Σ_x P(x) log Q(x)
In supervised classification, P is usually a one-hot label (probability 1 on the correct class) and Q is the model's softmax output. Minimizing cross-entropy pushes Q to assign high probability to the true class — see our cross-entropy deep dive for binary and categorical formulas.
KL divergence measures the extra cost of using Q when truth is P:
D_KL(P || Q) = H(P, Q) − H(P)
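Because H(P) is fixed for a given dataset, minimizing cross-entropy is equivalent to minimizing KL divergence from empirical labels to the model. KL is asymmetric: D_KL(P || Q) ≠ D_KL(Q || P) in general. "Forward" KL, D_KL(P || Q), is mass-covering: Q is penalized heavily wherever P has mass and Q does not, so Q spreads out to cover all of P. "Reverse" KL, D_KL(Q || P), is mode-seeking: Q is penalized for placing mass where P has little, so it can collapse onto a single mode of P. Variational inference and distillation pick the direction deliberately.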
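The relationship between cross-entropy, entropy, and KL can be sketched in a few lines. The distributions `p` and `q` below are made-up examples, and the function names are illustrative rather than taken from a library.

```python
import math

def entropy(p):
    """H(P) in nats."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0)

def cross_entropy(p, q):
    """H(P, Q) in nats; infinite if Q gives zero mass where P is positive."""
    total = 0.0
    for pi, qi in zip(p, q):
        if pi > 0:
            if qi == 0:
                return math.inf
            total -= pi * math.log(qi)
    return total

def kl_divergence(p, q):
    """D_KL(P || Q) computed as H(P, Q) - H(P)."""
    h_pq = cross_entropy(p, q)
    return h_pq if math.isinf(h_pq) else h_pq - entropy(p)

p = [0.7, 0.2, 0.1]
q = [0.5, 0.3, 0.2]
forward = kl_divergence(p, q)   # D_KL(P || Q)
reverse = kl_divergence(q, p)   # D_KL(Q || P), a different number

# With a one-hot label, H(P) = 0, so cross-entropy and KL coincide
# and both reduce to -log of the probability given to the true class.
one_hot = [0.0, 1.0, 0.0]
loss = cross_entropy(one_hot, q)
```

The one-hot case is why classification training can be described as either cross-entropy minimization or KL minimization.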
LLM connection: training minimizes cross-entropy over next-token distributions. At eval, perplexity is exp(average cross-entropy in nats) or, equivalently, 2^(average cross-entropy in bits); lower means the model is less surprised by held-out text. Perplexity correlates with fluency but not factuality; pair it with human eval or grounded benchmarks.
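A short sketch of the perplexity calculation, assuming you already have per-token natural-log probabilities from a model. The numbers in `token_logprobs_nats` are invented for illustration.

```python
import math

# Hypothetical natural-log probabilities a model assigned to each held-out token.
token_logprobs_nats = [-0.2, -1.6, -0.05, -2.3, -0.7]

avg_ce_nats = -sum(token_logprobs_nats) / len(token_logprobs_nats)
avg_ce_bits = avg_ce_nats / math.log(2)

# Both routes give the same perplexity as long as base and exponent match.
ppl_from_nats = math.exp(avg_ce_nats)
ppl_from_bits = 2 ** avg_ce_bits
assert math.isclose(ppl_from_nats, ppl_from_bits)

# Sanity check: a model that is uniform over V tokens has perplexity V.
vocab_size = 50
uniform_ppl = math.exp(-math.log(1 / vocab_size))
```

The uniform case is where the "effective vocabulary size" reading of perplexity comes from.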
Mutual information
Mutual information quantifies how much knowing X reduces uncertainty about Y:
I(X; Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y)
I(X; Y) = 0 if and only if X and Y are independent. It is symmetric and non-negative, but unlike correlation it captures nonlinear relationships — useful when features interact in fraud patterns or sensor fusion.
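The contrast with correlation can be illustrated with a toy case where Y is a deterministic but nonlinear function of X. This is a sketch under assumed, hand-built joint tables; `mutual_information` and `covariance` are local helpers, not library calls.

```python
import math
from collections import defaultdict

def mutual_information(joint):
    """I(X; Y) in bits from a dict {(x, y): probability}."""
    p_x, p_y = defaultdict(float), defaultdict(float)
    for (x, y), p in joint.items():
        p_x[x] += p
        p_y[y] += p
    return sum(
        p * math.log2(p / (p_x[x] * p_y[y]))
        for (x, y), p in joint.items() if p > 0
    )

def covariance(joint):
    """Cov(X, Y) for numeric outcomes under the same joint table."""
    ex = sum(p * x for (x, _), p in joint.items())
    ey = sum(p * y for (_, y), p in joint.items())
    return sum(p * (x - ex) * (y - ey) for (x, y), p in joint.items())

# Y = X**2 with X uniform on {-1, 0, 1}: fully dependent, yet uncorrelated.
squared = {(-1, 1): 1 / 3, (0, 0): 1 / 3, (1, 1): 1 / 3}
# Two independent fair coins.
independent = {(x, y): 0.25 for x in (0, 1) for y in (0, 1)}

mi_squared = mutual_information(squared)
cov_squared = covariance(squared)
mi_independent = mutual_information(independent)
```

Here the covariance is zero while the mutual information equals the full entropy of Y, because knowing X removes all uncertainty about Y.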
Decision trees pick splits that maximize information gain — essentially mutual information between a candidate feature and the label (equivalently, reduction in label entropy). Gini impurity is a related heuristic; entropy-based splits align more directly with information theory.
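Information gain for a candidate split can be sketched as parent entropy minus the size-weighted entropy of the children. The toy labels and features below are hypothetical, and real tree libraries add many details (thresholds for numeric features, regularization) that this omits.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature, labels):
    """Parent entropy minus the size-weighted entropy of each child partition."""
    n = len(labels)
    children = {}
    for f, y in zip(feature, labels):
        children.setdefault(f, []).append(y)
    weighted = sum(len(ys) / n * label_entropy(ys) for ys in children.values())
    return label_entropy(labels) - weighted

# Hypothetical toy rows: a binary label and two candidate binary features.
labels = [1, 1, 1, 0, 0, 0, 0, 0]
separates = [1, 1, 1, 0, 0, 0, 0, 0]   # perfectly aligned with the label
unrelated = [1, 0, 1, 0, 1, 0, 1, 0]   # only weakly related to the label

gain_separates = information_gain(separates, labels)
gain_unrelated = information_gain(unrelated, labels)
```

A tree learner using this criterion would prefer `separates`, since it removes all label uncertainty in one split.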
Active learning uses entropy of the model's predicted class distribution as an uncertainty score: high-entropy points are where the model is most unsure and labels are most informative. See active learning explained for pool-based query strategies.
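Entropy-based uncertainty sampling can be sketched as a simple ranking. The pool item names and predicted probabilities here are invented; a real pipeline would take them from the model's output on unlabeled data.

```python
import math

def predictive_entropy(probs):
    """Entropy in bits of a predicted class distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical pool of unlabeled items with model-predicted class probabilities.
pool = {
    "a": [0.98, 0.01, 0.01],   # model is nearly certain
    "b": [0.40, 0.35, 0.25],   # model is torn between classes
    "c": [0.70, 0.20, 0.10],   # somewhere in between
}

# Query the most uncertain items first.
ranked = sorted(pool, key=lambda k: predictive_entropy(pool[k]), reverse=True)
```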
Representation learning: contrastive methods like InfoNCE maximize a lower bound on mutual information between augmented views of the same image. VAEs maximize an ELBO, which is an expected reconstruction log-likelihood minus a KL term to a prior; written as a loss, that is reconstruction error (often a cross-entropy) plus KL.
Where information theory shows up in ML pipelines
- Classification training — cross-entropy / log loss; calibrated probabilities via temperature scaling.
- Language models — token cross-entropy loss; perplexity on validation corpora; bits-per-byte for compression comparisons.
- Decision trees and random forests — entropy or Gini impurity for split quality.
- Feature selection — mutual information between each feature and target (handles nonlinear effects better than Pearson r).
- Clustering evaluation — normalized mutual information (NMI) between cluster assignments and ground truth.
- Model distillation — KL from teacher softmax to student softmax transfers "dark knowledge" in probability tails.
- Variational inference — ELBO = expected log-likelihood minus KL(q || prior); balances fit and regularization.
- Anomaly detection — low likelihood (high surprise) under a density model flags outliers.
Decoding strategies in production LLMs reshape Q at inference without retraining — temperature and top-p change the effective distribution from which tokens are sampled. That is information theory applied at serving time, not just in the loss function. See LLM text decoding strategies.
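The effect of temperature and top-p on the sampling distribution can be seen by measuring its entropy. This is a sketch of one common formulation; the logits are hypothetical, `top_p_filter` is an illustrative helper, and serving frameworks differ in tie-breaking and edge cases.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax with max-subtraction for numerical stability."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def top_p_filter(probs, top_p):
    """Keep the smallest set of most likely tokens whose mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, mass = [], 0.0
    for i in order:
        kept.append(i)
        mass += probs[i]
        if mass >= top_p:
            break
    return [probs[i] / mass if i in kept else 0.0 for i in range(len(probs))]

logits = [2.0, 1.0, 0.5, -1.0]             # hypothetical next-token logits
sharp = softmax(logits, temperature=0.5)   # lower temperature concentrates mass
base = softmax(logits, temperature=1.0)
flat = softmax(logits, temperature=2.0)    # higher temperature spreads mass
nucleus = top_p_filter(base, top_p=0.8)    # truncation also removes entropy
```

Lower temperature and tighter top-p both reduce the entropy of the sampling distribution, which is one way to reason about repetitive versus diverse outputs.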
Worked example: Harbor Payments fraud scorer
Harbor Payments trains a gradient-boosted classifier on card transactions. The label distribution is heavily skewed: 0.3% fraud. A naive "always legitimate" baseline achieves 99.7% accuracy, yet H(Y) ≈ 0.029 bits: there is almost no surprise per row because negatives dominate.
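As a quick check of the label entropy in this scenario, the sketch below evaluates the binary entropy formula at the stated 0.3% fraud rate and compares it with a balanced problem. The helper name `binary_entropy_bits` is illustrative.

```python
import math

def binary_entropy_bits(p):
    """Entropy in bits of a Bernoulli(p) label."""
    if p in (0.0, 1.0):
        return 0.0
    return -(p * math.log2(p) + (1 - p) * math.log2(1 - p))

fraud_rate = 0.003                          # 0.3% positives, as in the example
h_y = binary_entropy_bits(fraud_rate)       # label entropy for the skewed problem
h_balanced = binary_entropy_bits(0.5)       # reference point: a 50/50 problem
baseline_accuracy = 1 - fraud_rate          # "always legitimate" classifier

print(f"H(Y) = {h_y:.4f} bits, balanced = {h_balanced:.1f} bit, "
      f"baseline accuracy = {baseline_accuracy:.1%}")
```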
The team tracks three information-theoretic metrics alongside precision-recall:
- Label entropy H(Y) — sanity-checks class balance; spikes if fraud rate shifts after a processor change.
- Validation cross-entropy — primary training objective; more sensitive than accuracy to confident false negatives.
- Mutual information I(features; Y) — estimated via sklearn's mutual_info_classif on binned numeric features; guides which raw fields justify real-time enrichment cost.
After adding merchant-category embeddings, cross-entropy on the holdout set dropped 8% while accuracy rose only 0.4 points — the model became more confident on the right rows, not merely more biased toward the majority class. For deployment review they threshold on expected cost (fraud dollars × recall) rather than accuracy alone.
Active-learning loop: transactions where predicted fraud probability is near 0.5 (maximum entropy for a binary classifier) are routed to human analysts first. Labeling those high-entropy points improved recall on novel fraud rings faster than random sampling — consistent with information gain theory.
Lesson: on imbalanced problems, entropy and cross-entropy tell you whether the model is learning signal or exploiting priors. Accuracy alone hides the difference.
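The lesson can be illustrated with two hypothetical models that have identical accuracy but very different cross-entropy. The scores below are invented, and `log_loss` and `accuracy` are minimal local helpers rather than library functions.

```python
import math

def log_loss(y_true, p_fraud, eps=1e-12):
    """Mean binary cross-entropy in nats; eps guards against log(0)."""
    total = 0.0
    for y, p in zip(y_true, p_fraud):
        p = min(max(p, eps), 1 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1 - p)
    return total / len(y_true)

def accuracy(y_true, p_fraud, threshold=0.5):
    """Fraction of rows where the thresholded score matches the label."""
    hits = sum((p >= threshold) == bool(y) for y, p in zip(y_true, p_fraud))
    return hits / len(y_true)

# Hypothetical labels (1 = fraud) and scores from two models that both miss the fraud row.
y_true = [0, 0, 0, 0, 1]
hesitant = [0.05, 0.05, 0.05, 0.05, 0.40]    # unsure about the fraud row
confident = [0.05, 0.05, 0.05, 0.05, 0.01]   # confidently wrong on the fraud row

acc_hesitant, acc_confident = accuracy(y_true, hesitant), accuracy(y_true, confident)
ll_hesitant, ll_confident = log_loss(y_true, hesitant), log_loss(y_true, confident)
```

Accuracy cannot tell these models apart, while cross-entropy penalizes the confident false negative much more heavily.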
Information metric decision table
| Metric | Measures | Use when | Watch out for |
|---|---|---|---|
| Entropy H(Y) | Label unpredictability | Baseline difficulty, detecting distribution shift | Does not reflect model quality |
| Cross-entropy / log loss | Model fit to labels | Training classifiers and LMs; probabilistic calibration checks | Scale varies with class count and base rate |
| KL divergence | Gap between two distributions | Distillation, VAE regularization, comparing softmax outputs | Asymmetric; infinite if Q assigns zero where P is positive |
| Mutual information | Feature–target dependency | Feature selection, tree splits, understanding representations | Hard to estimate in high dimensions; needs enough samples |
| Perplexity | Effective branching factor per token | Comparing language models on same tokenizer and corpus | Not comparable across tokenizers; ignores factuality |
| Information gain | Entropy reduction from a split | Decision tree / rule learning | Biased toward high-cardinality features unless corrected (for example with gain ratio) or pruned |
Common pitfalls
- Confusing bits and nats — PyTorch CrossEntropyLoss uses natural log; perplexity formulas often assume nats. Convert before comparing to published bit-rate numbers.
- Reporting accuracy on low-entropy labels — a majority-class classifier can look excellent while learning nothing informative.
- Treating KL as a distance — it is not symmetric and does not satisfy the triangle inequality; use Jensen-Shannon divergence if you need a symmetric measure.
- Estimating MI in high dimensions — plug-in estimators break with continuous features and small N; prefer k-NN MI estimators or binning with care.
- Comparing perplexity across tokenizers — a BPE model with 32k vocab and a byte-level model are not on the same scale.
- Ignoring calibration — low cross-entropy requires well-calibrated probabilities, not just correct argmax labels; check reliability diagrams.
- Equating low perplexity with useful outputs — models minimize surprise on training distribution, not user utility; pair with task metrics.
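For the pitfall about treating KL as a distance, here is a minimal sketch of the Jensen-Shannon divergence mentioned above. The distributions are made up; note that D_KL(Q || P) would be infinite for this pair because P assigns zero mass where Q is positive, while the Jensen-Shannon value stays finite.

```python
import math

def kl(p, q):
    """KL divergence in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of P and Q to their midpoint M."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

p = [0.9, 0.1, 0.0]
q = [0.1, 0.1, 0.8]

js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)   # same value: the measure is symmetric
```

With base-2 logs the Jensen-Shannon divergence is bounded by 1 bit, which makes it easier to compare across experiments than raw KL.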
Practitioner checklist
- Compute label entropy H(Y) before training to set realistic accuracy expectations on imbalanced data.
- Track cross-entropy on validation, not accuracy alone, for probabilistic classifiers and LMs.
- When distilling models, track KL(teacher || student) on a fixed eval shard to catch collapse early.
- Use entropy-based uncertainty for active-learning queues when labeling budget is limited.
- Run mutual-information feature screening before expensive embedding pipelines.
- Document log base (ln vs log₂) in experiment configs and published perplexity tables.
- For LLMs, report perplexity only on held-out text with the same tokenizer and preprocessing as training.
- Pair information metrics with business-cost thresholds — bits do not pay chargebacks directly.
Key takeaways
- Entropy is average surprise; it sets the compression limit and quantifies how predictable a distribution is.
- Cross-entropy trains most classifiers and language models; minimizing it is equivalent to minimizing KL divergence from labels to predictions.
- KL divergence compares distributions asymmetrically — direction matters in distillation and variational inference.
- Mutual information finds dependencies for feature selection, tree splits, and representation learning.
- On imbalanced problems, information metrics reveal whether your model learns signal or exploits the base rate.
Related reading
- Cross-entropy explained — binary and categorical loss formulas in practice
- Loss functions explained — when to use MSE, focal loss, and other objectives
- Active learning explained — entropy sampling and label-efficient training
- LLM text decoding strategies explained — reshaping output distributions at inference