Information theory explained

Your fraud model reports 94% accuracy but still misses the expensive chargebacks. Your language model's perplexity dropped, yet users say answers feel repetitive. Behind both symptoms sits the same toolkit: information theory, a precise language for surprise, compression, and how much one random variable tells you about another. Claude Shannon formalized it in 1948; today it underpins classification losses, decision-tree splits, active-learning query strategies, variational autoencoders, and LLM evaluation. This guide explains entropy, cross-entropy, Kullback-Leibler (KL) divergence, and mutual information with plain math, a Harbor Payments fraud-scorer worked example, a metric decision table, common pitfalls, and a practitioner checklist. It complements our cross-entropy guide, loss functions overview, and active learning guide.

What information theory measures

Information theory answers one question in many disguises: how surprising is an outcome? A fair coin flip carries one bit of surprise per flip — you cannot predict heads or tails. A loaded coin that lands heads 99% of the time carries almost zero surprise when it lands heads, but roughly 6.6 bits when it lands tails (because tails is rare).

Shannon expressed surprise as self-information for event x with probability p(x):

I(x) = −log₂ p(x) bits

Rare events are informative; common events carry little news. The log base sets the unit: natural log (nats) appears in PyTorch losses; base 2 (bits) appears in communication engineering; base 10 (hartleys) appears in some textbooks. The formulas differ only by a constant factor, so always check which log your library uses.
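
As a rough illustration of the self-information formula, the sketch below recomputes the coin numbers from the text using only Python's standard library. The function name self_information and its base parameter are illustrative choices, not part of any library.

```python
import math

def self_information(p: float, base: float = 2.0) -> float:
    """Surprise of an event with probability p, in units set by the log base."""
    if not 0.0 < p <= 1.0:
        raise ValueError("p must be in (0, 1]")
    return math.log(1.0 / p) / math.log(base)

fair_flip = self_information(0.5)                    # fair coin: 1 bit
loaded_heads = self_information(0.99)                # common outcome: close to 0 bits
loaded_tails = self_information(0.01)                # rare outcome: roughly 6.6 bits
loaded_tails_nats = self_information(0.01, math.e)   # same event measured in nats

print(f"{fair_flip:.3f} {loaded_heads:.4f} {loaded_tails:.2f} {loaded_tails_nats:.2f}")
```

The nats value equals the bits value multiplied by ln 2, which is the constant factor mentioned above.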

Core quantities

Entropy H(X) — average surprise of random variable X; lower bound on the average number of bits per symbol needed for lossless compression.
Cross-entropy H(P, Q) — average surprise when reality follows P but you encode with model Q.
KL divergence D_KL(P || Q) — extra bits wasted by using Q instead of the true P; always ≥ 0.
Mutual information I(X; Y) — reduction in uncertainty about Y after observing X; measures dependency.
Perplexity — 2^H for language models, where H is the per-token cross-entropy in bits; "effective vocabulary size" per token step.
Channel capacity — maximum reliable throughput through a noisy channel (less common in day-to-day ML, foundational for coding theory).

Shannon entropy

For discrete distribution P over outcomes x:

H(P) = − Σ_x P(x) log P(x)

Entropy is maximized when all outcomes are equally likely. A fair eight-sided die has entropy log₂(8) = 3 bits per roll. A die that always shows 4 has entropy 0 — there is nothing new to learn.

Interpretation as compression: you cannot losslessly encode samples from P using fewer than H(P) bits per symbol on average (Shannon's source coding theorem). Huffman and arithmetic codes approach this bound. In ML, low-entropy label distributions (99% negatives in fraud detection) mean most rows are predictable — and classifiers can look accurate while contributing little information per prediction.
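
A minimal sketch of the entropy formula, assuming distributions are passed as plain lists of probabilities. It checks the two dice from the text and a 99/1 label split; the helper name entropy is an illustrative choice.

```python
import math

def entropy(probs, base=2.0):
    """Shannon entropy of a discrete distribution; zero-probability outcomes add nothing."""
    return sum(p * math.log(1.0 / p) for p in probs if p > 0) / math.log(base)

fair_die = entropy([1 / 8] * 8)                  # uniform over 8 faces
fixed_die = entropy([0, 0, 0, 1, 0, 0, 0, 0])    # always shows 4
skewed_labels = entropy([0.99, 0.01])            # 99% negatives, 1% positives

print(f"fair die:      {fair_die:.3f} bits")
print(f"fixed die:     {fixed_die:.3f} bits")
print(f"skewed labels: {skewed_labels:.3f} bits")
```

The skewed label distribution comes out far below the 1 bit of a balanced binary label, which is why a majority-class prediction is right so often on such data.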

Joint and conditional entropy: H(X, Y) measures surprise of pairs; H(Y | X) measures remaining surprise in Y after seeing X. The chain rule ties them: H(X, Y) = H(X) + H(Y | X). Conditional entropy is what decision trees try to shrink at each split.
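
The chain rule can be checked numerically on a small joint distribution. The table below is made up for illustration, and the helper H is a hypothetical name.

```python
import math
from collections import defaultdict

def H(probs):
    """Entropy in bits of an iterable of probabilities."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical joint distribution P(X = x, Y = y).
joint = {("a", 0): 0.4, ("a", 1): 0.1, ("b", 0): 0.1, ("b", 1): 0.4}

# Marginal P(X = x), obtained by summing out Y.
p_x = defaultdict(float)
for (x, _), p in joint.items():
    p_x[x] += p

h_xy = H(joint.values())
h_x = H(p_x.values())

# H(Y | X) = sum over x of P(x) * H(Y | X = x)
h_y_given_x = 0.0
for x, px in p_x.items():
    conditional = [p / px for (xx, _), p in joint.items() if xx == x]
    h_y_given_x += px * H(conditional)

print(f"H(X,Y) = {h_xy:.4f}, H(X) + H(Y|X) = {h_x + h_y_given_x:.4f}")
```

Here Y alone is a fair coin (1 bit), while H(Y | X) is smaller, so observing X removes part of the uncertainty about Y.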

Cross-entropy and KL divergence

Cross-entropy compares a true distribution P to a model Q:

H(P, Q) = − Σ_x P(x) log Q(x)

In supervised classification, P is usually a one-hot label (probability 1 on the correct class) and Q is the model's softmax output. Minimizing cross-entropy pushes Q to assign high probability to the true class — see our cross-entropy deep dive for binary and categorical formulas.

KL divergence measures the extra cost of using Q when truth is P:

D_KL(P || Q) = H(P, Q) − H(P)

Because H(P) is fixed for a given dataset, minimizing cross-entropy is equivalent to minimizing KL divergence from empirical labels to the model. KL is asymmetric: in general D_KL(P || Q) ≠ D_KL(Q || P). Minimizing "forward" KL, D_KL(P || Q), is mass-covering: Q must put probability wherever P does, so it spreads across all of P's modes. Minimizing "reverse" KL, D_KL(Q || P), is mode-seeking: Q can concentrate on a single mode of P and ignore the rest. Variational inference and distillation pick direction deliberately.
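
The identity D_KL(P || Q) = H(P, Q) − H(P) and the asymmetry of KL can be seen on a toy example. The distributions p and q below are arbitrary illustrative values, and the function names are not from any library.

```python
import math

def cross_entropy(p, q):
    """H(P, Q) in bits. Raises ValueError if q is 0 where p > 0 (the infinite case)."""
    return -sum(pi * math.log2(qi) for pi, qi in zip(p, q) if pi > 0)

def entropy(p):
    return cross_entropy(p, p)

def kl_divergence(p, q):
    """D_KL(P || Q) = H(P, Q) - H(P), in bits."""
    return cross_entropy(p, q) - entropy(p)

p = [0.7, 0.2, 0.1]   # stand-in for the true distribution
q = [0.5, 0.3, 0.2]   # stand-in for the model

kl_pq = kl_divergence(p, q)   # about 0.123 bits
kl_qp = kl_divergence(q, p)   # about 0.133 bits, so direction matters

# With a one-hot label the entropy term is 0 and cross-entropy
# reduces to -log2 of the probability given to the true class.
one_hot_loss = cross_entropy([0.0, 1.0, 0.0], q)

print(f"{kl_pq:.3f} {kl_qp:.3f} {one_hot_loss:.3f}")
```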

LLM connection: training minimizes cross-entropy over next-token distributions. At eval, perplexity is exp(average cross-entropy in nats) or 2^{average bits} — lower means the model is less surprised by held-out text. Perplexity correlates with fluency but not factuality; pair it with human eval or grounded benchmarks.
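
To make the two perplexity formulas concrete, here is a small sketch. The per-token probabilities are invented for illustration; in practice they would come from the model's output on held-out text.

```python
import math

# Hypothetical probabilities the model assigned to each observed token.
token_probs = [0.25, 0.5, 0.125, 0.25]

nll_nats = [-math.log(p) for p in token_probs]
mean_nats = sum(nll_nats) / len(nll_nats)    # average cross-entropy in nats
mean_bits = mean_nats / math.log(2)          # the same quantity in bits

ppl_from_nats = math.exp(mean_nats)
ppl_from_bits = 2 ** mean_bits

print(f"mean bits/token: {mean_bits:.3f}")
print(f"perplexity: {ppl_from_nats:.3f} (nats) vs {ppl_from_bits:.3f} (bits)")
```

Both routes give the same perplexity as long as the exponent base matches the log base; mixing them (for example 2 raised to a loss measured in nats) gives a wrong number.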

Mutual information

Mutual information quantifies how much knowing X reduces uncertainty about Y:

I(X; Y) = H(Y) − H(Y | X) = H(X) + H(Y) − H(X, Y)

I(X; Y) = 0 if and only if X and Y are independent. It is symmetric and non-negative, and unlike correlation it captures nonlinear relationships, which is useful when features interact in fraud patterns or sensor fusion.

Decision trees pick splits that maximize information gain — essentially mutual information between a candidate feature and the label (equivalently, reduction in label entropy). Gini impurity is a related heuristic; entropy-based splits align more directly with information theory.
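
Information gain for a categorical split can be sketched as label entropy minus the weighted entropy of each branch. The tiny dataset and the function names below are illustrative assumptions, not the splitting code of any particular tree library.

```python
import math
from collections import Counter

def label_entropy(labels):
    """Entropy in bits of the empirical label distribution."""
    n = len(labels)
    return sum((c / n) * math.log2(n / c) for c in Counter(labels).values())

def information_gain(feature, labels):
    """H(Y) - H(Y | feature) for one categorical feature."""
    n = len(labels)
    branches = {}
    for value, y in zip(feature, labels):
        branches.setdefault(value, []).append(y)
    h_conditional = sum(len(b) / n * label_entropy(b) for b in branches.values())
    return label_entropy(labels) - h_conditional

labels = [1, 1, 1, 1, 0, 0, 0, 0]
perfect_feature = ["x"] * 4 + ["y"] * 4    # separates the classes exactly
useless_feature = ["x", "y"] * 4           # independent of the label

gain_perfect = information_gain(perfect_feature, labels)
gain_useless = information_gain(useless_feature, labels)
print(f"perfect: {gain_perfect:.3f} bits, useless: {gain_useless:.3f} bits")
```

The perfect feature recovers the full 1 bit of label entropy, while the independent feature gains nothing, matching the claim that I(X; Y) = 0 under independence.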

Active learning uses entropy of the model's predicted class distribution as an uncertainty score: high-entropy points are where the model is most unsure and labels are most informative. See active learning explained for pool-based query strategies.
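
A minimal sketch of entropy-based uncertainty sampling, assuming the model's predicted class distributions are already available. The pool contents and identifiers are hypothetical.

```python
import math

def predictive_entropy(probs):
    """Entropy in bits of one predicted class distribution."""
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

# Hypothetical unlabeled pool: item id -> predicted class probabilities.
pool = {
    "item_1": [0.98, 0.02],
    "item_2": [0.55, 0.45],
    "item_3": [0.80, 0.20],
}

# Most uncertain items first; these are the ones to send for labeling.
labeling_queue = sorted(pool, key=lambda k: predictive_entropy(pool[k]), reverse=True)
print(labeling_queue)
```

Sorting by entropy is only one query strategy; margin-based and committee-based scores rank points differently and may suit other settings better.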

Representation learning: contrastive methods like InfoNCE maximize a lower bound on mutual information between augmented views of the same image. VAEs maximize an ELBO that decomposes into a reconstruction term (an expected log-likelihood, i.e. a negative cross-entropy) minus a KL term to a prior.

Where information theory shows up in ML pipelines

Classification training — cross-entropy / log loss; calibrated probabilities via temperature scaling.
Language models — token cross-entropy loss; perplexity on validation corpora; bits-per-byte for compression comparisons.
Decision trees and random forests — entropy or Gini impurity for split quality.
Feature selection — mutual information between each feature and target (handles nonlinear effects better than Pearson r).
Clustering evaluation — normalized mutual information (NMI) between cluster assignments and ground truth.
Model distillation — KL from teacher softmax to student softmax transfers "dark knowledge" in probability tails.
Variational inference — ELBO = expected log-likelihood minus KL(q || prior); balances fit and regularization.
Anomaly detection — low likelihood (high surprise) under a density model flags outliers.

Decoding strategies in production LLMs reshape Q at inference without retraining — temperature and top-p change the effective distribution from which tokens are sampled. That is information theory applied at serving time, not just in the loss function. See LLM text decoding strategies.
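
To see how temperature reshapes the sampling distribution, the sketch below applies a temperature-scaled softmax to made-up logits and reports the resulting entropy. The logits and the three temperatures are illustrative assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax, subtracting the max for numerical stability."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy_bits(probs):
    return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

logits = [2.0, 1.0, 0.1]   # hypothetical next-token logits over a 3-token vocabulary
entropies = {t: entropy_bits(softmax(logits, t)) for t in (0.5, 1.0, 2.0)}

for t, h in entropies.items():
    print(f"temperature {t}: {h:.3f} bits")
```

Lower temperatures sharpen the distribution (less entropy, more repetitive output); higher temperatures flatten it toward the uniform limit of log₂(3) bits for this vocabulary.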

Worked example: Harbor Payments fraud scorer

Harbor Payments trains a gradient-boosted classifier on card transactions. The label distribution is heavily skewed: 0.3% fraud. A naive "always legitimate" baseline achieves 99.7% accuracy, yet H(Y) ≈ 0.029 bits: almost no surprise per row because negatives dominate.

The team tracks three information-theoretic metrics alongside precision-recall:

Label entropy H(Y) — sanity-checks class balance; spikes if fraud rate shifts after a processor change.
Validation cross-entropy — primary training objective; more sensitive than accuracy to confident false negatives.
Mutual information I(features; Y) — estimated via sklearn's mutual_info_classif on binned numeric features; guides which raw fields justify real-time enrichment cost.
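
The label-entropy check in the list above could look something like the following. The shifted fraud rate and the alert threshold are invented for illustration; the text does not say how Harbor Payments actually implements the monitor.

```python
import math

def binary_entropy(rate: float) -> float:
    """Entropy in bits of a binary label whose positive-class rate is `rate`."""
    return sum(p * math.log2(1.0 / p) for p in (rate, 1.0 - rate) if p > 0)

baseline_rate = 0.003   # 0.3% fraud, as in the example
shifted_rate = 0.006    # hypothetical rate after an upstream change

h_base = binary_entropy(baseline_rate)
h_shift = binary_entropy(shifted_rate)
relative_change = (h_shift - h_base) / h_base

ALERT_THRESHOLD = 0.25  # hypothetical: flag a 25% relative jump in label entropy
alert = relative_change > ALERT_THRESHOLD

print(f"H(Y) baseline: {h_base:.4f} bits, shifted: {h_shift:.4f} bits, alert: {alert}")
```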

After adding merchant-category embeddings, cross-entropy on the holdout set dropped 8% while accuracy rose only 0.4 points: the model became more confident on the right rows, not merely more biased toward the majority class. For deployment review they choose the threshold by expected fraud dollars caught (fraud dollars × recall) rather than accuracy alone.

Active-learning loop: transactions where predicted fraud probability is near 0.5 (maximum entropy for a binary classifier) are routed to human analysts first. Labeling those high-entropy points improved recall on novel fraud rings faster than random sampling — consistent with information gain theory.

Lesson: on imbalanced problems, entropy and cross-entropy tell you whether the model is learning signal or exploiting priors. Accuracy alone hides the difference.

Information metric decision table

Metric	Measures	Use when	Watch out for
Entropy H(Y)	Label unpredictability	Baseline difficulty, detecting distribution shift	Does not reflect model quality
Cross-entropy / log loss	Model fit to labels	Training classifiers and LMs; probabilistic calibration checks	Scale varies with class count and base rate
KL divergence	Gap between two distributions	Distillation, VAE regularization, comparing softmax outputs	Asymmetric; infinite if Q assigns zero where P is positive
Mutual information	Feature–target dependency	Feature selection, tree splits, understanding representations	Hard to estimate in high dimensions; needs enough samples
Perplexity	Effective branching factor per token	Comparing language models on same tokenizer and corpus	Not comparable across tokenizers; ignores factuality
Information gain	Entropy reduction from a split	Decision tree / rule learning	Biased toward high-cardinality features; gain ratio is a common correction

Common pitfalls

Confusing bits and nats — PyTorch CrossEntropyLoss uses natural log, so perplexity from its output is exp(loss), not 2^loss. Convert nats to bits (divide by ln 2) before comparing to published bit-rate numbers.
Reporting accuracy on low-entropy labels — a majority-class classifier can look excellent while learning nothing informative.
Treating KL as a distance — it is not symmetric and does not satisfy the triangle inequality; use Jensen-Shannon divergence if you need a symmetric measure.
Estimating MI in high dimensions — plug-in estimators break with continuous features and small N; prefer k-NN MI estimators or binning with care.
Comparing perplexity across tokenizers — a BPE model with 32k vocab and a byte-level model are not on the same scale.
Ignoring calibration — low cross-entropy requires well-calibrated probabilities, not just correct argmax labels; check reliability diagrams.
Equating low perplexity with useful outputs — models minimize surprise on training distribution, not user utility; pair with task metrics.
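
For the symmetric alternative to KL mentioned in the pitfalls, here is a minimal sketch of Jensen-Shannon divergence in bits. The distributions are illustrative, and the function names are not from any library.

```python
import math

def kl_bits(p, q):
    """D_KL(p || q) in bits; assumes q > 0 wherever p > 0."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js_divergence(p, q):
    """Jensen-Shannon divergence: average KL of p and q to their mixture m."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_bits(p, m) + 0.5 * kl_bits(q, m)

# q is zero where p has mass, so D_KL(p || q) would be infinite here,
# but the mixture m is positive wherever p or q is, so JS stays finite.
p = [0.9, 0.1, 0.0]
q = [0.0, 0.2, 0.8]

js_pq = js_divergence(p, q)
js_qp = js_divergence(q, p)
js_disjoint = js_divergence([1.0, 0.0], [0.0, 1.0])   # upper bound: 1 bit

print(f"{js_pq:.3f} {js_qp:.3f} {js_disjoint:.3f}")
```

With base-2 logs the value always lies between 0 and 1 bit, which makes it easier to compare across experiments than raw KL.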

Practitioner checklist

Compute label entropy H(Y) before training to set realistic accuracy expectations on imbalanced data.
Track cross-entropy on validation, not accuracy alone, for probabilistic classifiers and LMs.
When distilling models, log KL(teacher || student) on a fixed eval shard to catch collapse early.
Use entropy-based uncertainty for active-learning queues when labeling budget is limited.
Run mutual-information feature screening before expensive embedding pipelines.
Document log base (ln vs log₂) in experiment configs and published perplexity tables.
For LLMs, report perplexity only on held-out text with the same tokenizer and preprocessing as training.
Pair information metrics with business-cost thresholds — bits do not pay chargebacks directly.

Key takeaways

Entropy is average surprise; it sets the compression limit and quantifies how predictable a distribution is.
Cross-entropy trains most classifiers and language models; minimizing it is equivalent to minimizing KL divergence from labels to predictions.
KL divergence compares distributions asymmetrically — direction matters in distillation and variational inference.
Mutual information finds dependencies for feature selection, tree splits, and representation learning.
On imbalanced problems, information metrics reveal whether your model learns signal or exploits the base rate.