Precision, recall and F1 score explained
A fraud detector that never flags anything scores 99.9% accuracy when only 0.1% of transactions are fraudulent, and it catches zero fraud. Accuracy hides failure when one class is rare. Precision asks: of everything we flagged positive, how many were actually positive? Recall asks: of all real positives, how many did we find? F1 combines both into a single number when you need one headline metric. This guide walks through the confusion matrix, when to optimize each metric, threshold tuning, and how to evaluate models on imbalanced data: the bread and butter of supervised classification.
The confusion matrix
Every binary classifier produces four counts when compared to ground truth:
- True positives (TP) — correctly predicted positive.
- True negatives (TN) — correctly predicted negative.
- False positives (FP) — predicted positive, actually negative (Type I error).
- False negatives (FN) — predicted negative, actually positive (Type II error).
From these four cells you derive every classification metric. The layout is always the same: actual labels on one axis, predicted on the other. Once you can read a confusion matrix, precision and recall stop being abstract formulas.
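As a minimal sketch of how the four cells are counted, assuming labels are encoded as 1 (positive) and 0 (negative). The function name and the sample labels are hypothetical, chosen only for illustration:

```python
def confusion_counts(y_true, y_pred):
    """Count TP, TN, FP, FN for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1  # correctly predicted positive
        elif predicted == 0 and actual == 0:
            tn += 1  # correctly predicted negative
        elif predicted == 1 and actual == 0:
            fp += 1  # false alarm (Type I error)
        else:
            fn += 1  # miss (Type II error)
    return tp, tn, fp, fn


# Hypothetical ground truth and predictions for eight examples
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```

The four counts always sum to the number of examples, which is a quick sanity check on any confusion matrix you are handed.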
Precision
Precision = TP / (TP + FP) — also called positive predictive value. It answers: when the model says "yes," how often is it right? High precision means few false alarms. A medical screening tool that cries wolf on healthy patients has low precision; doctors lose trust and patients get unnecessary procedures.
Recall
Recall = TP / (TP + FN), also called sensitivity or true positive rate. It answers: of all actual positives, what fraction did we catch? High recall means few misses. A spam filter with low recall lets phishing through to your inbox; a cancer detector with low recall misses tumors.
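The two formulas can be sketched directly from the counts. The counts below are hypothetical, and returning 0.0 when a denominator is zero is a convention assumed here (the ratio is strictly undefined in that case):

```python
def precision(tp, fp):
    """TP / (TP + FP): how often a positive prediction is right."""
    predicted_positive = tp + fp
    # Assumed convention: 0.0 when the model predicted no positives at all
    return tp / predicted_positive if predicted_positive else 0.0


def recall(tp, fn):
    """TP / (TP + FN): what fraction of actual positives were caught."""
    actual_positive = tp + fn
    # Assumed convention: 0.0 when the data contains no positives
    return tp / actual_positive if actual_positive else 0.0


# Hypothetical counts: 80 true positives, 20 false alarms, 40 misses
p = precision(tp=80, fp=20)  # 0.8
r = recall(tp=80, fn=40)     # about 0.667
print(f"precision={p:.3f} recall={r:.3f}")
```

Note that true negatives appear in neither formula, which is why both metrics stay informative when negatives vastly outnumber positives.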
Accuracy and why it misleads
Accuracy = (TP + TN) / (TP + TN + FP + FN). When 99% of transactions are legitimate, a model that always predicts "legitimate" hits 99% accuracy while catching no fraud. For skewed datasets, report precision, recall, and F1 alongside accuracy — or drop accuracy entirely.
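A small sketch of that failure mode, using 1,000 made-up transactions of which 10 are fraudulent. The "always legitimate" model below is a stand-in, not a real classifier:

```python
# Hypothetical data: 1 = fraud, 0 = legitimate; 1% of transactions are fraud
y_true = [1] * 10 + [0] * 990
y_pred = [0] * 1000  # a "model" that always predicts legitimate

tp = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 1)
tn = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 0)
fp = sum(1 for a, p in zip(y_true, y_pred) if a == 0 and p == 1)
fn = sum(1 for a, p in zip(y_true, y_pred) if a == 1 and p == 0)

accuracy = (tp + tn) / (tp + tn + fp + fn)
recall = tp / (tp + fn)

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")  # accuracy=0.99 recall=0.00
```

Precision is left out on purpose: with no positive predictions its denominator is zero, which is itself a signal that the model never flags anything.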
The precision-recall tradeoff
Precision and recall usually pull in opposite directions. Raise the decision threshold (require higher confidence before predicting positive) and you predict positive less often: false positives drop, so precision typically rises, but false negatives rise, so recall falls or at best holds steady. Lower the threshold and the reverse happens.
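The effect of moving the threshold can be seen on a toy example. The scores and labels below are invented for illustration, and a score at or above the threshold is assumed to mean "predict positive":

```python
# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]


def precision_recall_at(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


for threshold in (0.25, 0.50, 0.85):
    p, r = precision_recall_at(labels, scores, threshold)
    print(f"threshold={threshold:.2f}  precision={p:.2f}  recall={r:.2f}")
# threshold=0.25  precision=0.71  recall=1.00
# threshold=0.50  precision=0.80  recall=0.80
# threshold=0.85  precision=1.00  recall=0.40
```

On this data precision climbs steadily as the threshold rises. On real data it can dip locally, while recall can only stay flat or fall.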
There is no free lunch — you choose where to sit on the curve based on business cost:
- Optimize precision when false positives are expensive: legal document review, premium upsell targeting, content moderation appeals.
- Optimize recall when false negatives are dangerous: disease screening, safety defect detection, malware identification.
- Balance both when neither error dominates: product categorization, sentiment tagging, image labeling for search.
Write down the cost of a false positive versus a false negative before you pick a threshold. A model team arguing about "better accuracy" without those numbers is optimizing the wrong thing.
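One way to make those costs concrete is to total them at each candidate threshold. This is a sketch under simple assumptions: each error has a fixed cost, correct predictions cost nothing, and the scores, labels, and cost figures are all hypothetical:

```python
# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]


def total_cost(labels, scores, threshold, cost_fp, cost_fn):
    """Total error cost when predicting positive for score >= threshold."""
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(labels, scores, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates, key=lambda t: total_cost(labels, scores, t, cost_fp, cost_fn))


candidates = (0.25, 0.50, 0.85)

# Misses are ten times worse than false alarms (screening-style costs)
recall_heavy = cheapest_threshold(labels, scores, candidates, cost_fp=1, cost_fn=10)

# False alarms are ten times worse than misses (review-queue-style costs)
precision_heavy = cheapest_threshold(labels, scores, candidates, cost_fp=10, cost_fn=1)

print(recall_heavy, precision_heavy)  # 0.25 0.85
```

The same model and the same scores lead to opposite threshold choices once the cost ratio flips, which is the point of writing the costs down first.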
F1 score: harmonic mean of precision and recall
F1 = 2 × (precision × recall) / (precision + recall)
F1 is the harmonic mean, not the arithmetic average. Harmonic mean punishes extreme imbalance: if precision is 1.0 and recall is 0.1, arithmetic mean is 0.55 but F1 is 0.18 — correctly signaling the model is useless for finding positives. Use F1 when you need a single number and both false positives and false negatives matter roughly equally.
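The numbers in that example can be checked with a few lines. The function name is illustrative, and returning 0.0 when both inputs are zero is an assumed convention:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0  # assumed convention when both inputs are zero
    return 2 * precision * recall / (precision + recall)


p, r = 1.0, 0.1
arithmetic = (p + r) / 2
harmonic = f1_score(p, r)

print(f"arithmetic={arithmetic:.2f} f1={harmonic:.2f}")  # arithmetic=0.55 f1=0.18
```

The harmonic mean is pulled toward the smaller of the two values, so a model cannot buy a good F1 by maximizing one metric and ignoring the other.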
F-beta scores
F-beta generalizes F1 by weighting recall beta times more than precision: Fβ = (1 + β²) × (precision × recall) / (β² × precision + recall). F2 weights recall twice as heavily (good for screening). F0.5 weights precision twice as heavily (good when false alarms are costly). F1 is the β = 1 special case.
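A short sketch of the formula, applied to a hypothetical model whose recall (0.9) is much higher than its precision (0.5). The zero-denominator fallback is an assumed convention:

```python
def f_beta(precision, recall, beta):
    """F-beta: (1 + beta^2) * P * R / (beta^2 * P + R)."""
    denominator = beta**2 * precision + recall
    if denominator == 0:
        return 0.0  # assumed convention when precision and recall are both zero
    return (1 + beta**2) * precision * recall / denominator


p, r = 0.5, 0.9  # hypothetical values

f_half = f_beta(p, r, beta=0.5)  # leans toward precision
f_one = f_beta(p, r, beta=1.0)   # plain F1
f_two = f_beta(p, r, beta=2.0)   # leans toward recall

print(f"F0.5={f_half:.3f} F1={f_one:.3f} F2={f_two:.3f}")
# F0.5=0.549 F1=0.643 F2=0.776
```

Because this model is stronger on recall, the score rises as β grows. A precision-heavy model would show the opposite ordering.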
Threshold tuning and probability outputs
Most classifiers output a probability or score, not a hard label. The default threshold is often 0.5, but that is rarely optimal. Plot precision and recall at every threshold on a precision-recall (PR) curve. The area under the PR curve (AUPRC) summarizes performance across thresholds and is more informative than ROC-AUC when positives are rare.
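One common way to estimate AUPRC is average precision: walk down the examples from highest score to lowest and average the precision at each rank where a true positive appears. The sketch below assumes no two examples share a score, and the data is hypothetical:

```python
def average_precision(labels, scores):
    """Step-wise area under the PR curve.

    Averages the precision at each rank where a positive is found.
    Assumes scores are distinct (ties would need grouping).
    """
    total_positives = sum(labels)
    if total_positives == 0:
        return 0.0  # assumed convention when there are no positives
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    true_positives = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            true_positives += 1
            precision_sum += true_positives / rank  # precision at this cutoff
    return precision_sum / total_positives


# Hypothetical model scores and true labels (1 = positive)
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

ap = average_precision(labels, scores)
print(f"average precision = {ap:.3f}")  # average precision = 0.853
```

A useful reference point: a ranking no better than random scores roughly the positive rate, so an AUPRC of 0.30 is strong when positives are 1% of the data and weak when they are 50%.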
Practical tuning workflow:
- Split data with proper validation — see cross-validation to avoid leakage.
- Train the model; collect predicted probabilities on the validation set.
- Plot the PR curve or sweep thresholds from 0.01 to 0.99.
- Pick the threshold that maximizes F1, or meets a recall floor (e.g. recall ≥ 0.95) while maximizing precision.
- Lock the threshold and evaluate once on the held-out test set.
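The threshold-selection steps above can be sketched as a simple sweep. The function names, the validation scores, and the tie-breaking rule (keep the lowest threshold among equals) are assumptions made for this illustration:

```python
def precision_recall_at(labels, scores, threshold):
    """Precision and recall when predicting positive for score >= threshold."""
    tp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(labels, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(labels, scores) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def pick_threshold(labels, scores, recall_floor=None):
    """Sweep thresholds 0.01..0.99 on validation data.

    Without a recall floor: maximize F1.
    With a recall floor: maximize precision among thresholds that meet it.
    Returns None if no threshold qualifies.
    """
    best_value, best_threshold = None, None
    for step in range(1, 100):
        threshold = step / 100
        p, r = precision_recall_at(labels, scores, threshold)
        if recall_floor is None:
            value = 2 * p * r / (p + r) if (p + r) else 0.0
        elif r >= recall_floor:
            value = p
        else:
            continue  # this threshold misses the recall floor
        if best_value is None or value > best_value:
            best_value, best_threshold = value, threshold
    return best_threshold


# Hypothetical validation-set scores and labels (1 = positive)
val_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20, 0.10, 0.05]
val_labels = [1, 1, 0, 1, 1, 0, 1, 0, 0, 0]

best_f1_threshold = pick_threshold(val_labels, val_scores)
floor_threshold = pick_threshold(val_labels, val_scores, recall_floor=0.8)
print(best_f1_threshold, floor_threshold)
```

Whichever rule you use, the chosen value is then frozen and applied once to the held-out test set. Re-running the sweep on test data would leak it into model selection.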
In production, monitor precision and recall separately as class distribution drifts. A threshold that worked last quarter may be wrong after a product change floods the system with edge cases.
Multi-class and multi-label metrics
Binary formulas extend to multi-class problems through averaging strategies:
- Macro average — compute precision/recall/F1 per class, then average. Treats rare classes equally; use when every class matters (disease types, legal categories).
- Micro average — pool all TP/FP/FN globally, then compute. Dominated by frequent classes; use when total error count matters (overall tagging volume).
- Weighted average — macro average weighted by class support. Compromise when classes are imbalanced but not ignored.
Multi-label problems (one example can have several labels, like image tags) compute metrics per label and average, or use label-specific thresholds. The confusion matrix becomes more complex, but precision and recall definitions stay the same at each label.
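To see how the three averaging strategies diverge, here is a sketch for a single-label, three-class problem. The class names and per-class counts are hypothetical:

```python
# Hypothetical per-class counts: (TP, FP, FN)
counts = {
    "cat": (90, 10, 5),
    "dog": (40, 5, 15),
    "bird": (2, 8, 3),  # rare class the model handles poorly
}


def f1_from_counts(tp, fp, fn):
    """F1 = 2TP / (2TP + FP + FN), equal to the harmonic mean of P and R."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


per_class_f1 = {name: f1_from_counts(*c) for name, c in counts.items()}

# Macro: plain average of per-class scores, every class counts equally
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Micro: pool the counts first, then compute one score
total_tp = sum(c[0] for c in counts.values())
total_fp = sum(c[1] for c in counts.values())
total_fn = sum(c[2] for c in counts.values())
micro_f1 = f1_from_counts(total_tp, total_fp, total_fn)

# Weighted: per-class scores weighted by support (actual examples per class)
support = {name: tp + fn for name, (tp, fp, fn) in counts.items()}
weighted_f1 = sum(per_class_f1[n] * support[n] for n in counts) / sum(support.values())

print(f"macro={macro_f1:.3f} micro={micro_f1:.3f} weighted={weighted_f1:.3f}")
# macro=0.663 micro=0.852 weighted=0.858
```

The weak "bird" class drags the macro average down to 0.663, while the micro and weighted averages stay above 0.85 because "cat" and "dog" supply most of the examples.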
Imbalanced data: what actually helps
Metrics alone do not fix imbalance. Common techniques used alongside precision/recall reporting:
- Class weights — penalize mistakes on the minority class more heavily in the loss function.
- Resampling — oversample minority or undersample majority (watch for overfitting on duplicates).
- Better features — often the highest-leverage fix; see feature engineering for leakage-safe transforms.
- Anomaly detection framing — when positives are extremely rare (<0.1%), treat the problem as novelty detection instead of classification.
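For the class-weights option, one widely used heuristic is inverse-frequency weighting: n_samples / (n_classes × class count), which is the formula scikit-learn documents for its "balanced" setting. The sketch below only computes the weights. How they enter the loss depends on your training library, and the label data is hypothetical:

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * count_for_class)."""
    class_counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(class_counts)
    return {
        cls: n_samples / (n_classes * count)
        for cls, count in class_counts.items()
    }


# Hypothetical training labels: 990 negatives, 10 positives
labels = [0] * 990 + [1] * 10
weights = balanced_class_weights(labels)

print(weights[1], round(weights[0], 3))  # 50.0 0.505
```

With these weights a mistake on a positive example counts about 99 times as much as a mistake on a negative one, which tends to raise recall at some cost to precision, so re-tune the threshold afterward.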
Always report per-class metrics, not just aggregates. A micro or weighted F1 of 0.85 can hide a minority class with 0.2 recall, and so can a macro F1 when there are many classes.
Production checklist
- Define business costs for FP and FN before choosing a threshold.
- Report precision, recall, and F1 — not accuracy alone on skewed data.
- Use PR curves and AUPRC when positives are rare; ROC-AUC can look optimistic.
- Tune thresholds on validation data; touch the test set once.
- Track per-class metrics in production dashboards; alert on recall drops.
- Re-evaluate after distribution shift (new user cohorts, seasonal spikes).
- Document the chosen threshold and the rationale — future you will need it.
Key takeaways
- Precision is the fraction of positive predictions that are correct; recall is the fraction of actual positives that the model finds.
- F1 balances both via harmonic mean — punishes lopsided performance.
- Threshold tuning moves you along the precision-recall tradeoff; 0.5 is rarely optimal.
- Imbalanced data makes accuracy misleading; use PR curves and per-class metrics.
- Macro vs micro averaging answers different questions in multi-class problems.
Related reading
- Machine learning fundamentals explained — supervised learning, loss functions, and baseline evaluation
- Overfitting and cross-validation explained — validation splits and model selection without leakage
- Feature engineering explained — encoding, scaling, and target leakage traps
- Computer vision fundamentals explained — mAP and detection metrics beyond binary classification