
Precision, recall and F1 score explained

When only 0.1% of transactions are fraudulent, a fraud detector that never flags anything scores 99.9% accuracy and catches zero fraud. Accuracy hides failure when one class is rare. Precision asks: of everything we flagged positive, how many were actually positive? Recall asks: of all real positives, how many did we find? F1 combines both into a single number when you need one headline metric. This guide walks through the confusion matrix, when to optimize each metric, threshold tuning, and how to evaluate models on imbalanced data: the bread and butter of supervised classification.

The confusion matrix

Every binary classifier produces four counts when compared to ground truth:

True positives (TP) — correctly predicted positive.
True negatives (TN) — correctly predicted negative.
False positives (FP) — predicted positive, actually negative (Type I error).
False negatives (FN) — predicted negative, actually positive (Type II error).

From these four cells you derive every classification metric. The layout puts actual labels on one axis and predicted labels on the other; which axis is which varies between tools and textbooks, so check the convention before reading off the cells. Once you can read a confusion matrix, precision and recall stop being abstract formulas.
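As a minimal sketch of how the four cells are counted, the snippet below tallies them from two lists of binary labels. The function name `confusion_counts` and the sample labels are made up for illustration, and 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred):
    """Return (tp, tn, fp, fn) for binary labels where 1 is the positive class."""
    tp = tn = fp = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == 1 and actual == 1:
            tp += 1          # correctly predicted positive
        elif predicted == 0 and actual == 0:
            tn += 1          # correctly predicted negative
        elif predicted == 1 and actual == 0:
            fp += 1          # false alarm (Type I error)
        else:
            fn += 1          # miss (Type II error)
    return tp, tn, fp, fn


# Made-up labels for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

tp, tn, fp, fn = confusion_counts(y_true, y_pred)
print(tp, tn, fp, fn)  # 3 3 1 1
```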

Precision

Precision = TP / (TP + FP) — also called positive predictive value. It answers: when the model says "yes," how often is it right? High precision means few false alarms. A medical screening tool that cries wolf on healthy patients has low precision; doctors lose trust and patients get unnecessary procedures.

Recall

Recall = TP / (TP + FN), also called sensitivity or true positive rate. It answers: of all actual positives, what fraction did we catch? High recall means few misses. A spam filter with low recall lets phishing through to your inbox; a cancer detector with low recall misses tumors.
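The precision and recall formulas translate directly into code. This is a small sketch with hypothetical counts; returning 0.0 when the denominator is zero is one common convention, not the only one.

```python
def precision(tp, fp):
    # With no positive predictions precision is undefined; 0.0 is one convention.
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    # With no actual positives recall is undefined; 0.0 is one convention.
    return tp / (tp + fn) if (tp + fn) else 0.0


# Hypothetical counts: 100 items flagged, 80 of them correctly, 120 positives missed.
tp, fp, fn = 80, 20, 120

p = precision(tp, fp)  # 80 / 100 = 0.8
r = recall(tp, fn)     # 80 / 200 = 0.4
print(p, r)
```

The model in this example is right most of the time when it says "yes" but still misses more than half of the real positives, which is why the two numbers are reported separately.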

Accuracy and why it misleads

Accuracy = (TP + TN) / (TP + TN + FP + FN). When 99% of transactions are legitimate, a model that always predicts "legitimate" hits 99% accuracy while catching no fraud. For skewed datasets, report precision, recall, and F1 alongside accuracy — or drop accuracy entirely.
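The arithmetic behind that claim is short enough to check by hand. The sketch below assumes a hypothetical batch of 1,000 transactions with 1% fraud and a model that always predicts "legitimate".

```python
# Hypothetical batch: 990 legitimate transactions, 10 fraudulent (1% fraud).
n_legit, n_fraud = 990, 10

# A model that always predicts "legitimate" never produces a positive.
tp, fp = 0, 0
tn, fn = n_legit, n_fraud

accuracy = (tp + tn) / (tp + tn + fp + fn)  # 990 / 1000 = 0.99
recall = tp / (tp + fn)                     # 0 / 10 = 0.0

print(f"accuracy={accuracy:.2f} recall={recall:.2f}")
```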

The precision-recall tradeoff

Precision and recall usually pull in opposite directions. Raise the decision threshold (require higher confidence before predicting positive) and you predict positive less often: false positives drop, so precision usually rises, while false negatives rise, so recall falls or at best holds steady. Lower the threshold and the reverse happens.
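To see the tradeoff on concrete numbers, the sketch below scores the same eight examples at two thresholds. The scores, labels, and the helper name `precision_recall_at` are invented for illustration.

```python
def precision_recall_at(scores, labels, threshold):
    """Predict positive when score >= threshold; return (precision, recall)."""
    tp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 1)
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


# Made-up model scores and true labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

loose = precision_recall_at(scores, labels, 0.5)    # precision 0.6, recall 0.75
strict = precision_recall_at(scores, labels, 0.85)  # precision 1.0, recall 0.5
print(loose, strict)
```

Raising the threshold from 0.5 to 0.85 removes both false positives but also drops one true positive, so precision goes up and recall goes down.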

There is no free lunch — you choose where to sit on the curve based on business cost:

Optimize precision when false positives are expensive: legal document review, premium upsell targeting, content moderation appeals.
Optimize recall when false negatives are dangerous: disease screening, safety defect detection, malware identification.
Balance both when neither error dominates: product categorization, sentiment tagging, image labeling for search.

Write down the cost of a false positive versus a false negative before you pick a threshold. A model team arguing about "better accuracy" without those numbers is optimizing the wrong thing.
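One way to make those costs actionable is to total them at each candidate threshold and pick the cheapest. The sketch below is an assumption about how a team might do this, not a standard recipe; the cost values, candidate thresholds, and function names are hypothetical.

```python
def total_cost(scores, labels, threshold, cost_fp, cost_fn):
    """Total error cost when predicting positive for score >= threshold."""
    fp = sum(1 for s, y in zip(scores, labels) if s >= threshold and y == 0)
    fn = sum(1 for s, y in zip(scores, labels) if s < threshold and y == 1)
    return fp * cost_fp + fn * cost_fn


def cheapest_threshold(scores, labels, candidates, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(candidates,
               key=lambda t: total_cost(scores, labels, t, cost_fp, cost_fn))


# Made-up validation scores and labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]
candidates = [0.25, 0.5, 0.85]

# Misses cost 10x a false alarm: the low threshold wins.
recall_first = cheapest_threshold(scores, labels, candidates, cost_fp=1, cost_fn=10)

# False alarms cost 10x a miss: the high threshold wins.
precision_first = cheapest_threshold(scores, labels, candidates, cost_fp=10, cost_fn=1)

print(recall_first, precision_first)  # 0.25 0.85
```

The same model ends up with very different operating points depending only on the cost ratio, which is the reason to agree on the costs first.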

F1 score: harmonic mean of precision and recall

F1 = 2 × (precision × recall) / (precision + recall)

F1 is the harmonic mean, not the arithmetic average. Harmonic mean punishes extreme imbalance: if precision is 1.0 and recall is 0.1, arithmetic mean is 0.55 but F1 is 0.18 — correctly signaling the model is useless for finding positives. Use F1 when you need a single number and both false positives and false negatives matter roughly equally.
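The numbers in that example are easy to reproduce. A minimal sketch, with the zero-denominator case returning 0.0 by convention:

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


p, r = 1.0, 0.1

arithmetic = (p + r) / 2  # 0.55
f1 = f1_score(p, r)       # 0.2 / 1.1, about 0.18

print(f"arithmetic={arithmetic:.2f} f1={f1:.2f}")
```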

F-beta scores

F-beta generalizes F1 by treating recall as β times as important as precision: Fβ = (1 + β²) × (precision × recall) / (β² × precision + recall). F2 counts recall as twice as important (good for screening). F0.5 counts precision as twice as important (good when false alarms are costly). F1 is the β = 1 special case.
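The F-beta formula can be sketched in a few lines. The precision and recall values below are hypothetical, chosen so that recall is the stronger of the two.

```python
def fbeta_score(precision, recall, beta):
    """F-beta from precision and recall; 0.0 when the denominator is zero."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    if denom == 0:
        return 0.0
    return (1 + b2) * precision * recall / denom


# Hypothetical model: mediocre precision, strong recall.
p, r = 0.5, 0.9

f_half = fbeta_score(p, r, 0.5)  # leans toward precision, about 0.55
f_one = fbeta_score(p, r, 1.0)   # plain F1, about 0.64
f_two = fbeta_score(p, r, 2.0)   # leans toward recall, about 0.78

print(f"{f_half:.2f} {f_one:.2f} {f_two:.2f}")
```

Because this model's recall is higher than its precision, the score rises as β grows; a model with the opposite profile would show the reverse ordering.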

Threshold tuning and probability outputs

Most classifiers output a probability or score, not a hard label. The default threshold is often 0.5, but that is rarely optimal. Plot precision and recall at every threshold on a precision-recall (PR) curve. The area under the PR curve (AUPRC) summarizes performance across thresholds and is more informative than ROC-AUC when positives are rare.
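For intuition about what a PR curve contains, the sketch below ranks examples by score, records precision and recall after each one, and summarizes the curve with the step-wise sum of (recall gain × precision), a summary often called average precision. It is a simplified illustration that assumes distinct scores and at least one positive; the data and function names are made up, and a library implementation is the better choice in practice.

```python
def pr_points(scores, labels):
    """(precision, recall) after each example, highest score first.

    Assumes distinct scores and at least one positive label.
    """
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    total_pos = sum(labels)
    points, tp = [], 0
    for k, (_, y) in enumerate(ranked, start=1):
        tp += y  # y is 1 for a positive, 0 otherwise
        points.append((tp / k, tp / total_pos))
    return points


def average_precision(points):
    """Sum of precision weighted by the recall gained at each step."""
    ap, prev_recall = 0.0, 0.0
    for precision, recall in points:
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Made-up scores and labels (1 = positive).
scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
labels = [1, 1, 0, 1, 0, 1, 0, 0]

points = pr_points(scores, labels)
ap = average_precision(points)
print(f"average precision = {ap:.3f}")  # 0.854
```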

Practical tuning workflow:

Split data with proper validation — see cross-validation to avoid leakage.
Train the model; collect predicted probabilities on the validation set.
Plot PR curve or sweep thresholds from 0.01 to 0.99.
Pick the threshold that maximizes F1, or meets a recall floor (e.g. recall ≥ 0.95) while maximizing precision.
Lock the threshold and evaluate once on the held-out test set.
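The threshold-selection step in this workflow can be sketched as a simple sweep. The example below implements the recall-floor variant on made-up validation scores; `pick_threshold` is a hypothetical helper, and a floor of 0.75 is used only because the toy data is too small for 0.95 to be interesting.

```python
def pick_threshold(scores, labels, recall_floor):
    """Sweep thresholds 0.01..0.99; return (threshold, precision, recall)
    with the highest precision among thresholds meeting the recall floor,
    or None if no threshold meets it."""
    best = None
    for i in range(1, 100):
        t = i / 100
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        fn = sum(1 for s, y in zip(scores, labels) if s < t and y == 1)
        p = tp / (tp + fp) if (tp + fp) else 0.0
        r = tp / (tp + fn) if (tp + fn) else 0.0
        if r >= recall_floor and (best is None or p > best[1]):
            best = (t, p, r)
    return best


# Made-up validation-set scores and labels (1 = positive).
val_scores = [0.95, 0.90, 0.80, 0.70, 0.60, 0.40, 0.30, 0.20]
val_labels = [1, 1, 0, 1, 0, 1, 0, 0]

choice = pick_threshold(val_scores, val_labels, recall_floor=0.75)
print(choice)
```

The returned threshold is then frozen and applied unchanged to the held-out test set; re-running the sweep on test data would leak information into the final estimate.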

In production, monitor precision and recall separately as class distribution drifts. A threshold that worked last quarter may be wrong after a product change floods the system with edge cases.

Multi-class and multi-label metrics

Binary formulas extend to multi-class problems through averaging strategies:

Macro average: compute precision/recall/F1 per class, then take the unweighted mean. Every class counts equally regardless of how frequent it is; use when every class matters (disease types, legal categories).
Micro average: pool all TP/FP/FN globally, then compute. Dominated by frequent classes; use when total error count matters (overall tagging volume).
Weighted average: compute per-class scores, then average them weighted by class support (the number of true instances of each class). A compromise when classes are imbalanced but rare classes should still contribute in proportion to their size.
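The three averaging strategies give noticeably different F1 values on the same predictions. The sketch below uses hypothetical per-class counts for a three-class problem with one rare, poorly handled class, and relies on the equivalent form F1 = 2TP / (2TP + FP + FN).

```python
def f1_from_counts(tp, fp, fn):
    """F1 written in terms of counts: 2TP / (2TP + FP + FN)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


# Hypothetical per-class counts: (tp, fp, fn). Support = tp + fn.
counts = {
    "a": (90, 6, 10),  # frequent class, handled well
    "b": (2, 2, 8),    # rare class, mostly missed
    "c": (28, 12, 2),
}

per_class_f1 = {c: f1_from_counts(*v) for c, v in counts.items()}
support = {c: tp + fn for c, (tp, fp, fn) in counts.items()}

# Macro: unweighted mean of per-class F1.
macro_f1 = sum(per_class_f1.values()) / len(per_class_f1)

# Micro: pool the counts across classes first, then compute F1 once.
micro_f1 = f1_from_counts(
    sum(v[0] for v in counts.values()),
    sum(v[1] for v in counts.values()),
    sum(v[2] for v in counts.values()),
)

# Weighted: per-class F1 averaged with class support as weights.
weighted_f1 = (
    sum(per_class_f1[c] * support[c] for c in counts) / sum(support.values())
)

print(f"macro={macro_f1:.3f} micro={micro_f1:.3f} weighted={weighted_f1:.3f}")
# macro=0.668 micro=0.857 weighted=0.848
```

The weak rare class "b" drags the macro score down sharply while barely moving the micro and weighted scores, which is exactly the difference between the strategies.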

Multi-label problems (one example can have several labels, like image tags) compute metrics per label and average, or use label-specific thresholds. The confusion matrix becomes more complex, but precision and recall definitions stay the same at each label.
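As a rough sketch of the per-label approach, the snippet below treats each label as its own binary problem over sets of tags. The image tags and the helper name `per_label_counts` are invented for illustration.

```python
def per_label_counts(true_sets, pred_sets, labels):
    """Return {label: (tp, fp, fn)}, treating each label as a binary problem."""
    counts = {}
    for label in labels:
        tp = fp = fn = 0
        for truth, pred in zip(true_sets, pred_sets):
            if label in pred and label in truth:
                tp += 1
            elif label in pred:
                fp += 1  # predicted the tag, but it does not apply
            elif label in truth:
                fn += 1  # tag applies, but was not predicted
        counts[label] = (tp, fp, fn)
    return counts


# Made-up image tags: each example can carry several labels.
true_sets = [{"cat", "indoor"}, {"dog"}, {"cat"}, {"dog", "indoor"}]
pred_sets = [{"cat"}, {"dog", "indoor"}, {"cat", "indoor"}, {"dog", "indoor"}]

counts = per_label_counts(true_sets, pred_sets, ["cat", "dog", "indoor"])
for label, (tp, fp, fn) in counts.items():
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    print(f"{label}: precision={precision:.2f} recall={recall:.2f}")
```

Here "cat" and "dog" are tagged perfectly while "indoor" is over-predicted, something a single pooled score would blur.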

Imbalanced data: what actually helps

Metrics alone do not fix imbalance. Common techniques used alongside precision/recall reporting:

Class weights — penalize mistakes on the minority class more heavily in the loss function.
Resampling — oversample minority or undersample majority (watch for overfitting on duplicates).
Better features — often the highest-leverage fix; see feature engineering for leakage-safe transforms.
Anomaly detection framing — when positives are extremely rare (<0.1%), treat the problem as novelty detection instead of classification.
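For the class-weights option, one widely used heuristic sets each class weight inversely proportional to its frequency: n_samples / (n_classes × class_count). The sketch below computes those weights for a made-up 95/5 label split; how the weights are passed to a loss function depends on the library and is not shown.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * n) for cls, n in counts.items()}


# Made-up labels: 95 negatives, 5 positives.
labels = [0] * 95 + [1] * 5

weights = balanced_class_weights(labels)
print(weights)  # class 1 gets weight 10.0, class 0 about 0.53
```

With these weights, an error on a minority example counts roughly 19 times as much as an error on a majority example, so the two classes contribute equally to the total loss.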

Always report per-class metrics, not just aggregates. A macro F1 of 0.85 can hide a minority class with 0.2 recall.

Production checklist

Define business costs for FP and FN before choosing a threshold.
Report precision, recall, and F1 — not accuracy alone on skewed data.
Use PR curves and AUPRC when positives are rare; ROC-AUC can look optimistic.
Tune thresholds on validation data; touch the test set once.
Track per-class metrics in production dashboards; alert on recall drops.
Re-evaluate after distribution shift (new user cohorts, seasonal spikes).
Document the chosen threshold and the rationale — future you will need it.

Key takeaways

Precision is the share of positive predictions that are correct, so it drops as false positives grow; recall is the share of actual positives the model finds, so it drops as false negatives grow.
F1 balances both via harmonic mean — punishes lopsided performance.
Threshold tuning moves you along the precision-recall tradeoff; 0.5 is rarely optimal.
Imbalanced data makes accuracy misleading; use PR curves and per-class metrics.
Macro vs micro averaging answers different questions in multi-class problems.