
Class imbalance in machine learning explained

Class imbalance occurs when one label dominates the training distribution — fraud detection where 0.5% of transactions are fraudulent, medical screening where disease is rare, or spam filters where junk mail is a small fraction of inbox volume. A naive classifier can achieve high accuracy by always predicting the majority class while catching zero positives of interest. Real-world costs are asymmetric: a missed fraud case costs far more than a false alarm. This guide explains how to diagnose imbalance, choose the right precision-recall metrics, apply data-level resampling (random oversampling, undersampling, SMOTE), algorithm-level fixes (class weights, focal loss), tune decision thresholds, avoid leakage in stratified splits, and pick techniques by use case — with a fraud-detection worked example and a production checklist.

Why accuracy lies on skewed data

Consider a dataset with 10,000 rows: 9,900 negatives and 100 positives (1% prevalence). A model that outputs negative for every row scores 99% accuracy and is useless. The machine learning fundamentals you learned on balanced toy datasets break down here: a standard loss averages over all rows, so the majority class dominates the gradient unless you intervene.
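The arithmetic above can be checked in a few lines. This is a minimal sketch using plain Python lists; the always-negative "model" is just a list of zeros.

```python
# Labels from the example: 9,900 negatives (0) and 100 positives (1).
y_true = [0] * 9_900 + [1] * 100

# A "model" that always predicts the majority class.
y_pred = [0] * len(y_true)

correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

true_positives = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
recall = true_positives / sum(y_true)

print(f"accuracy={accuracy:.2%} recall={recall:.2%}")
# prints: accuracy=99.00% recall=0.00%
```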

Imbalance severity is usually expressed as the imbalance ratio (majority count divided by minority count) or prevalence (minority fraction). Ratios above 10:1 warrant attention; ratios above 100:1 (common in fraud and anomaly detection) require deliberate strategy. Before changing the model, quantify business cost: what is the price of a false negative versus a false positive? That cost matrix drives threshold selection more than any single metric.
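As a rough sketch, both severity measures and a simple cost total can be computed as follows. The function names and the cost figures (100 per false negative, 1 per false positive) are illustrative assumptions, not values from any real system.

```python
from collections import Counter


def imbalance_summary(labels):
    """Return (imbalance_ratio, prevalence); assumes both classes are present."""
    counts = Counter(labels)
    majority = max(counts.values())
    minority = min(counts.values())
    return majority / minority, minority / len(labels)


def total_cost(false_negatives, false_positives, cost_fn, cost_fp):
    """Total business cost of a set of errors under a simple cost matrix."""
    return false_negatives * cost_fn + false_positives * cost_fp


labels = [0] * 9_900 + [1] * 100
ratio, prevalence = imbalance_summary(labels)

# Hypothetical costs: the always-negative model misses all 100 positives.
cost_all_negative = total_cost(false_negatives=100, false_positives=0,
                               cost_fn=100, cost_fp=1)
print(ratio, prevalence, cost_all_negative)
```

Putting a number on the errors, even a crude one, makes it easier to compare models that trade false positives for false negatives.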

Metrics that survive imbalance

Replace raw accuracy with metrics that penalize the errors you care about. Precision (of predicted positives, how many are correct?) and recall (of actual positives, how many did you find?) decompose the confusion matrix into actionable numbers. The ROC-AUC measures ranking quality across thresholds but can look optimistic when negatives dominate — a high AUC with near-zero recall at the default 0.5 cutoff is a classic trap.
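To make the two definitions concrete, here is a small sketch that derives precision and recall from confusion-matrix counts. The helper names and the ten-row toy data are made up for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels where 1 is the positive class."""
    tp = sum(t == 1 and p == 1 for t, p in zip(y_true, y_pred))
    fp = sum(t == 0 and p == 1 for t, p in zip(y_true, y_pred))
    fn = sum(t == 1 and p == 0 for t, p in zip(y_true, y_pred))
    tn = sum(t == 0 and p == 0 for t, p in zip(y_true, y_pred))
    return tp, fp, fn, tn


def precision_recall(tp, fp, fn):
    """Precision: correct share of predicted positives. Recall: share of actual positives found."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
precision, recall = precision_recall(tp, fp, fn)
print(tp, fp, fn, tn, precision, recall)
```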

PR-AUC (area under the precision-recall curve) focuses on the positive class and is often more informative on heavy skew. Also track balanced accuracy (the average of per-class recall), the Matthews correlation coefficient (MCC) for a single balanced score, and F-beta when recall matters more than precision (beta > 1) or precision matters more than recall (beta < 1). Always report metrics on a held-out test set that reflects production prevalence, not on a resampled training distribution.
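The count-based scores named above can be sketched directly from the confusion matrix. The function names are illustrative; the example evaluates the always-negative model on 9,900 negatives and 100 positives to show that these scores, unlike accuracy, expose it.

```python
import math


def f_beta(tp, fp, fn, beta=1.0):
    """F-beta score; beta > 1 weights recall more heavily than precision."""
    b2 = beta ** 2
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def balanced_accuracy(tp, fp, fn, tn):
    """Average of recall on the positive class and recall on the negative class."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2


def mcc(tp, fp, fn, tn):
    """Matthews correlation coefficient; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Always-negative model: no predicted positives at all.
tp, fp, fn, tn = 0, 0, 100, 9_900
print(balanced_accuracy(tp, fp, fn, tn))  # 0.5, no better than chance
print(mcc(tp, fp, fn, tn))                # 0.0
print(f_beta(tp, fp, fn, beta=2.0))       # 0.0
```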

Data-level fixes: resampling the training set

Random oversampling

Duplicate minority-class rows until the classes are balanced. This is simple and works with any algorithm, but duplicates raise overfitting risk because the model can memorize the repeated minority examples. Combine it with strong regularization, and oversample only after splitting so that copies of one row never land in both training and validation folds.
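A minimal sketch of random oversampling, assuming binary labels and at least one minority row. The function name is hypothetical; libraries such as imbalanced-learn provide production-grade versions.

```python
import random


def random_oversample(rows, labels, minority_label=1, seed=0):
    """Duplicate randomly chosen minority rows until both classes have equal counts."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_count = len(labels) - len(minority_idx)
    n_extra = max(majority_count - len(minority_idx), 0)
    # Sample with replacement from the minority rows.
    extra = [rng.choice(minority_idx) for _ in range(n_extra)]
    idx = list(range(len(labels))) + extra
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


rows = [[i] for i in range(20)]
labels = [0] * 18 + [1] * 2
new_rows, new_labels = random_oversample(rows, labels)
print(new_labels.count(0), new_labels.count(1))
```

Run this on the training split only; the validation and test splits keep their original rows.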

Random undersampling

Discard majority-class rows until balanced. Fast and reduces training time, but throws away information. Viable when you have millions of negatives and a small, informative minority — use stratified sampling to preserve feature distributions.
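The mirror-image operation can be sketched the same way. The function name is an assumption for illustration, and the sampling here is uniform rather than stratified by any feature.

```python
import random


def random_undersample(rows, labels, minority_label=1, seed=0):
    """Keep all minority rows and an equally sized random subset of majority rows."""
    rng = random.Random(seed)
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    majority_idx = [i for i, y in enumerate(labels) if y != minority_label]
    # Sample without replacement so no majority row is repeated.
    n_keep = min(len(minority_idx), len(majority_idx))
    kept_majority = rng.sample(majority_idx, k=n_keep)
    idx = minority_idx + kept_majority
    rng.shuffle(idx)
    return [rows[i] for i in idx], [labels[i] for i in idx]


rows = [[i] for i in range(20)]
labels = [0] * 18 + [1] * 2
small_rows, small_labels = random_undersample(rows, labels)
print(small_labels.count(0), small_labels.count(1))
```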

SMOTE and variants

Synthetic Minority Over-sampling Technique (SMOTE) creates new minority examples by interpolating between a minority point and one of its nearest minority-class neighbors in feature space rather than copying rows. This reduces memorization but can generate unrealistic points when features are categorical or high-dimensional. Variants include Borderline-SMOTE (synthesizes only near the class boundary), ADASYN (generates more synthetic points for minority examples surrounded by majority neighbors), and SMOTE-NC for mixed numeric/categorical columns. Apply SMOTE only inside cross-validation folds on the training split; never apply it to the full dataset before splitting, or synthetic points interpolated from one split end up in the other.
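The interpolation step at the heart of SMOTE can be sketched as below. This is a simplified illustration for purely numeric features with Euclidean distance, not the reference implementation; the function name and toy points are assumptions.

```python
import math
import random


def smote_sketch(minority_rows, n_synthetic, k=5, seed=0):
    """Create synthetic rows on the segments between minority rows and their neighbors."""
    if len(minority_rows) < 2:
        raise ValueError("need at least two minority rows to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority_rows) - 1)
    synthetic = []
    for _ in range(n_synthetic):
        i = rng.randrange(len(minority_rows))
        base = minority_rows[i]
        # The k nearest minority neighbors of the base row, excluding itself.
        others = [row for j, row in enumerate(minority_rows) if j != i]
        others.sort(key=lambda row: math.dist(base, row))
        neighbor = rng.choice(others[:k])
        # Pick a random point on the segment from base toward neighbor.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbor)])
    return synthetic


minority = [[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]]
new_points = smote_sketch(minority, n_synthetic=10, k=2)
print(len(new_points))
```

Because every synthetic point lies between two real minority rows, the method assumes that such in-between points are plausible, which is the assumption that fails for categorical columns.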

Hybrid and ensemble sampling

Pipelines like SMOTE + Tomek links (remove noisy majority neighbors after oversampling) or SMOTE + ENN clean decision boundaries. Ensemble methods such as BalancedRandomForest or EasyEnsemble train multiple models on different undersampled subsets and aggregate — often strong baselines without manual ratio tuning.

Algorithm-level fixes: change the loss, not the data

Class weights

Most libraries let you upweight minority errors in the loss: scikit-learn estimators accept class_weight='balanced' or an explicit weight mapping, and XGBoost and LightGBM expose scale_pos_weight for binary problems. In logistic regression, this reweights each row's contribution to the log-loss gradient; in boosted trees it scales each row's gradient and Hessian, which changes split gains and leaf values. Weights are often preferable to resampling when data is large: no synthetic rows, no discarded examples. Tune weights against validation PR-AUC, not training loss.
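To show what the 'balanced' setting amounts to, the sketch below applies the heuristic scikit-learn documents for it, n_samples / (n_classes * class_count), and plugs the result into a weighted log-loss. The function names are hypothetical and the code is a plain-Python illustration, not the library's implementation.

```python
import math
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {cls: n_samples / (n_classes * count) for cls, count in counts.items()}


def weighted_log_loss(y_true, p_pred, weights, eps=1e-12):
    """Weighted average binary log-loss; p_pred is the predicted P(y = 1)."""
    total, weight_sum = 0.0, 0.0
    for y, p in zip(y_true, p_pred):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        loss = -math.log(p) if y == 1 else -math.log(1 - p)
        total += weights[y] * loss
        weight_sum += weights[y]
    return total / weight_sum


labels = [0] * 9_900 + [1] * 100
weights = balanced_class_weights(labels)
print(weights[1], weights[1] / weights[0])
```

With 1% prevalence, each positive row counts about 99 times as much as a negative row, which is the same ratio you would pass as scale_pos_weight.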

Cost-sensitive learning

Encode asymmetric misclassification costs directly: a false negative on fraud might cost 100x a false positive. Some frameworks accept a cost matrix; others approximate it via class weights derived from cost ratios.
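One way to use a cost ratio directly is to compare the expected cost of each decision. The sketch below assumes calibrated probabilities and zero cost for correct decisions; under those assumptions, flagging is cheaper whenever P(positive) exceeds cost_fp / (cost_fp + cost_fn). The function names and the 100:1 costs are illustrative.

```python
def cost_optimal_threshold(cost_fp, cost_fn):
    """Probability above which flagging has lower expected cost than ignoring."""
    return cost_fp / (cost_fp + cost_fn)


def expected_costs(p_positive, cost_fp, cost_fn):
    """Return (expected cost if flagged, expected cost if ignored) for one case."""
    cost_if_flagged = (1 - p_positive) * cost_fp  # wrong only if truly negative
    cost_if_ignored = p_positive * cost_fn        # wrong only if truly positive
    return cost_if_flagged, cost_if_ignored


threshold = cost_optimal_threshold(cost_fp=1, cost_fn=100)
flag_cost, ignore_cost = expected_costs(0.05, cost_fp=1, cost_fn=100)
print(round(threshold, 4), flag_cost, ignore_cost)
```

With a 100:1 cost ratio the break-even probability is about 1%, far below the default 0.5 cutoff, so even a 5% fraud score is worth flagging.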

Focal loss

Popular in object detection, focal loss down-weights easy, already well-classified examples (on skewed data, mostly majority-class rows) so gradients focus on hard cases, typically the minority. It is useful in deep learning when resampling batches is awkward. Pair it with careful learning-rate scheduling, because the modulating factor changes the gradient scale, especially early in training.
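For a single example, the binary focal loss can be written as -alpha_t * (1 - p_t)^gamma * log(p_t), where p_t is the probability assigned to the true class. The sketch below is a scalar illustration rather than a framework loss; gamma=2 and alpha=0.25 are the commonly cited defaults, and the function name is an assumption.

```python
import math


def binary_focal_loss(y, p, gamma=2.0, alpha=0.25, eps=1e-12):
    """Focal loss for one example; p is the predicted P(y = 1)."""
    p = min(max(p, eps), 1 - eps)             # clip to avoid log(0)
    p_t = p if y == 1 else 1 - p              # probability given to the true class
    alpha_t = alpha if y == 1 else 1 - alpha  # class-balancing factor
    return -alpha_t * (1 - p_t) ** gamma * math.log(p_t)


easy_negative = binary_focal_loss(y=0, p=0.01)  # confidently correct
hard_positive = binary_focal_loss(y=1, p=0.10)  # confidently wrong
print(easy_negative, hard_positive)
```

Setting gamma to 0 removes the modulating factor and leaves an alpha-weighted cross-entropy, which is a convenient sanity check when wiring the loss into a training loop.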

Threshold tuning

Classifiers output probabilities; the default 0.5 cutoff is rarely optimal. Sweep thresholds on a validation set to maximize F-beta, hit a target recall (e.g. catch 95% of fraud), or minimize expected cost. In production, thresholds drift as prevalence shifts — monitor calibration and re-tune quarterly.
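A threshold sweep for a recall target can be sketched as follows. The scores and labels are invented toy values standing in for a validation set, and the helper names are hypothetical.

```python
def precision_recall_at(y_true, scores, threshold):
    """Precision and recall when scores >= threshold are predicted positive."""
    tp = sum(y == 1 and s >= threshold for y, s in zip(y_true, scores))
    fp = sum(y == 0 and s >= threshold for y, s in zip(y_true, scores))
    fn = sum(y == 1 and s < threshold for y, s in zip(y_true, scores))
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall


def highest_threshold_for_recall(y_true, scores, target_recall):
    """Scan candidate cutoffs from high to low; return the first that meets the target."""
    for threshold in sorted(set(scores), reverse=True):
        precision, recall = precision_recall_at(y_true, scores, threshold)
        if recall >= target_recall:
            return threshold, precision, recall
    return None


y_true = [0, 0, 0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.02, 0.05, 0.10, 0.12, 0.20, 0.30, 0.35, 0.40, 0.60, 0.90]

default_precision, default_recall = precision_recall_at(y_true, scores, 0.5)
tuned = highest_threshold_for_recall(y_true, scores, target_recall=1.0)
print(default_recall, tuned)
```

Recall can only stay flat or rise as the cutoff drops, so the first cutoff that meets the target is also the one that gives up the least precision among the candidates.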

Worked example: credit-card fraud at 1% prevalence

A bank has 1 million transactions per month; 10,000 are fraudulent. A gradient-boosted tree with default settings achieves 99.1% accuracy but recall of 12% — catching only 1,200 fraud cases. The remediation path:

Baseline metrics: Report PR-AUC and recall at fixed precision (e.g. precision ≥ 0.80) instead of accuracy.
Class weights: Set scale_pos_weight=99 (ratio of negatives to positives). Recall rises to 78% at the cost of more false positives.
Threshold sweep: Lower cutoff from 0.5 to 0.15 to hit 95% recall; route borderline cases to human review.
Time-based split: Train on months 1–10, validate on month 11, test on month 12. Never shuffle randomly across months, because that leaks future transactions into training (temporal leakage).
Monitor: Track precision@recall=0.95 weekly; alert if it drops by 5 percentage points (a concept drift signal).
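The baseline figures in this example pin down the rest of the confusion matrix. The sketch below works through that arithmetic, assuming the stated accuracy and recall are exact; the false-positive count and precision it derives are implied by those numbers rather than quoted from the example.

```python
total = 1_000_000
positives = 10_000
negatives = total - positives

baseline_recall = 0.12
baseline_accuracy = 0.991

tp = round(baseline_recall * positives)          # fraud cases caught
fn = positives - tp                              # fraud cases missed
errors = round((1 - baseline_accuracy) * total)  # all misclassified rows
fp = errors - fn                                 # false alarms
precision = tp / (tp + fp)

# Ratio of negatives to positives, as used for scale_pos_weight.
scale_pos_weight = negatives / positives

print(tp, fn, fp, round(precision, 3), scale_pos_weight)
# prints: 1200 8800 200 0.857 99.0
```

The default model is precise but misses 8,800 of 10,000 fraud cases, which is why the remediation steps trade some precision for recall.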

SMOTE was skipped here because tree models on 1M rows with class weights outperformed synthetic oversampling in validation — a common outcome when the minority class is not extremely rare and features are heterogeneous.

Decision table: which technique when

Situation	Start with	Avoid
Mild imbalance (5:1 to 20:1)	Class weights + threshold tuning	Heavy SMOTE (adds noise)
Severe imbalance (100:1+)	Class weights + PR-AUC; consider undersampling ensembles	Accuracy as primary metric
Small dataset (< 5k rows)	SMOTE inside CV + regularized logistic regression	Deep nets without augmentation
Large tabular data (1M+ rows)	Class weights in boosted trees	Full-dataset SMOTE (prohibitively slow)
Deep learning / images	Weighted loss, focal loss, weighted sampling	Naive random oversampling of tensors
Cost asymmetry known	Cost matrix + threshold optimization	Default 0.5 cutoff

Common pitfalls

Resampling before split: SMOTE on the full dataset leaks synthetic points into validation — always resample inside training folds only.
Evaluating on resampled test data: Metrics on balanced test sets misrepresent production performance. Keep test prevalence realistic.
Ignoring temporal structure: Fraud and churn models need time-based splits, not random stratified K-fold.
Chasing ROC-AUC alone: High AUC with useless recall at operational thresholds wastes engineering time.
Duplicate oversampling without regularization: Training loss hits zero while validation recall stalls — classic overfit signature.
Changing prevalence in production: A model tuned for 1% fraud may degrade when attack rates spike; monitor and retrain.
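The first two pitfalls can be avoided by resampling inside each fold. The sketch below uses index lists, simple duplicate oversampling, and stratified folds built by hand; all function names are hypothetical, and for temporal data the folds would need to respect time order instead.

```python
import random


def stratified_folds(labels, n_folds, seed=0):
    """Deal each class's shuffled indices round-robin into n_folds folds."""
    rng = random.Random(seed)
    folds = [[] for _ in range(n_folds)]
    for cls in sorted(set(labels)):
        cls_idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(cls_idx)
        for pos, i in enumerate(cls_idx):
            folds[pos % n_folds].append(i)
    return folds


def oversample_indices(train_idx, labels, minority_label=1, seed=0):
    """Add duplicate minority indices to the training indices only."""
    rng = random.Random(seed)
    minority = [i for i in train_idx if labels[i] == minority_label]
    majority = [i for i in train_idx if labels[i] != minority_label]
    if not minority:
        return list(train_idx)
    n_extra = max(len(majority) - len(minority), 0)
    return list(train_idx) + [rng.choice(minority) for _ in range(n_extra)]


labels = [0] * 90 + [1] * 10
folds = stratified_folds(labels, n_folds=5)

splits = []
for k, val_idx in enumerate(folds):
    train_idx = [i for f, fold in enumerate(folds) if f != k for i in fold]
    train_resampled = oversample_indices(train_idx, labels, seed=k)
    # The validation fold is never resampled, so it keeps its real prevalence.
    splits.append((train_resampled, val_idx))

print([(len(train), len(val)) for train, val in splits])
```

Because duplication happens after the split, no copy of a validation row can appear in the training indices.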

Practitioner checklist

Measure imbalance ratio and document business cost of FP vs FN.
Report PR-AUC, recall@precision, and MCC — not accuracy alone.
Use stratified or time-aware splits; never leak resampled data across folds.
Try class weights first on large tabular data before SMOTE.
If using SMOTE, apply only within training folds via a pipeline.
Sweep decision thresholds on validation data against business KPIs.
Keep test-set prevalence representative of production.
Log score distributions and calibration plots before deployment.
Set alerts for metric drift — imbalance problems worsen silently.
Compare against a majority-class baseline to prove model value.
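For the time-aware split item, a minimal sketch might partition records by a month field. The record layout (a dict with "month" and "is_fraud" keys) and the function name are assumptions made for illustration.

```python
def time_based_split(records, train_months, val_months, test_months):
    """Partition records by month without shuffling across time."""
    train = [r for r in records if r["month"] in train_months]
    val = [r for r in records if r["month"] in val_months]
    test = [r for r in records if r["month"] in test_months]
    return train, val, test


# Hypothetical data: 100 records per month, one of them fraudulent.
records = [
    {"month": month, "is_fraud": int(i == 0)}
    for month in range(1, 13)
    for i in range(100)
]

train, val, test = time_based_split(records, range(1, 11), [11], [12])
print(len(train), len(val), len(test))
# prints: 1000 100 100
```

Every training record predates every validation record, and every validation record predates every test record, which mirrors how the model will be used after deployment.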

Key takeaways

High accuracy can hide zero recall when the minority class is rare — always inspect the confusion matrix.
Fix the objective with class weights, a cost-sensitive loss, or resampling; combining class weights with threshold tuning is often enough.
SMOTE helps small tabular sets but must stay inside cross-validation folds to prevent leakage.
PR-AUC and recall@precision usually beat ROC-AUC for communicating value on skewed data.
Production success depends on threshold policy and drift monitoring, not just training-time sampling tricks.