Learning curves explained

A learning curve plots how a model performs as you change a key variable, usually training set size, number of epochs, or model complexity. The gap between the training line and the validation line is one of the fastest ways to answer three expensive questions: Do I need more data? Is my model too simple or too complex? Should I keep training or stop? This guide covers epoch-based and data-size curves, how to read the classic underfitting and overfitting shapes, loss-versus-score orientation, a worked example for a Harbor Payments fraud scorer, a curve-pattern decision table, plotting tips, common pitfalls, and a practitioner checklist. For the theory behind the train-val gap, see bias-variance tradeoff explained; for how to split data and run k-fold checks, see overfitting and cross-validation.

What a learning curve measures

Every supervised model maps features to labels by minimizing a loss function or maximizing a metric (accuracy, AUC, F1). A learning curve tracks that score on two datasets as a knob turns:

Training score — performance on the data the model was fit on.
Validation score — performance on held-out data the model never saw during that fit.

The horizontal axis is usually one of:

Training set size: subsample 10%, 20%, … 100% of the available labeled rows and refit.
Epochs or iterations: keep training on the same data for more gradient-descent steps.
Model complexity: tree depth, polynomial degree, hidden-layer width.

Good curves are smooth, use the same metric on both lines, and come from a fixed validation set that never leaks into training — the same discipline as cross-validation.

Loss vs score orientation

Loss curves slope down as the model improves; accuracy and AUC slope up. Teams often plot validation loss alongside training loss for deep nets, and validation AUC for imbalanced classification. Always label the axis and direction so a rising line is not misread as failure.

Epoch learning curves (same data, more training)

During neural-network training, you typically log training and validation loss every epoch. Three patterns cover most debugging sessions:

Healthy convergence

Both lines fall. Validation loss tracks training loss with a modest, stable gap. Validation may flatten while training keeps dropping slightly — normal mild overfitting. This is where early stopping saves compute by halting at the validation minimum.
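The early-stopping idea can be sketched in a few lines. This is a minimal illustration, not any library's implementation: the function name `early_stop_epoch` and the `patience` and `min_delta` parameters are assumptions introduced here, and the loss values are made up.

```python
from typing import Optional, Sequence, Tuple

def early_stop_epoch(val_losses: Sequence[float], patience: int = 5,
                     min_delta: float = 0.0) -> Tuple[int, Optional[int]]:
    """Return (best_epoch, stop_epoch) as 0-based epoch indices.

    stop_epoch is None when the patience budget is never exhausted.
    patience and min_delta are hypothetical knobs for this sketch.
    """
    best_epoch = 0
    best_loss = float("inf")
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            # New validation minimum: remember it (a real loop would checkpoint here).
            best_loss = loss
            best_epoch = epoch
        elif epoch - best_epoch >= patience:
            # No improvement for `patience` epochs in a row: stop.
            return best_epoch, epoch
    return best_epoch, None

# Made-up validation losses: minimum at epoch 3, then a slow rise.
val = [0.90, 0.70, 0.60, 0.55, 0.56, 0.57, 0.58, 0.60]
best, stopped = early_stop_epoch(val, patience=3)
print(best, stopped)  # 3 6
```

Training halts at epoch 6, but the checkpoint worth keeping is the one from epoch 3, the validation minimum.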

Underfitting (high bias)

Both training and validation loss plateau at a high level. Adding epochs does not help. The model lacks capacity or useful features. Fixes: richer architecture, better feature engineering, lower regularization, or a different algorithm family.

Overfitting (high variance)

Training loss keeps falling while validation loss bottoms out and rises — the classic divergence. The model memorizes training noise. Fixes: more data, dropout or weight decay, simpler architecture, label smoothing, or early stopping at the validation minimum.
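The three patterns above can be turned into a rough rule-of-thumb check. The sketch below is only one way to encode them: `diagnose_epoch_curve`, `high_loss`, and `rise_tol` are hypothetical names, the thresholds are task-specific assumptions, and the loss values are made up.

```python
from typing import Sequence

def diagnose_epoch_curve(train_loss: Sequence[float], val_loss: Sequence[float],
                         high_loss: float, rise_tol: float = 0.01) -> str:
    """Label a pair of per-epoch loss curves (lower is better).

    high_loss: loss level the team treats as unacceptable (assumption).
    rise_tol:  how far final validation loss may sit above its minimum
               before it counts as divergence (assumption).
    """
    train, val = list(train_loss), list(val_loss)
    if not train or len(train) != len(val):
        raise ValueError("curves must be non-empty and the same length")
    best = val.index(min(val))                     # epoch of the validation minimum
    val_rose = val[-1] - val[best] > rise_tol      # validation climbed off its minimum
    train_still_falling = train[-1] < train[best]  # training kept improving afterwards
    if val_rose and train_still_falling:
        return "overfitting"
    if train[-1] >= high_loss and val[-1] >= high_loss:
        return "underfitting"
    return "healthy"

healthy = diagnose_epoch_curve([1.0, 0.6, 0.4, 0.35, 0.33],
                               [1.1, 0.7, 0.5, 0.45, 0.45], high_loss=0.8)
underfit = diagnose_epoch_curve([1.0, 0.95, 0.93, 0.92, 0.92],
                                [1.05, 1.0, 0.97, 0.96, 0.96], high_loss=0.8)
overfit = diagnose_epoch_curve([1.0, 0.6, 0.3, 0.15, 0.05],
                               [1.1, 0.7, 0.5, 0.6, 0.75], high_loss=0.8)
print(healthy, underfit, overfit)  # healthy underfitting overfitting
```

A check like this is a triage aid, not a replacement for looking at the plot; noisy curves may need smoothing before the rules are applied.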

Data-size learning curves (more labels, refit from scratch)

Data-size curves answer: Will labeling another 10,000 rows move the needle? For each subset size, train a fresh model (same hyperparameters) and score on a fixed validation set.
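The refit-per-subset loop can be sketched as follows. Everything here is illustrative: `data_size_curve` is a hypothetical helper, the nearest-centroid "model" and synthetic one-feature rows are toy stand-ins chosen so the example runs with the standard library only, and plain (unstratified) shuffling is used because the toy classes are balanced.

```python
import random
from typing import Callable, Dict, List, Sequence, Tuple

Row = Tuple[float, int]  # (feature, label): toy stand-in for real rows

def data_size_curve(train: Sequence[Row], val: Sequence[Row], sizes: Sequence[int],
                    fit: Callable[[Sequence[Row]], Dict[int, float]],
                    score: Callable[[Dict[int, float], Sequence[Row]], float],
                    seed: int = 0) -> List[Tuple[int, float, float]]:
    """Return (n_rows, train_score, val_score) for each subset size."""
    rng = random.Random(seed)
    shuffled = list(train)
    rng.shuffle(shuffled)              # shuffle once, then take nested prefixes
    results = []
    for n in sizes:
        subset = shuffled[:n]
        model = fit(subset)            # fresh model per size, same settings
        results.append((n, score(model, subset), score(model, val)))
    return results

def fit_centroids(rows: Sequence[Row]) -> Dict[int, float]:
    """Toy model: remember the mean feature value of each class."""
    means = {}
    for label in (0, 1):
        xs = [x for x, y in rows if y == label]
        means[label] = sum(xs) / len(xs) if xs else 0.0  # guard for an empty class
    return means

def accuracy(model: Dict[int, float], rows: Sequence[Row]) -> float:
    """Fraction of rows whose nearest class mean matches the label."""
    hits = sum(1 for x, y in rows
               if min(model, key=lambda c: abs(x - model[c])) == y)
    return hits / len(rows)

def make_rows(n: int, rng: random.Random) -> List[Row]:
    """Synthetic rows: class 0 centered at 0.0, class 1 centered at 2.0."""
    return [(rng.gauss(2.0 * (i % 2), 1.0), i % 2) for i in range(n)]

data_rng = random.Random(42)
train_rows = make_rows(200, data_rng)
val_rows = make_rows(100, data_rng)    # fixed validation set, never subsampled
curve = data_size_curve(train_rows, val_rows, [20, 50, 100, 200],
                        fit_centroids, accuracy)
for n, tr, va in curve:
    print(f"{n:>4} rows  train={tr:.3f}  val={va:.3f}")
```

The design point is that only the training subset changes between iterations; the validation rows and the fitting routine stay fixed so the scores are comparable across sizes.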

Both curves low and close — underfitting

Training and validation scores are poor even at 100% of current data, and the gap is small. More data alone will not fix this; you need a better model or features. This is high bias.

Large gap, validation still climbing — benefit from more data

Training score is high but validation lags and has not plateaued as you add rows. The validation curve is still rising at the right edge, a signal that additional labeled data is likely worth the cost. This is high variance: the model has the capacity to learn but needs more examples.

Both curves high and converged — done (for this model class)

Training and validation meet near the top of the plot and flatten. Marginal returns from more data are small. Shift effort to deployment, monitoring, or a different model family rather than endless labeling.
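The three data-size shapes described so far can be expressed as a small decision rule on the right edge of the curve. This is a sketch under stated assumptions: `diagnose_data_size_curve`, `good_score`, `gap_tol`, and `gain_tol` are hypothetical names, the thresholds must be chosen per task, and the scores are made up (higher is better).

```python
from typing import Sequence

def diagnose_data_size_curve(train_scores: Sequence[float], val_scores: Sequence[float],
                             good_score: float, gap_tol: float = 0.02,
                             gain_tol: float = 0.005) -> str:
    """Classify a data-size curve from its last two points.

    good_score: validation score the team would accept (assumption).
    gap_tol:    train-val gap treated as "close" (assumption).
    gain_tol:   validation gain from the last size step treated as "flat" (assumption).
    """
    if len(val_scores) < 2 or len(train_scores) != len(val_scores):
        raise ValueError("need at least two sizes and equal-length curves")
    gap = train_scores[-1] - val_scores[-1]
    last_gain = val_scores[-1] - val_scores[-2]
    if val_scores[-1] < good_score and gap <= gap_tol:
        return "underfitting"      # both lines poor and close
    if gap > gap_tol and last_gain > gain_tol:
        return "data-limited"      # wide gap, validation still climbing
    if gap <= gap_tol and last_gain <= gain_tol:
        return "converged"         # lines have met and flattened
    return "inconclusive"

a = diagnose_data_size_curve([0.72, 0.71, 0.70, 0.70], [0.66, 0.68, 0.69, 0.69], good_score=0.9)
b = diagnose_data_size_curve([0.99, 0.99, 0.98, 0.98], [0.80, 0.85, 0.89, 0.92], good_score=0.9)
c = diagnose_data_size_curve([0.97, 0.96, 0.955, 0.955], [0.90, 0.93, 0.945, 0.946], good_score=0.9)
print(a, b, c)  # underfitting data-limited converged
```

Using only the last two points keeps the rule simple; with noisy curves, averaging over several seeds first gives a steadier right-edge slope.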

Validation worse than a dummy baseline

If validation AUC is below 0.5 for binary classification or accuracy is below the majority-class rate, the curve is not an overfitting story — check labels, feature leakage, inverted class encoding, or a broken pipeline before tuning complexity.
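A baseline check of this kind needs only two numbers: the majority-class rate and the AUC. The sketch below computes both with the standard library; `majority_rate` and `auc` are helper names introduced here, the pairwise AUC is written for clarity rather than speed, and the labels and scores are made up.

```python
from typing import Sequence

def majority_rate(labels: Sequence[int]) -> float:
    """Accuracy of always predicting the most common class (binary 0/1 labels)."""
    pos = sum(labels)
    return max(pos, len(labels) - pos) / len(labels)

def auc(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Probability that a random positive outscores a random negative (ties count 0.5).

    Pairwise O(P*N) version, fine for a sanity check on small samples.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

labels = [0, 0, 0, 1, 1]
scores = [0.1, 0.2, 0.3, 0.8, 0.9]
flipped = [1 - y for y in labels]        # simulates an inverted class encoding

print(majority_rate(labels))             # 0.6
print(auc(labels, scores))               # 1.0
print(auc(flipped, scores))              # 0.0
```

An AUC far below 0.5 that becomes a strong AUC when the labels are flipped points to an encoding bug rather than to a modeling problem.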

Worked example: Harbor Payments fraud scorer

Harbor Payments trains a gradient-boosted tree classifier on 240,000 card transactions (0.8% fraud rate). They hold out 40,000 rows for validation and plot two curves.

Epoch curve (boosting rounds on the same 200k training rows): Training AUC reaches 0.994 by round 400; validation AUC peaks at 0.961 at round 180, then drifts down to 0.954. Diagnosis: mild overfitting after round 180. Action: early stopping at 180, increase L2 leaf regularization slightly, keep class weights for the rare positive class.

Data-size curve (refit at 25k / 50k / 100k / 200k rows): At 25k rows validation AUC is 0.91 with a wide train-val gap. At 200k rows validation reaches 0.961 and the gap narrows, but the validation line is still inching upward at the right edge. Diagnosis: model capacity is adequate; more historical fraud labels would help, but diminishing returns are near. Action: prioritize one more quarter of labeled data, then invest in feature store freshness and drift monitoring instead of a tenth labeling sprint.

Without curves, the team might have trained 800 useless extra rounds or paid for 500k labels that would not meaningfully beat 200k.
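The arithmetic behind the example is worth making explicit, because the number of fraud rows per slice is what the validation signal actually rests on. The sketch below uses only the figures quoted above and assumes the 0.8% fraud rate holds in every slice (that is, stratified sampling); the function name is hypothetical.

```python
# Figures quoted in the Harbor Payments example.
TOTAL_ROWS = 240_000
VAL_ROWS = 40_000
FRAUD_PER_THOUSAND = 8   # 0.8% fraud rate, kept as an integer to avoid float error

def expected_fraud_rows(n_rows: int) -> int:
    """Expected fraud count in a slice, assuming the 0.8% rate holds in that slice."""
    return n_rows * FRAUD_PER_THOUSAND // 1000

train_rows = TOTAL_ROWS - VAL_ROWS
fraud_by_slice = {n: expected_fraud_rows(n) for n in (25_000, 50_000, 100_000, 200_000)}
val_fraud = expected_fraud_rows(VAL_ROWS)

# Validation AUC given up by training past the peak at round 180.
auc_drift = round(0.961 - 0.954, 3)

print(train_rows)       # 200000
print(fraud_by_slice)   # {25000: 200, 50000: 400, 100000: 800, 200000: 1600}
print(val_fraud)        # 320
print(auc_drift)        # 0.007
```

Under that assumption the 25k slice holds only about 200 fraud rows, which may partly explain the wide train-val gap at the left edge of the data-size curve.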

Curve-pattern decision table

What you see	Likely cause	First lever to try
Train and val both high loss, small gap	Underfitting / high bias	More features, bigger model, less regularization
Train loss keeps falling, val loss rises after a minimum	Overfitting / high variance	Early stopping, dropout, more data, simpler model
Val score still climbing at max data size	Data-limited	Label more rows or use transfer / semi-supervised learning
Train perfect, val flat and poor	Severe overfit or leakage	Audit features; check duplicate rows across splits
Erratic val line, train smooth	Small validation set or noisy metric	Enlarge val set; use stratified sampling; smooth with rolling median
Both curves improve, then val suddenly degrades	LR too high or batch issues	Lower learning rate; check shuffling and batch-norm train/eval mode

Practical tips for plotting

Fix the validation set across all subset sizes so scores are comparable.
Stratify subsamples for imbalanced labels — a 10% slice might contain zero fraud rows otherwise.
Repeat with different seeds and plot mean plus shaded standard deviation; single runs lie on small data.
Match preprocessing — fit scalers and encoders only on each training subset, never on validation.
Use task-appropriate metrics — accuracy misleads on rare events; prefer AUC, PR-AUC, or F1 for Harbor-style fraud.
Log to experiment tracking — curves in MLflow or W&B survive team handoffs.
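Two of the tips above, stratified subsampling and averaging over seeds, can be sketched with the standard library alone. The helper names `stratified_subsample` and `mean_and_std` are assumptions for illustration, the label vector is synthetic (1% positives), and the three scores are made up.

```python
import random
import statistics
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple

def stratified_subsample(labels: Sequence[Hashable], fraction: float, seed: int) -> List[int]:
    """Return sorted row indices, sampling the same fraction from every class.

    Each class keeps at least one row so rare classes never vanish from a slice.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for i, y in enumerate(labels):
        by_class[y].append(i)
    picked: List[int] = []
    for idxs in by_class.values():
        k = max(1, round(len(idxs) * fraction))
        picked.extend(rng.sample(idxs, k))
    return sorted(picked)

def mean_and_std(scores: Sequence[float]) -> Tuple[float, float]:
    """Center line and band width for one x-position of the curve."""
    return statistics.mean(scores), statistics.stdev(scores)

labels = [0] * 990 + [1] * 10            # synthetic labels, 1% positives
idx = stratified_subsample(labels, fraction=0.1, seed=0)
print(len(idx), sum(labels[i] for i in idx))   # 100 1

# Made-up validation scores from three seeds at one subset size.
center, band = mean_and_std([0.91, 0.93, 0.92])
print(f"{center:.3f} +/- {band:.3f}")          # 0.920 +/- 0.010
```

Plotting `center` as the line and `center ± band` as the shaded region gives the mean-plus-standard-deviation view described in the tips.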

Common pitfalls

Scoring the training line on augmented samples: augmented inputs are usually harder than clean ones, so the training score is depressed and the apparent train-val gap shrinks, which hides overfitting; score both lines on unaugmented data.
Tuning hyperparameters while staring at the validation curve: the validation set effectively becomes training data; hold out an untouched test set for the final estimate.
Subsampling without stratification: tiny slices miss minority classes and produce nonsense curves.
Comparing curves across different metrics: train log-loss vs val AUC on the same chart misleads.
Ignoring time order: random subsampling on time-series data leaks future into past; use rolling splits.
Stopping at one lucky seed: neural nets vary from run to run; average at least three runs before declaring data sufficiency.
Expecting validation to match training: some gap is normal; chase divergence trends, not zero gap.
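For the time-order pitfall, one common form of time-respecting split is an expanding window: each validation block is later than all of its training rows. The sketch below shows that variant only (a fixed-width rolling window would also drop the oldest training rows); `expanding_window_splits` is a hypothetical helper and assumes rows are already sorted by timestamp.

```python
from typing import Iterator, Tuple

def expanding_window_splits(n_rows: int, n_splits: int,
                            val_size: int) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, val_indices) over rows sorted oldest to newest.

    Every validation index is greater than every training index in the same
    split, so no future row leaks into training.
    """
    first_val_start = n_rows - n_splits * val_size
    if first_val_start <= 0:
        raise ValueError("not enough rows for the requested splits")
    for k in range(n_splits):
        start = first_val_start + k * val_size
        yield range(0, start), range(start, start + val_size)

splits = list(expanding_window_splits(n_rows=10, n_splits=3, val_size=2))
for train_idx, val_idx in splits:
    print(list(train_idx), list(val_idx))
# [0, 1, 2, 3] [4, 5]
# [0, 1, 2, 3, 4, 5] [6, 7]
# [0, 1, 2, 3, 4, 5, 6, 7] [8, 9]
```

Because the training window grows with each split, the sequence of splits doubles as a coarse data-size curve on time-ordered data.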

Practitioner checklist

Plot training and validation metrics every epoch for deep models; save the checkpoint from the epoch with the lowest validation loss.
Run a data-size curve before a major labeling budget request — show stakeholders the projected lift.
Keep validation fixed; refit scalers and encoders inside each training fold only.
Stratify subsamples; verify minority classes exist in every slice.
Compare against a dummy baseline on the same chart — curves below baseline mean pipeline bugs.
Document hyperparameters held constant across the curve — changing LR per slice invalidates comparison.
Pair curves with k-fold cross-validation for final model selection.
After deployment, monitor production metrics — offline curves do not guarantee live performance.

Key takeaways

Learning curves compare training vs validation performance as data, epochs, or complexity change.
A wide persistent gap with validation still rising suggests more labeled data will help.
Both lines poor and close means underfitting — invest in model capacity and features, not labels.
Validation loss turning upward while training loss keeps falling signals overfitting: stop early or regularize.
Curves turn subjective "maybe we need more data" debates into evidence for machine learning planning and budget decisions.