Guide
Continual learning explained
Harbor Payments' card-fraud model scored 0.91 recall on 2024 Visa and Mastercard patterns. In Q1 2026 the product team added buy-now-pay-later rails and a crypto on-ramp — new merchant categories, velocity signatures, and fraud rings the old model had never seen. The obvious fix was a full retrain on the combined dataset. Recall on BNPL fraud climbed to 0.88, but recall on legacy card skimming patterns fell from 0.91 to 0.74. The model had catastrophically forgotten older attack types while fitting the new distribution. That failure mode is why continual learning exists: update a model on streaming tasks without erasing performance on prior ones. This guide explains catastrophic forgetting, replay-based and regularization-based methods, architecture strategies from progressive networks to per-task adapters, evaluation metrics for lifelong learning, a Harbor Payments incremental fraud update worked example, an approach decision table, common pitfalls, and a production checklist.
What continual learning is (and how it differs from batch retraining)
Continual learning (also called lifelong or incremental learning) trains a model on a sequence of tasks or data distributions $T_1, T_2, \dots, T_n$, updating parameters after each stage while preserving utility on earlier tasks. The production reality that motivates it:
- New product surfaces introduce label spaces and feature distributions the model never saw at launch.
- Storing and retraining on the entire historical corpus every week is expensive, slow, and may violate data-retention policies.
- Full retrains in regulated domains require re-validation that blocks shipping for days.
Continual learning is not the same as transfer learning, which typically moves knowledge from a source task to a single target and stops. It is also not multi-task learning, where all tasks are available simultaneously during training. In continual learning, task $T_k$ may arrive only after the training data for $T_{k-1}$ is no longer accessible (the strict setting), or with limited access to old data (the relaxed setting most production teams actually operate in).
When you can retain all historical data and retrain from scratch on schedule, that is honest periodic batch retraining. It is often the right baseline. Continual methods earn their complexity when retrain latency, storage cost, or regulatory constraints make full refresh impractical — or when you need same-day adaptation to emerging fraud or abuse patterns.
Catastrophic forgetting: why new data erases old skills
Neural networks optimize a loss landscape shared across all parameters. When you fine-tune on task B alone, gradients push weights toward a minimum that fits B — often far from the minimum that also fit task A. Earlier task performance collapses even though the network could represent both if trained jointly. That is catastrophic forgetting.
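The effect can be reproduced on a toy problem. The sketch below is an illustration only, not Harbor's model: a two-weight linear model is trained on a hypothetical task A, then fine-tuned on a hypothetical task B alone, and compared with joint training on both. The function names, data points, learning rate, and step count are all assumptions chosen for the demonstration.

```python
def sgd(w, examples, steps=200, lr=0.1):
    """Per-example gradient descent on squared error for a 2-weight linear model."""
    w = list(w)
    for _ in range(steps):
        for x, y in examples:
            err = w[0] * x[0] + w[1] * x[1] - y
            w[0] -= lr * 2 * err * x[0]
            w[1] -= lr * 2 * err * x[1]
    return w


def loss(w, examples):
    """Mean squared error of the linear model on a list of (x, y) pairs."""
    return sum((w[0] * x[0] + w[1] * x[1] - y) ** 2 for x, y in examples) / len(examples)


# Hypothetical tasks. Weights (1, -1) fit both, so a joint solution exists.
task_a = [((1.0, 0.0), 1.0)]
task_b = [((1.0, 1.0), 0.0)]

w_after_a = sgd([0.0, 0.0], task_a)               # learn A
w_sequential = sgd(w_after_a, task_b)             # then fine-tune on B alone
w_joint = sgd([0.0, 0.0], task_a + task_b)        # train on A and B together

loss_a_before = loss(w_after_a, task_a)           # near zero: A is learned
loss_a_after = loss(w_sequential, task_a)         # rises: A is partly forgotten
loss_a_joint = loss(w_joint, task_a)              # near zero: both can coexist
```

Fine-tuning on B moves the weights to the nearest point that fits B, which is not the point that fits both tasks, so the loss on A rises even though a shared solution exists.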
Why it hits production models hard
- Class imbalance shifts — rare legacy fraud types get drowned by volume from new payment rails.
- Feature space expansion — new categorical encodings or embedding columns change input geometry; old decision boundaries warp.
- Overfitting small update batches — a week of BNPL labels is tiny relative to years of card history; the optimizer memorizes the new slice.
- Shared representation collapse — deep encoders repurpose early layers for new patterns, destroying features old heads relied on.
Forgetting is related to but distinct from concept drift: drift means the world changed; forgetting means your update procedure threw away knowledge that is still valid. A model can suffer both at once.
Experience replay: mix old examples into new training
The most practical continual-learning family in industry is replay: keep a buffer of past examples (or synthetic stand-ins) and interleave them with new data during each update.
Reservoir and ring buffers
Store a fixed-capacity subset of historical rows: uniform random reservoir sampling for streaming data, or stratified sampling to preserve rare classes. (A ring buffer, by contrast, simply keeps the most recent rows and overwrites the oldest.) During each incremental update, every mini-batch is, say, 70% new-task data and 30% replay. Simple, interpretable, and often strong enough for tabular fraud and recommendation models.
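A minimal sketch of the uniform reservoir and the 70/30 batch mix described above. The class name `ReservoirBuffer`, the helper `mixed_batch`, and the row format are hypothetical; a production buffer would more likely live in a feature store than in a Python list.

```python
import random


class ReservoirBuffer:
    """Fixed-capacity uniform sample over a stream (classic reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rows = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, row):
        self.seen += 1
        if len(self.rows) < self.capacity:
            self.rows.append(row)
        else:
            # Keep the new row with probability capacity / seen.
            j = self._rng.randrange(self.seen)
            if j < self.capacity:
                self.rows[j] = row

    def sample(self, k):
        return self._rng.sample(self.rows, min(k, len(self.rows)))


def mixed_batch(new_rows, buffer, batch_size=10, replay_fraction=0.3, seed=0):
    """Build one mini-batch: mostly new-task rows plus a replay share."""
    rng = random.Random(seed)
    n_replay = min(round(batch_size * replay_fraction), len(buffer.rows))
    n_new = min(batch_size - n_replay, len(new_rows))
    batch = rng.sample(new_rows, n_new) + buffer.sample(n_replay)
    rng.shuffle(batch)
    return batch


# Hypothetical usage: 1,000 legacy rows streamed into a 100-slot buffer.
buffer = ReservoirBuffer(capacity=100)
for i in range(1000):
    buffer.add(("legacy", i))

new_rows = [("bnpl", i) for i in range(50)]
batch = mixed_batch(new_rows, buffer)
```

Uniform reservoirs preserve the stream's class proportions, so rare classes stay rare in the buffer; the stratified variant mentioned above addresses that.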
Generative replay
Train a generative model (GAN, VAE, or a small LLM for text) on past tasks; sample synthetic inputs to stand in for deleted raw data. Useful under strict privacy retention limits, but quality depends on the generator — bad synthetic replay can reinforce wrong decision boundaries.
Coreset selection
Instead of random replay, select a small coreset of past examples that best approximate the full old loss landscape (herding, gradient matching, k-center greedy in embedding space). Higher upfront compute, smaller buffer for the same retention quality.
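Of the selection strategies named above, k-center greedy is the easiest to sketch. The version below assumes each past example is already represented as an embedding tuple; the function name and the sample points are hypothetical.

```python
import math


def k_center_greedy(points, k, first=0):
    """Greedily pick k indices; each pick is the point farthest from all picks so far."""
    selected = [first]
    # Distance from every point to its nearest selected center.
    min_dist = [math.dist(p, points[first]) for p in points]
    while len(selected) < min(k, len(points)):
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Hypothetical embeddings: two tight clusters and one isolated point.
embeddings = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.1, 0.0), (5.0, 5.0)]
coreset = k_center_greedy(embeddings, k=3)
```

With three slots the selection covers both clusters and the isolated point rather than spending two slots on near-duplicates, which is the property that lets a coreset be smaller than a random buffer.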
Regularization: penalize changes to important weights
Regularization-based methods add a penalty that keeps parameters close to values that mattered for prior tasks, without storing raw data (appealing under GDPR-style deletion requirements, though replay hybrids usually win on accuracy).
Elastic Weight Consolidation (EWC)
After training on task A, estimate a Fisher information matrix (or diagonal approximation) indicating which weights were important for A. When training on task B, add a penalty $\lambda \sum_i F_i (\theta_i - \theta_i^*)^2$ pulling important weights back toward their post-A values $\theta_i^*$. The hyperparameter $\lambda$ trades plasticity against stability.
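As a rough sketch, the penalty and its gradient can be written in a few lines. This assumes the diagonal approximation, estimated as the mean of squared per-example gradients (an empirical Fisher), and uses the penalty exactly as written above; some references scale it by an extra factor of one half. All function names and numbers are hypothetical.

```python
def diagonal_fisher(per_example_grads):
    """Diagonal empirical Fisher: mean of squared per-example gradients."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(theta, theta_star, fisher, lam):
    """lam * sum_i F_i * (theta_i - theta_star_i)^2."""
    return lam * sum(f * (t - ts) ** 2 for t, ts, f in zip(theta, theta_star, fisher))


def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty, added to the task-B gradient at each step."""
    return [2 * lam * f * (t - ts) for t, ts, f in zip(theta, theta_star, fisher)]


# Hypothetical numbers: weight 0 mattered for task A, weight 1 did not.
grads_on_task_a = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(grads_on_task_a)      # [5.0, 0.0]
theta_star = [1.0, 1.0]                        # weights after task A
theta = [2.0, 3.0]                             # weights during task B

penalty = ewc_penalty(theta, theta_star, fisher, lam=0.4)
penalty_grad = ewc_penalty_grad(theta, theta_star, fisher, lam=0.4)
```

The second weight has zero Fisher value, so it moves freely for task B, while the first is pulled back toward its post-A value in proportion to its importance.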
Learning without Forgetting (LwF)
Keep a copy of the old model frozen. On new data, match the old model's soft outputs (logits) on shared inputs as a distillation loss while fitting new labels. No raw replay required, but needs representative inputs — often proxied by unlabeled traffic.
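The distillation term can be sketched as a cross-entropy between temperature-softened outputs of the frozen old model and the model being updated. The temperature, the weighting factor `alpha`, and the function names are assumptions for illustration, not values taken from the text.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(old_logits, new_logits, temperature=2.0):
    """Cross-entropy between softened old-model and new-model outputs."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return -sum(po * math.log(pn) for po, pn in zip(p_old, p_new))


def lwf_loss(new_task_loss, old_logits, new_logits, alpha=1.0, temperature=2.0):
    """New-task loss plus a weighted penalty for drifting from the old model."""
    return new_task_loss + alpha * distillation_loss(old_logits, new_logits, temperature)


# Hypothetical logits on one shared input.
old = [2.0, 0.0]
unchanged = distillation_loss(old, [2.0, 0.0])
drifted = distillation_loss(old, [0.0, 2.0])
```

The term is smallest when the updated model reproduces the old model's output distribution and grows as the two diverge, which is what discourages forgetting without replaying labeled rows.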
Synaptic Intelligence (SI) and memory-aware synapses
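Synaptic Intelligence accumulates estimates of parameter importance online, along the training trajectory, rather than computing them post-hoc. Memory-aware synapses instead estimate importance from how sensitive the model's outputs are to each parameter, which needs no labels. Both have lower overhead than full Fisher matrices and are useful for edge devices with many small task shifts.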
Architectural strategies: isolate capacity per task
When tasks are structurally different, sharing every layer may be the wrong inductive bias. Architectural continual learning adds or routes parameters instead of fighting over one shared set.
- Progressive networks — freeze columns trained on earlier tasks; add new columns for new tasks with lateral connections. No forgetting by construction; memory grows with task count.
- PackNet / masking — prune and mask weights assigned to old tasks; retrain free weights for new tasks. Fixed parameter budget; complex bookkeeping.
- Per-task adapters — freeze a foundation encoder; train small LoRA or bottleneck adapters per task or per time window. At inference, route to the correct adapter or ensemble adapters with a task classifier. Popular for LLM product lines that add features monthly.
- Dynamic expansion — grow network width or depth when validation loss on the new task plateaus under the current capacity.
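The per-task adapter pattern from the list above can be sketched with a frozen trunk and a lookup of small heads keyed by task ID. Everything here is a stand-in: `frozen_trunk` is a fixed ReLU rather than a real encoder, each "adapter" is a single linear layer, and the task names are hypothetical.

```python
def frozen_trunk(features):
    """Stand-in for a frozen encoder: fixed ReLU, never updated."""
    return [max(0.0, f) for f in features]


class AdapterRouter:
    """Routes each request to the adapter registered for its task ID."""

    def __init__(self, trunk):
        self.trunk = trunk
        self.adapters = {}

    def register(self, task_id, weights, bias=0.0):
        self.adapters[task_id] = (list(weights), bias)

    def score(self, task_id, features):
        if task_id not in self.adapters:
            # Fail loudly rather than guess which adapter applies.
            raise KeyError(f"no adapter registered for task {task_id!r}")
        weights, bias = self.adapters[task_id]
        embedding = self.trunk(features)
        return sum(w * e for w, e in zip(weights, embedding)) + bias


router = AdapterRouter(frozen_trunk)
router.register("card", [1.0, 0.0])
card_before = router.score("card", [2.0, -3.0])

# Adding a new task touches neither the trunk nor the existing adapter.
router.register("bnpl", [0.0, 1.0])
card_after = router.score("card", [2.0, -3.0])
bnpl_score = router.score("bnpl", [2.0, 5.0])
```

Because earlier adapters and the trunk are never modified, scores for earlier tasks cannot change when a new adapter is added; the cost is that every request needs a reliable task ID.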
How to evaluate continual learning
Accuracy on the latest task alone is misleading. Standard metrics:
- Average accuracy — mean performance across all tasks after training through $T_n$.
- Backward transfer — does learning task B improve (positive) or hurt (negative) performance on task A? Negative backward transfer is forgetting.
- Forward transfer — does prior training help or hinder learning task B, compared with training on B from scratch?
- Forgetting measure — max drop in task-k accuracy after subsequent updates, averaged over tasks.
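Three of these metrics can be computed from a single accuracy matrix. The sketch below assumes a matrix `acc` in which `acc[i][j]` is the accuracy on task j measured after training through task i, following common conventions in the continual-learning literature; the numbers are invented. Forward transfer is left out because it also needs a from-scratch baseline per task.

```python
def average_accuracy(acc):
    """Mean accuracy over all tasks after the final training stage."""
    final = acc[-1]
    return sum(final) / len(final)


def backward_transfer(acc):
    """Mean change on each earlier task between just-learned and final; negative = forgetting."""
    n = len(acc)
    return sum(acc[n - 1][j] - acc[j][j] for j in range(n - 1)) / (n - 1)


def forgetting(acc):
    """Mean over earlier tasks of (best accuracy ever reached - final accuracy)."""
    n = len(acc)
    drops = []
    for j in range(n - 1):
        best = max(acc[i][j] for i in range(j, n - 1))
        drops.append(best - acc[n - 1][j])
    return sum(drops) / len(drops)


# Hypothetical accuracies: rows = after training through task i, columns = task j.
acc = [
    [0.90, 0.10, 0.05],
    [0.80, 0.88, 0.12],
    [0.78, 0.84, 0.86],
]

avg = average_accuracy(acc)
bwt = backward_transfer(acc)
fgt = forgetting(acc)
```

In this made-up matrix, average accuracy is about 0.83 while backward transfer is -0.08, an example of a healthy-looking mean that hides a 12-point drop on the first task.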
In production, maintain a frozen evaluation suite per task or per time slice: stratified holdout sets that never enter training, replay buffers, or hyperparameter tuning. Log precision-recall on rare classes separately — mean metrics hide catastrophic collapse on legacy fraud types exactly like Harbor's skimming regression.
Worked example: Harbor Payments incremental fraud update
Starting point: Gradient-boosted tree ensemble (XGBoost) plus a shallow neural scorer on 240 engineered features; 0.91 recall / 0.62 precision on card skimming at 1% review rate.
Task sequence: (1) legacy card rails 2019–2024, (2) BNPL merchants Q1 2026, (3) crypto on-ramp Q2 2026.
Failed approach: Full retrain on all data monthly — week-four retrain after BNPL launch forgot skimming; investigation showed BNPL rows were 40% of new volume but only 3% of skimming labels.
Shipped approach:
- Stratified replay buffer — 50k rows capped, 30% reserved for legacy fraud subtypes regardless of volume; refreshed weekly from anonymized feature stores.
- EWC on neural scorer — diagonal Fisher from task-1 checkpoint; $\lambda = 0.4$ during BNPL and on-ramp fine-tunes.
- Separate BNPL adapter head — two-layer MLP on frozen trunk embeddings for rail-specific patterns; merged at inference via payment-rail routing metadata (no guesswork).
- Evaluation gates — deploy blocked if skimming recall on frozen 2024 holdout drops more than 2 points or BNPL recall is below 0.85.
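The reserved-quota buffer in the first bullet could be built roughly as follows. This is a sketch under assumptions: the row format, the `is_reserved` predicate, and the function name are hypothetical, and the example uses a 10-row cap instead of 50k to keep it readable.

```python
import random


def build_stratified_buffer(rows, capacity, reserved_fraction, is_reserved, seed=0):
    """Fill a capped buffer, holding back a share of slots for reserved rows."""
    rng = random.Random(seed)
    reserved_pool = [r for r in rows if is_reserved(r)]
    other_pool = [r for r in rows if not is_reserved(r)]
    # Reserved slots are filled regardless of how rare those rows are.
    n_reserved = min(round(capacity * reserved_fraction), len(reserved_pool))
    n_other = min(capacity - n_reserved, len(other_pool))
    return rng.sample(reserved_pool, n_reserved) + rng.sample(other_pool, n_other)


# Hypothetical history: 5 legacy-fraud rows among 1,000.
history = [{"id": i, "legacy_fraud": i < 5} for i in range(1000)]

replay = build_stratified_buffer(
    history,
    capacity=10,
    reserved_fraction=0.3,
    is_reserved=lambda row: row["legacy_fraud"],
)
legacy_in_buffer = sum(1 for row in replay if row["legacy_fraud"])
```

A uniform draw of 10 rows from this history would usually contain no legacy-fraud rows at all; the quota guarantees three.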
Outcome: Skimming recall 0.89 (within gate), BNPL recall 0.87, on-ramp recall 0.86. Incremental update trains in 4 hours vs 18 hours for full retrain; no regression on compliance sign-off suite.
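The evaluation gate amounts to a small predicate over holdout metrics. The sketch below uses the thresholds and recall figures quoted in this example; the function and parameter names are hypothetical, and the rounding step is an assumption added to keep floating-point noise from tripping the 2-point limit.

```python
def passes_gate(
    baseline_skimming_recall,
    candidate_skimming_recall,
    candidate_bnpl_recall,
    max_drop_points=2.0,
    min_bnpl_recall=0.85,
):
    """Return True only if the candidate clears both release thresholds."""
    # Convert the recall drop to percentage points; round away float noise.
    drop_points = round((baseline_skimming_recall - candidate_skimming_recall) * 100, 6)
    return drop_points <= max_drop_points and candidate_bnpl_recall >= min_bnpl_recall


# Shipped model: skimming 0.91 -> 0.89, BNPL 0.87.
shipped_ok = passes_gate(0.91, 0.89, 0.87)

# The earlier full retrain: skimming 0.91 -> 0.74, BNPL 0.88.
failed_retrain_ok = passes_gate(0.91, 0.74, 0.88)
```

The shipped model sits exactly on the 2-point limit and passes; the earlier retrain would have been blocked on skimming recall despite its strong BNPL number.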
Approach decision table
| Scenario | Recommended approach | Trade-off |
|---|---|---|
| Can store all data; retrain weekly is affordable | Periodic full batch retrain (baseline) | Simplest; highest accuracy if compute allows |
| Tabular classifier; moderate task count; some storage OK | Stratified experience replay | Easy to implement; buffer size tuning required |
| Strict data deletion; cannot keep raw rows | EWC, LwF, or generative replay | Weaker than replay hybrids; more hyperparameters |
| LLM product adding features monthly | Frozen backbone + per-task LoRA adapters | Clean routing; adapter proliferation at scale |
| Many heterogeneous tasks; memory not constrained | Progressive networks or multi-head MTL | No forgetting; serving complexity grows |
| Edge device; tiny task shifts | Synaptic Intelligence or small replay | Low overhead; limited capacity |
Common pitfalls
- Skipping the full-retrain baseline — continual methods add complexity; prove incremental beats batch on your metrics and SLA before committing.
- Unstratified replay — uniform sampling erases rare classes faster than the new task does; always stratify by label and time slice.
- Evaluating only on the latest task — ships regressions on legacy traffic silently.
- Mixing replay data into hyperparameter search — leaks old-task signal into tuning; hold a frozen audit set outside all training loops.
- Adapter sprawl without routing — ten LoRA modules and no reliable task ID at inference is an ops incident waiting to happen.
- Confusing drift with forgetting — if skimming patterns genuinely evolved, replay alone will not fix it; you need fresh labels and possibly new features.
- Mis-tuning EWC lambda — too high and the model cannot learn the new rail; too low and forgetting returns. Sweep it against both the new-task metric and the frozen legacy holdout.
Production checklist
- Define tasks or time slices explicitly; document what “one update” means for your product.
- Build frozen per-task evaluation suites before the first incremental update.
- Benchmark full retrain vs incremental on accuracy, training time, and dollar cost.
- If using replay, cap buffer size, stratify by class, and version the buffer schema with feature pipelines.
- Log backward transfer on every deploy candidate; block release on regression thresholds for critical classes.
- Version checkpoints before and after each task; support rollback without replaying training history.
- Document which weights are frozen, penalized (EWC), or expanded (adapters) in the model card.
- Align with data-retention policy: know whether replay rows, Fisher matrices, or generative models touch PII.
- Schedule periodic full retrains even with continual updates — replay buffers are a patch, not infinite memory.
- Pair with drift monitoring so you know when forgetting vs world-change is the root cause.
Key takeaways
- Continual learning updates models on streams — when full retrain is too slow, too costly, or data is gone.
- Catastrophic forgetting is an optimization artifact — new-task gradients overwrite weights that still matter for old tasks.
- Replay is the practical default — stratified buffers usually beat more exotic methods when you can store even a modest history.
- Regularization and adapters complement replay — EWC and LoRA per rail are production-proven patterns.
- Measure backward transfer — never ship incremental updates judged only on the latest product surface.
Related reading
- Transfer learning explained — one-shot source-to-target adaptation vs lifelong sequences
- Model drift and concept drift explained — when the world changes vs when the model forgets
- LoRA fine-tuning explained — parameter-efficient adapters for per-task updates
- Multi-task learning explained — joint training when all tasks are available together