Guide

Continual learning explained

Harbor Payments' card-fraud model scored 0.91 recall on 2024 Visa and Mastercard patterns. In Q1 2026 the product team added buy-now-pay-later rails and a crypto on-ramp — new merchant categories, velocity signatures, and fraud rings the old model had never seen. The obvious fix was a full retrain on the combined dataset. Recall on BNPL fraud climbed to 0.88, but recall on legacy card skimming patterns fell from 0.91 to 0.74. The model had catastrophically forgotten older attack types while fitting the new distribution. That failure mode is why continual learning exists: update a model on streaming tasks without erasing performance on prior ones. This guide explains catastrophic forgetting, replay-based and regularization-based methods, architecture strategies from progressive networks to per-task adapters, evaluation metrics for lifelong learning, a Harbor Payments incremental fraud update worked example, an approach decision table, common pitfalls, and a production checklist.

What continual learning is (and how it differs from batch retraining)

Continual learning (also called lifelong or incremental learning) trains a model on a sequence of tasks or data distributions T₁, T₂, …, T_n, updating parameters after each stage while preserving utility on earlier tasks. The production reality that motivates it:

New product surfaces introduce label spaces and feature distributions the model never saw at launch.
Storing and retraining on the entire historical corpus every week is expensive, slow, and may violate data-retention policies.
Full retrains in regulated domains require re-validation that blocks shipping for days.

Continual learning is not the same as transfer learning, which typically moves knowledge from a source task to a single target and stops. It is also not multi-task learning, where all tasks are available simultaneously during training. In continual learning, task T_k may arrive only after the training data for T_{k-1} is no longer accessible (the strict setting), or with limited access to old data (the relaxed setting most production teams actually operate in).

When you can retain all historical data and retrain from scratch on schedule, that is honest periodic batch retraining. It is often the right baseline. Continual methods earn their complexity when retrain latency, storage cost, or regulatory constraints make full refresh impractical — or when you need same-day adaptation to emerging fraud or abuse patterns.

Catastrophic forgetting: why new data erases old skills

A neural network stores everything it has learned in one shared set of parameters. When you fine-tune on task B alone, gradients push those weights toward a minimum that fits B, which is often far from any minimum that also fits task A. Performance on the earlier task collapses even though the network could represent both if trained jointly. That is catastrophic forgetting.
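A toy illustration of this effect, as a minimal sketch: two quadratic losses over two shared parameters stand in for tasks A and B. The loss functions, learning rate, and step counts are invented for illustration and do not come from any real fraud model.

```python
# Toy stand-ins for two tasks sharing parameters theta = [t0, t1].
# Both losses reach zero together at theta = [3, -1], so a joint solution exists.

def loss_a(theta):
    return (theta[0] + theta[1] - 2.0) ** 2

def grad_a(theta):
    g = 2.0 * (theta[0] + theta[1] - 2.0)
    return [g, g]

def loss_b(theta):
    return (theta[0] - 3.0) ** 2

def grad_b(theta):
    return [2.0 * (theta[0] - 3.0), 0.0]

def descend(theta, grad_fns, steps=500, lr=0.1):
    """Plain gradient descent on the sum of the given tasks' gradients."""
    theta = list(theta)
    for _ in range(steps):
        total = [sum(parts) for parts in zip(*(fn(theta) for fn in grad_fns))]
        theta = [t - lr * g for t, g in zip(theta, total)]
    return theta

after_a = descend([0.0, 0.0], [grad_a])         # fit task A first
sequential = descend(after_a, [grad_b])         # then fine-tune on B alone
joint = descend([0.0, 0.0], [grad_a, grad_b])   # train on both tasks at once

print("A loss after training on A:   ", loss_a(after_a))
print("A loss after fine-tuning on B:", loss_a(sequential))
print("A and B loss, joint training: ", loss_a(joint), loss_b(joint))
```

In this toy setup, fine-tuning on B alone leaves the task-A loss close to 4, while joint training drives both losses toward zero. The parameters could fit both tasks; the sequential update simply never looked for that point.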

Why it hits production models hard

Class imbalance shifts — rare legacy fraud types get drowned by volume from new payment rails.
Feature space expansion — new categorical encodings or embedding columns change input geometry; old decision boundaries warp.
Overfitting small update batches — a week of BNPL labels is tiny relative to years of card history; the optimizer memorizes the new slice.
Shared representation collapse — deep encoders repurpose early layers for new patterns, destroying features old heads relied on.

Forgetting is related to but distinct from concept drift: drift means the world changed; forgetting means your update procedure threw away knowledge that is still valid. A model can suffer both at once.

Experience replay: mix old examples into new training

The most practical continual-learning family in industry is replay: keep a buffer of past examples (or synthetic stand-ins) and interleave them with new data during each update.

Reservoir and ring buffers

Store a fixed-capacity subset of historical rows — uniform random reservoir sampling for streaming data, or stratified sampling to preserve rare classes. Each incremental training epoch draws a mini-batch that is, say, 70% new-task data and 30% replay. Simple, interpretable, and often strong enough for tabular fraud and recommendation models.
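A minimal sketch of the uniform reservoir and the 70/30 mix described above. `ReservoirBuffer` and `mixed_batch` are hypothetical names, rows are arbitrary Python objects, and the stratified variant is not shown here.

```python
import random

class ReservoirBuffer:
    """Fixed-capacity uniform sample of a stream (classic reservoir sampling)."""

    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.rows = []
        self.seen = 0
        self._rng = random.Random(seed)

    def add(self, row):
        self.seen += 1
        if len(self.rows) < self.capacity:
            self.rows.append(row)
            return
        # Keep the new row with probability capacity / seen,
        # overwriting a uniformly chosen stored row.
        j = self._rng.randrange(self.seen)
        if j < self.capacity:
            self.rows[j] = row

    def sample(self, k):
        return self._rng.sample(self.rows, min(k, len(self.rows)))


def mixed_batch(new_rows, buffer, rng, batch_size=10, replay_fraction=0.3):
    """Draw one mini-batch: replay_fraction from the buffer, the rest from new data."""
    n_replay = min(round(batch_size * replay_fraction), len(buffer.rows))
    n_new = min(batch_size - n_replay, len(new_rows))
    batch = rng.sample(new_rows, n_new) + buffer.sample(n_replay)
    rng.shuffle(batch)
    return batch


buffer = ReservoirBuffer(capacity=100)
for i in range(10_000):
    buffer.add(("legacy", i))

new_rows = [("bnpl", i) for i in range(50)]
batch = mixed_batch(new_rows, buffer, random.Random(1))
```

The buffer never grows past its capacity, and every row seen so far has the same chance of being in it. That uniformity is also the weakness: rare classes appear in the buffer only in proportion to their historical volume.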

Generative replay

Train a generative model (GAN, VAE, or a small LLM for text) on past tasks; sample synthetic inputs to stand in for deleted raw data. Useful under strict privacy retention limits, but quality depends on the generator — bad synthetic replay can reinforce wrong decision boundaries.
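To make the idea concrete without a GAN or VAE, the sketch below swaps in the simplest possible generator: one independent Gaussian per feature per class. This is a deliberately crude stand-in, not a recommended production generator; `GaussianReplayGenerator` and the toy rows are hypothetical.

```python
import random
import statistics
from collections import defaultdict

class GaussianReplayGenerator:
    """Per-class, per-feature Gaussian fit; a simple stand-in for a GAN or VAE."""

    def __init__(self):
        self.params = {}  # label -> list of (mean, stdev), one pair per feature

    def fit(self, rows, labels):
        by_label = defaultdict(list)
        for row, label in zip(rows, labels):
            by_label[label].append(row)
        for label, group in by_label.items():
            columns = list(zip(*group))
            self.params[label] = [
                (statistics.fmean(col), statistics.pstdev(col)) for col in columns
            ]
        return self

    def sample(self, label, n, rng):
        """Draw n synthetic rows for one class; features are sampled independently."""
        return [
            [rng.gauss(mu, sigma) for mu, sigma in self.params[label]]
            for _ in range(n)
        ]


# Toy data: two numeric features per row, invented for illustration.
generator = GaussianReplayGenerator().fit(
    rows=[[1.0, 10.0], [3.0, 14.0], [0.0, 0.0]],
    labels=["skimming", "skimming", "legit"],
)
synthetic = generator.sample("skimming", 1000, random.Random(0))
```

Only the fitted means and standard deviations need to be retained after the raw rows are deleted. Because this sketch ignores correlations between features, it also shows the risk named above: a generator that misses structure in the old data will replay the wrong boundaries.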

Coreset selection

Instead of random replay, select a small coreset of past examples that best approximate the full old loss landscape (herding, gradient matching, k-center greedy in embedding space). Higher upfront compute, smaller buffer for the same retention quality.
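Of the selection strategies listed, k-center greedy is the easiest to sketch: repeatedly pick the point farthest from everything already chosen. The version below assumes embeddings are plain tuples of floats and uses Euclidean distance; `k_center_greedy` is a hypothetical helper name, and the quadratic loop is only suitable for small candidate pools.

```python
import math

def k_center_greedy(points, k):
    """Return indices of up to k points picked by farthest-point selection."""
    if not points or k <= 0:
        return []
    chosen = [0]  # arbitrary but deterministic first center
    # Distance from every point to its nearest chosen center so far.
    nearest = [math.dist(p, points[0]) for p in points]
    while len(chosen) < min(k, len(points)):
        idx = max(range(len(points)), key=nearest.__getitem__)
        if nearest[idx] == 0.0:
            break  # every remaining point duplicates a chosen center
        chosen.append(idx)
        for i, p in enumerate(points):
            nearest[i] = min(nearest[i], math.dist(p, points[idx]))
    return chosen


# Two tight clusters plus one outlier, invented for illustration.
embeddings = [(0.0, 0.0), (0.1, 0.0), (10.0, 0.0), (10.0, 0.1), (5.0, 5.0)]
coreset = k_center_greedy(embeddings, k=3)
```

With three slots, the selection takes one point from each cluster and the outlier rather than spending two slots on near-duplicates, which is the property that lets a coreset stay smaller than a random buffer.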

Regularization: penalize changes to important weights

Regularization-based methods add a penalty that keeps parameters close to values that mattered for prior tasks, without storing raw data (appealing under GDPR-style deletion requirements, though replay hybrids usually win on accuracy).

Elastic Weight Consolidation (EWC)

After training on task A, estimate a Fisher information matrix (or its diagonal approximation) indicating which weights were important for A. When training on task B, add a penalty λ Σ_i F_i (θ_i - θ_i^*)² to the task-B loss, where F_i is the Fisher estimate for weight i and θ_i^* is that weight's value after training on A. The penalty pulls important weights back toward their post-A values. The hyperparameter λ trades plasticity against stability.
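The penalty is short enough to write out directly. This sketch assumes the diagonal Fisher is estimated as the mean squared per-example gradient of the task-A log-likelihood (an empirical-Fisher approximation) and that parameters are flat lists of floats; the function names and toy numbers are hypothetical.

```python
def diagonal_fisher(per_example_grads):
    """Mean squared gradient per parameter, used as a diagonal Fisher estimate."""
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]

def ewc_penalty(theta, theta_star, fisher, lam):
    """lam * sum_i F_i * (theta_i - theta_star_i)^2, added to the task-B loss."""
    return lam * sum(f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher))

def ewc_penalty_grad(theta, theta_star, fisher, lam):
    """Gradient of the penalty, added to the task-B gradient at each step."""
    return [2.0 * lam * f * (t - s) for t, s, f in zip(theta, theta_star, fisher)]


# Toy numbers: the first weight mattered for task A, the second did not.
fisher = diagonal_fisher([[1.0, 0.0], [3.0, 0.0]])
penalty = ewc_penalty([1.0, 1.0], [0.0, 0.0], fisher, lam=0.4)
pull = ewc_penalty_grad([1.0, 1.0], [0.0, 0.0], fisher, lam=0.4)
```

Both weights have moved the same distance from their post-A values, but only the first one is penalized, because the second has a Fisher estimate of zero and is free to adapt to the new task.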

Learning without Forgetting (LwF)

Keep a copy of the old model frozen. On new data, match the old model's soft outputs (logits) on shared inputs as a distillation loss while fitting new labels. No raw replay required, but needs representative inputs — often proxied by unlabeled traffic.
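A minimal sketch of the combined loss, assuming the distillation term is a temperature-softened cross-entropy between the frozen model's outputs and the new model's outputs on the same input. The weighting `alpha`, the temperature, and the function names are assumptions for illustration.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(old_logits, new_logits, temperature=2.0):
    """Cross-entropy from the frozen model's softened outputs to the new model's."""
    p_old = softmax(old_logits, temperature)
    p_new = softmax(new_logits, temperature)
    return -sum(p * math.log(q) for p, q in zip(p_old, p_new))

def lwf_loss(new_task_loss, old_logits, new_logits, alpha=1.0, temperature=2.0):
    """New-task loss plus a weighted penalty for drifting from the old model."""
    return new_task_loss + alpha * distillation_loss(old_logits, new_logits, temperature)


# Toy logits on the old label space: one new model agrees with the old one, one does not.
old = [2.0, 0.0]
agrees = distillation_loss(old, [2.0, 0.0])
disagrees = distillation_loss(old, [0.0, 2.0])
```

The distillation term is smallest when the new model reproduces the old model's output distribution, so `agrees` comes out lower than `disagrees`. The inputs fed to both models need no labels, which is why unlabeled traffic can serve as the proxy.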

Synaptic Intelligence (SI) and memory-aware synapses

These methods estimate parameter importance online during training, updating the estimate at every step rather than computing it post hoc once a task ends. They carry lower overhead than full Fisher matrices and are useful for edge devices with many small task shifts.
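As a rough sketch of the Synaptic Intelligence bookkeeping: each parameter accumulates the negative gradient times its own step during training, and at the end of a task that running sum is normalized by the squared total movement plus a small damping term. The class name, the damping value `xi`, and the one-parameter toy loss are assumptions for illustration.

```python
class SynapticImportance:
    """Online per-parameter importance in the style of Synaptic Intelligence."""

    def __init__(self, theta, xi=1e-3):
        self.xi = xi                            # damping to avoid division by zero
        self.start = list(theta)                # parameters when the current task began
        self.omega = [0.0] * len(theta)         # running contribution for this task
        self.importance = [0.0] * len(theta)    # consolidated over finished tasks

    def record_step(self, grad, theta_before, theta_after):
        # A step that lowers the loss has -grad * delta > 0 for that parameter.
        for i, g in enumerate(grad):
            self.omega[i] += -g * (theta_after[i] - theta_before[i])

    def end_task(self, theta):
        for i, t in enumerate(theta):
            moved = t - self.start[i]
            self.importance[i] += self.omega[i] / (moved ** 2 + self.xi)
        self.start = list(theta)
        self.omega = [0.0] * len(theta)


# Toy task: the loss 0.5 * (theta[0] - 1)^2 depends only on the first parameter.
theta = [0.0, 0.0]
tracker = SynapticImportance(theta)
for _ in range(50):
    grad = [theta[0] - 1.0, 0.0]
    updated = [t - 0.1 * g for t, g in zip(theta, grad)]
    tracker.record_step(grad, theta, updated)
    theta = updated
tracker.end_task(theta)
```

After the task, only the first parameter has nonzero importance. The consolidated values can then weight a quadratic penalty in later tasks, much as the Fisher estimate does in EWC, without a separate pass over the old data.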

Architectural strategies: isolate capacity per task

When tasks are structurally different, sharing every layer may be the wrong inductive bias. Architectural continual learning adds or routes parameters instead of fighting over one shared set.

Progressive networks — freeze columns trained on earlier tasks; add new columns for new tasks with lateral connections. No forgetting by construction; memory grows with task count.
PackNet / masking — prune and mask weights assigned to old tasks; retrain free weights for new tasks. Fixed parameter budget; complex bookkeeping.
Per-task adapters — freeze a foundation encoder; train small LoRA or bottleneck adapters per task or per time window. At inference, route to the correct adapter or ensemble adapters with a task classifier. Popular for LLM product lines that add features monthly.
Dynamic expansion — grow network width or depth when validation loss on the new task plateaus under the current capacity.
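For the per-task adapter pattern, the serving-side logic is mostly routing. The sketch below assumes the task identity arrives as trusted metadata and that the trunk and heads are plain callables; `AdapterRouter` and the toy functions are hypothetical stand-ins for real model components.

```python
class AdapterRouter:
    """Frozen shared trunk plus one small head per task, selected by task ID."""

    def __init__(self, trunk):
        self._trunk = trunk   # frozen: never updated after the first task
        self._heads = {}      # task ID -> adapter head

    def register(self, task_id, head):
        self._heads[task_id] = head

    def score(self, features, task_id):
        if task_id not in self._heads:
            # Fail loudly rather than guess which adapter applies.
            raise KeyError(f"no adapter registered for task {task_id!r}")
        embedding = self._trunk(features)
        return self._heads[task_id](embedding)


# Toy trunk and heads, invented for illustration.
router = AdapterRouter(trunk=lambda features: [sum(features)])
router.register("card", lambda emb: 0.1 * emb[0])
router.register("bnpl", lambda emb: 0.5 * emb[0])

card_score = router.score([1.0, 1.0], "card")
bnpl_score = router.score([1.0, 1.0], "bnpl")
```

Adding a task means registering one more head; existing heads and the trunk are untouched, so earlier tasks cannot regress. Raising on an unknown task ID is one design choice; falling back to a default head is another, at the cost of silently scoring traffic with the wrong adapter.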

How to evaluate continual learning

Accuracy on the latest task alone is misleading. Standard metrics:

Average accuracy: mean performance across all tasks after training through T_n.
Backward transfer: does learning task B improve (positive) or hurt (negative) performance on task A? Negative backward transfer is forgetting.
Forward transfer: does prior training on earlier tasks help or hinder learning task B, compared with training on B from scratch?
Forgetting measure: for each earlier task, the drop from its best accuracy at any previous stage to its accuracy after the latest update, averaged over tasks.

In production, maintain a frozen evaluation suite per task or per time slice: stratified holdout sets that never enter training, replay buffers, or hyperparameter tuning. Log precision-recall on rare classes separately — mean metrics hide catastrophic collapse on legacy fraud types exactly like Harbor's skimming regression.
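These metrics are usually computed from a score matrix R, where R[i][j] is the score on task j after training through task i. The sketch below assumes at least two tasks and only reads entries with i ≥ j; the function names are hypothetical, and the example reuses the recall figures from the Harbor retrain described in this guide in place of accuracy.

```python
def average_accuracy(R):
    """Mean score across all tasks after the final update (last row of R)."""
    final = R[-1]
    return sum(final) / len(final)

def backward_transfer(R):
    """Mean change on each earlier task between just after learning it and the end."""
    n = len(R)
    return sum(R[-1][j] - R[j][j] for j in range(n - 1)) / (n - 1)

def forgetting(R):
    """Mean gap between each earlier task's best past score and its final score."""
    n = len(R)
    drops = []
    for j in range(n - 1):
        best_past = max(R[i][j] for i in range(j, n - 1))
        drops.append(best_past - R[-1][j])
    return sum(drops) / len(drops)


# Rows: after card-only training, after the BNPL retrain.
# Columns: skimming recall, BNPL recall. None marks a score that was never measured.
R = [
    [0.91, None],
    [0.74, 0.88],
]
print(average_accuracy(R), backward_transfer(R), forgetting(R))
```

The average of roughly 0.81 looks tolerable on its own, while backward transfer of about -0.17 exposes the skimming regression directly. That gap is the reason to log these numbers per task rather than as a single mean.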

Worked example: Harbor Payments incremental fraud update

Starting point: Gradient-boosted tree ensemble (XGBoost) plus a shallow neural scorer on 240 engineered features; 0.91 recall / 0.62 precision on card skimming at 1% review rate.

Task sequence: (1) legacy card rails 2019–2024, (2) BNPL merchants Q1 2026, (3) crypto on-ramp Q2 2026.

Failed approach: Full retrain on all data monthly — week-four retrain after BNPL launch forgot skimming; investigation showed BNPL rows were 40% of new volume but only 3% of skimming labels.

Shipped approach:

Stratified replay buffer — 50k rows capped, 30% reserved for legacy fraud subtypes regardless of volume; refreshed weekly from anonymized feature stores.
EWC on neural scorer — diagonal Fisher from task-1 checkpoint; λ=0.4 during BNPL and on-ramp fine-tunes.
Separate BNPL adapter head — two-layer MLP on frozen trunk embeddings for rail-specific patterns; merged at inference via payment-rail routing metadata (no guesswork).
Evaluation gates — deploy blocked if skimming recall on frozen 2024 holdout drops more than 2 points or BNPL recall is below 0.85.
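The buffer split and the deploy gate above reduce to a few lines of arithmetic. This sketch assumes "2 points" means 0.02 of absolute recall and adds a small tolerance so that floating-point rounding does not block a candidate sitting exactly on the threshold; `buffer_quotas` and `passes_gates` are hypothetical names, not Harbor's code.

```python
def buffer_quotas(capacity=50_000, legacy_share=0.30):
    """Split the replay buffer into a reserved legacy-fraud slice and the rest."""
    legacy = round(capacity * legacy_share)
    return {"legacy_fraud": legacy, "other": capacity - legacy}

def passes_gates(baseline_skimming_recall, skimming_recall, bnpl_recall,
                 max_drop=0.02, bnpl_floor=0.85, tol=1e-9):
    """True when the candidate may deploy under both recall gates."""
    drop = baseline_skimming_recall - skimming_recall
    skimming_ok = drop <= max_drop + tol   # block only if the drop exceeds 2 points
    bnpl_ok = bnpl_recall >= bnpl_floor - tol
    return skimming_ok and bnpl_ok


quotas = buffer_quotas()
shipped = passes_gates(0.91, 0.89, 0.87)        # recall figures from the shipped update
full_retrain = passes_gates(0.91, 0.74, 0.88)   # recall figures from the failed retrain
```

With the numbers quoted in this guide, the shipped update passes and the original full retrain is blocked on skimming recall. The tolerance matters in practice: 0.91 - 0.89 is not exactly 0.02 in binary floating point, so a strict comparison would reject a candidate that meets the gate.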

Outcome: Skimming recall 0.89 (within gate), BNPL recall 0.87, on-ramp recall 0.86. Incremental update trains in 4 hours vs 18 hours for full retrain; no regression on compliance sign-off suite.

Approach decision table

Scenario	Recommended approach	Trade-off
Can store all data; retrain weekly is affordable	Periodic full batch retrain (baseline)	Simplest; highest accuracy if compute allows
Tabular classifier; moderate task count; some storage OK	Stratified experience replay	Easy to implement; buffer size tuning required
Strict data deletion; cannot keep raw rows	EWC, LwF, or generative replay	Weaker than replay hybrids; more hyperparameters
LLM product adding features monthly	Frozen backbone + per-task LoRA adapters	Clean routing; adapter proliferation at scale
Many heterogeneous tasks; memory not constrained	Progressive networks or multi-head MTL	No forgetting; serving complexity grows
Edge device; tiny task shifts	Synaptic Intelligence or small replay	Low overhead; limited capacity

Common pitfalls

Skipping the full-retrain baseline: continual methods add complexity; prove incremental beats batch on your metrics and SLA before committing.
Unstratified replay: uniform sampling fills the buffer in proportion to volume, so rare classes end up under-represented or missing entirely; always stratify by label and time slice.
Evaluating only on the latest task: ships regressions on legacy traffic silently.
Mixing replay data into hyperparameter search: rows the model trains on also steer tuning, so old-task scores look better than they are; hold a frozen audit set outside all training loops.
Adapter sprawl without routing: ten LoRA modules and no reliable task ID at inference is an ops incident waiting to happen.
Confusing drift with forgetting: if skimming patterns genuinely evolved, replay alone will not fix it; you need fresh labels and possibly new features.
Mis-tuning EWC lambda: too high and the model cannot learn the new rail; too low and forgetting returns.

Production checklist

Define tasks or time slices explicitly; document what “one update” means for your product.
Build frozen per-task evaluation suites before the first incremental update.
Benchmark full retrain vs incremental on accuracy, training time, and dollar cost.
If using replay, cap buffer size, stratify by class, and version the buffer schema with feature pipelines.
Log backward transfer on every deploy candidate; block release on regression thresholds for critical classes.
Version checkpoints before and after each task; support rollback without replaying training history.
Document which weights are frozen, penalized (EWC), or expanded (adapters) in the model card.
Align with data-retention policy: know whether replay rows, Fisher matrices, or generative models touch PII.
Schedule periodic full retrains even with continual updates — replay buffers are a patch, not infinite memory.
Pair with drift monitoring so you know when forgetting vs world-change is the root cause.

Key takeaways

Continual learning updates models on streams: it pays off when a full retrain is too slow, too costly, or the old data is gone.
Catastrophic forgetting is an optimization artifact: new-task gradients overwrite weights that still matter for old tasks.
Replay is the practical default: stratified buffers often match or beat more exotic methods when you can store even a modest history.
Regularization and adapters complement replay: EWC and per-rail adapters pair well with a replay buffer, as in the Harbor example.
Measure backward transfer: never ship incremental updates judged only on the latest product surface.