LLM catastrophic forgetting and continual learning explained

Harbor Legal shipped a second fine-tuning pass on its 7B contract-classifier to handle SaaS master-service agreements — 4,200 new labeled examples, three epochs of full-weight SFT. Accuracy on the new doc type reached 94%. But NDA clause extraction on the original training distribution dropped from 91% to 71%, and indemnity-cap spotting on legacy vendor templates fell 14 points. Offline eval on the new task looked great; nobody ran retention tests on old tasks until a paralegal flagged systematic misses two weeks later. The team rebuilt training around continual learning: a 20% replay buffer of stratified legacy examples, task-specific LoRA adapters with intent-based routing, and a monthly retention gate in CI. NDA accuracy recovered to 89% while SaaS performance held at 93%.

Catastrophic forgetting is when a model loses competence on previously learned tasks after training on new data. It is the default outcome of naive sequential fine-tuning, not an edge case. Continual learning is the family of techniques that let models absorb new knowledge without erasing old skills — replay mixing, elastic weight consolidation (EWC), adapter isolation, regularization toward a frozen base, and disciplined evaluation. This guide explains the plasticity–stability tradeoff, why LLMs forget, practical mitigation stacks for production teams, the Harbor Legal refactor, a technique decision table versus full retrain and RAG-only updates, pitfalls, and a checklist.

What catastrophic forgetting is and why LLMs are vulnerable

Neural networks optimize a single set of shared weights for whatever loss function you present. When you fine-tune on task B, gradient updates that improve B often move weights away from configurations that encoded task A. In small models this was called catastrophic interference; in LLMs the effect is subtler but real: the model may still answer general questions fine while silently degrading on narrow domains you previously optimized.

Why forgetting is worse than it looks in headline benchmarks

Shared representation collapse — attention heads and MLP layers reused across tasks compete for capacity; strong gradients on frequent new examples dominate rare legacy patterns.
Distribution shift masquerading as forgetting — sometimes the weights are fine and it is the prompt template or retrieval corpus that changed; always bisect tooling before blaming weights.
Full-weight SFT amplifies drift — updating all parameters moves the entire representation; parameter-efficient methods constrain the damage but do not eliminate it without replay or routing.
Long-tail skills disappear first — rare clause types, low-resource languages, and edge-case tool formats are overwritten before core fluency degrades, making forgetting invisible to aggregate perplexity scores.
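
A small worked example makes the last point concrete. The slice names, sizes, and accuracies below are invented for illustration (they are not Harbor Legal data); the sketch only shows how an example-weighted aggregate can barely move while a rare slice collapses.

```python
def aggregate_accuracy(slices):
    """Example-weighted accuracy over slices given as name -> (n_examples, accuracy)."""
    total = sum(n for n, _ in slices.values())
    return sum(n * acc for n, acc in slices.values()) / total


# Hypothetical eval results before and after a new fine-tuning pass.
before = {"common_clauses": (9500, 0.95), "rare_carve_outs": (500, 0.90)}
after = {"common_clauses": (9500, 0.95), "rare_carve_outs": (500, 0.50)}

agg_before = aggregate_accuracy(before)
agg_after = aggregate_accuracy(after)

print(f"aggregate: {agg_before:.4f} -> {agg_after:.4f}")
for name in before:
    print(f"{name}: {before[name][1]:.2f} -> {after[name][1]:.2f}")
```

Under these assumed numbers the aggregate slips from 0.9475 to 0.9275, about two points, while the rare slice loses forty. A dashboard that reports only the aggregate would not flag it.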

The inverse problem, deliberately removing the influence of specific training data, is machine unlearning. Catastrophic forgetting is, in effect, unlearning you did not plan for.

The plasticity–stability tradeoff

Every update faces two pressures:

Plasticity — the model must change enough to learn new patterns (new contract types, updated policies, fresh product catalog).
Stability — the model must preserve weights and behaviors that encode prior tasks.

Naive fine-tuning maximizes plasticity with no stability constraint. Freezing the base model and training only adapters maximizes stability but can underfit genuinely new representations. Production continual learning picks a point on this curve deliberately rather than by accident.

When to worry about forgetting

Sequential fine-tuning passes on the same base (monthly domain updates).
Multi-tenant customization where one customer’s adapter must not leak into another’s behavior.
Regulated domains where prior compliance behaviors must persist after policy updates.
Tool-calling agents that gain new functions but must keep old API schemas reliable.

Continual learning techniques that work in practice

Replay buffers (experience replay)

Mix a fixed percentage of exemplars from prior tasks into every new training batch — typically 10–30% depending on task similarity. Store stratified slices, not random samples: cover each legacy label, locale, and document template at least once. Replay is the highest-ROI mitigation for most teams because it is simple, debuggable, and compatible with standard SFT pipelines.
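
One way to sketch stratified replay mixing, assuming examples are plain Python objects and a caller-supplied `key` function returns the stratum (label, locale, or template). The function names are hypothetical, and the sketch mixes at the dataset level and shuffles, so each batch hits the target ratio only in expectation; a per-batch sampler would enforce it exactly.

```python
import random
from collections import defaultdict


def stratified_replay(legacy, key, n_total, rng):
    """Pick about n_total legacy examples, covering every stratum at least once."""
    groups = defaultdict(list)
    for ex in legacy:
        groups[key(ex)].append(ex)
    # One guaranteed exemplar per stratum so rare classes cannot be sampled away.
    picked = [rng.choice(items) for items in groups.values()]
    picked_ids = {id(ex) for ex in picked}
    rest = [ex for ex in legacy if id(ex) not in picked_ids]
    # Fill the remaining budget uniformly; if strata outnumber the budget,
    # the result is larger than n_total rather than dropping a stratum.
    extra = max(0, min(n_total - len(picked), len(rest)))
    return picked + rng.sample(rest, extra)


def mix_with_replay(new_examples, legacy, key, replay_fraction=0.2, seed=0):
    """Return a shuffled training set in which roughly replay_fraction is legacy data."""
    rng = random.Random(seed)
    # Solve r / (n_new + r) = replay_fraction for r.
    n_replay = round(len(new_examples) * replay_fraction / (1 - replay_fraction))
    mixed = list(new_examples) + stratified_replay(legacy, key, n_replay, rng)
    rng.shuffle(mixed)
    return mixed
```

If the replay share is measured against the combined set, 4,200 new examples at a 20% share call for 1,050 legacy examples (1,050 of 5,250). Whether the ratio is defined that way or relative to the new data alone is a choice to state explicitly in the training config.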

Elastic Weight Consolidation (EWC)

After training on task A, estimate which parameters matter most for A (via Fisher information). When training on task B, add a penalty term that discourages large moves on those parameters. EWC helps when replay storage is limited but adds hyperparameters and compute; it pairs well with LoRA when the penalty applies to adapter weights only.
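
The penalty described above is commonly written as (λ/2) · Σ F_i (θ_i − θ_A,i)², where θ_A are the weights after task A and F_i is a diagonal Fisher estimate. The following is a framework-free sketch using plain lists of floats; the function names are illustrative, and a real pipeline would compute the same quantities on tensors inside the training loop.

```python
def diagonal_fisher(per_example_grads):
    """Approximate the diagonal Fisher as the mean squared gradient per parameter.

    per_example_grads: list of gradient vectors (one per task-A example),
    each a list of floats of the same length.
    """
    n = len(per_example_grads)
    dim = len(per_example_grads[0])
    return [sum(g[i] ** 2 for g in per_example_grads) / n for i in range(dim)]


def ewc_penalty(params, anchor_params, fisher, lam):
    """(lam / 2) * sum_i F_i * (theta_i - theta_A_i)^2."""
    return 0.5 * lam * sum(
        f * (p - a) ** 2 for p, a, f in zip(params, anchor_params, fisher)
    )


def total_loss(task_b_loss, params, anchor_params, fisher, lam=1.0):
    """Task-B loss plus the EWC penalty anchoring parameters that mattered for task A."""
    return task_b_loss + ewc_penalty(params, anchor_params, fisher, lam)
```

Parameters with near-zero Fisher values can move freely, which is where the plasticity comes from; `lam` is the extra hyperparameter the paragraph above warns about.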

Task-specific adapters and routing

Train a separate LoRA adapter per task or customer, keep the base frozen, and route requests by intent classifier, metadata tag, or tenant ID. Forgetting across tasks drops to near zero because weights never co-train. Cost: serving complexity, router accuracy, and adapter proliferation governance.
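
A minimal routing sketch, assuming an upstream intent classifier that returns a score per adapter and an optional table of tenant-pinned adapters. The adapter IDs, the `"base"` sentinel, and the 0.7 confidence threshold are hypothetical placeholders to tune on a held-out routing set.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RouteDecision:
    adapter_id: str  # adapter to load, or "base" to serve the frozen base model
    reason: str      # kept for logging and routing audits


def route(tenant_adapters, tenant_id, intent_scores, min_confidence=0.7):
    """Choose an adapter: tenant pin first, then confident intent, else the base model."""
    if tenant_id in tenant_adapters:
        return RouteDecision(tenant_adapters[tenant_id], "tenant pin")
    if intent_scores:
        adapter_id, score = max(intent_scores.items(), key=lambda kv: kv[1])
        if score >= min_confidence:
            return RouteDecision(adapter_id, f"intent score {score:.2f}")
    # Unknown or ambiguous documents are not forced through a specialized adapter.
    return RouteDecision("base", "low routing confidence")


decision = route({}, None, {"saas_lora": 0.91, "legacy_lora": 0.05})
print(decision.adapter_id, "-", decision.reason)
```

Logging the `reason` field alongside the decision makes mis-routes easier to audit when new document types start arriving.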

Regularization toward the base checkpoint

Add a KL or L2 penalty pulling the fine-tuned distribution (or weights) toward the frozen base model. Lightweight and common in RLHF; alone it often under-constrains full SFT but works as a stabilizer combined with replay.
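
As a rough illustration of the KL variant, the sketch below adds `beta * KL(tuned || base)` for a single token position, computed from raw logits. The direction of the KL, the `beta` value, and the function names are assumptions made for the example; real training code would average this over positions and batches on tensors.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0
    )


def regularized_loss(task_loss, tuned_logits, base_logits, beta=0.1):
    """Task loss plus a penalty for drifting away from the frozen base distribution."""
    penalty = kl_divergence(softmax(tuned_logits), softmax(base_logits))
    return task_loss + beta * penalty
```

When the tuned and base logits agree, the penalty is zero and the loss reduces to the task loss; the further the tuned distribution drifts, the more `beta` costs.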

Periodic full retrain on merged data

When task count stays small and data volume is manageable, merge all historical labeled data and retrain from the original base on each major release. Because every task is present in the same training run, there is no sequential forgetting to mitigate, although tasks can still compete for capacity; cost scales with total data. Many legal, medical, and finance teams prefer this straightforward approach over clever incremental tricks until data volume forces otherwise.

Model merging (SLERP / TIES)

Train task-specific adapters or checkpoints independently, then merge weights with interpolation or conflict-resolution algorithms. Useful for open-weight models; less common in proprietary API fine-tuning but growing for on-prem deployments.
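
For intuition, here is a sketch of SLERP applied to two flattened weight vectors, with a linear-interpolation fallback when the vectors are nearly parallel. It is a simplified illustration on plain lists, not the implementation of any particular merging tool, and it does not cover TIES-style conflict resolution.

```python
import math


def slerp(a, b, t, eps=1e-8):
    """Spherical linear interpolation between weight vectors a and b at fraction t."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    if norm_a < eps or norm_b < eps:
        # A zero vector has no direction; interpolate linearly instead.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    cos_omega = sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)
    cos_omega = max(-1.0, min(1.0, cos_omega))  # guard against rounding error
    omega = math.acos(cos_omega)
    sin_omega = math.sin(omega)
    if sin_omega < eps:
        # Nearly parallel or anti-parallel: the SLERP weights are ill-conditioned.
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    weight_a = math.sin((1 - t) * omega) / sin_omega
    weight_b = math.sin(t * omega) / sin_omega
    return [weight_a * x + weight_b * y for x, y in zip(a, b)]
```

In practice the interpolation is usually applied per tensor rather than to the whole model at once, and a merged checkpoint still needs to pass the same retention suite as any other training output.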

Evaluation: retention tests are not optional

A benchmark on the new task alone proves plasticity, not stability. Every fine-tuning pipeline needs a retention suite: frozen held-out sets from each prior task, run before and after every training pass. Block promotion if any legacy task drops more than an agreed threshold (e.g., 3 absolute points on F1).

Per-task dashboards — never report only aggregate accuracy; slice by document type, language, and tool schema version.
Pair with online eval — wire retention metrics into online evaluation so production drift on legacy flows triggers rollback.
Regression in CI — treat retention suites like unit tests; fail builds when new adapter merges degrade old tasks.
Human spot checks on tail classes — automated metrics miss qualitative tone and formatting regressions on rare templates.
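
The promotion gate can be sketched as a small function that compares per-slice scores before and after a training pass. The slice names and the `vendor_msa` score are hypothetical; the 3-point threshold follows the example threshold mentioned above, and scores are assumed to be F1 on a 0 to 100 scale.

```python
def retention_regressions(baseline, candidate, max_drop=3.0):
    """Return slices whose score fell by more than max_drop absolute points."""
    failures = {}
    for slice_name, old_score in baseline.items():
        new_score = candidate.get(slice_name)
        if new_score is None:
            # A slice that silently vanished from the eval run is also a failure.
            failures[slice_name] = "missing from candidate run"
        elif old_score - new_score > max_drop:
            failures[slice_name] = f"{old_score:.1f} -> {new_score:.1f}"
    return failures


def gate(baseline, candidate, max_drop=3.0):
    """Exit non-zero (failing a CI job) if any legacy slice regressed too far."""
    failures = retention_regressions(baseline, candidate, max_drop)
    if failures:
        raise SystemExit(f"retention gate failed: {failures}")
    return True


baseline = {"nda": 91.0, "vendor_msa": 88.0}
naive_sft = {"nda": 71.0, "vendor_msa": 87.0}
print(retention_regressions(baseline, naive_sft))
```

Run against the naive SFT scores, the NDA slice is flagged for its 20-point drop while the 1-point move on the other slice passes. Treating a missing slice as a failure guards against eval sets being dropped from the pipeline unnoticed.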

Harbor Legal refactor: from silent regression to gated continual learning

Harbor Legal’s first fine-tune covered NDAs, vendor MSAs, and employment agreements — 18k examples, 91% clause-level F1 on NDAs. The SaaS expansion added subscription terms, auto-renewal clauses, and SLA credit language. Full-weight SFT on SaaS-only data for three epochs improved the new head but washed out attention patterns for mutual-NDA reciprocity and carve-out lists.

The fix had four layers:

Retention suite in CI — 1,200 frozen NDA and vendor examples; build fails if any slice drops >3 F1 points.
20% replay buffer — every SaaS training batch mixed stratified legacy examples; ratio tuned on validation retention curves.
Task adapters — separate LoRA for SaaS vs legacy corporate templates; intent router on document title + first-page embedding.
Quarterly merge review — decide each quarter whether a merged full retrain is cheaper than adapter sprawl (currently eight adapters).

Post-refactor: SaaS F1 93%, NDA F1 89% (vs 71% after naive SFT), router mis-route rate 2.1% on held-out routing set.

Technique decision table

Approach	Retention quality	Complexity	Best when
Naive sequential SFT	Poor	Lowest	Single-task, disposable prototypes only
Replay buffer mixing	Good	Low	2–5 sequential domain updates, stored legacy data
EWC + replay	Good–very good	Medium	Limited replay storage, similar task families
Per-task LoRA + routing	Very good	Medium–high	Multi-tenant or clearly separable task types
Full retrain on merged corpus	Best	Medium (data ops)	<100k total examples, infrequent major releases
RAG-only knowledge update	N/A (no weight change)	Low–medium	Factual updates, not style or reasoning shifts
Separate model per task	Best isolation	High serving cost	Hard regulatory boundaries between task types

See fine-tuning vs RAG when the “new knowledge” is primarily factual and retrievable rather than behavioral.

Common pitfalls

Evaluating only the new task — the most common production mistake; always run retention suites before deploy.
Replay without stratification — random legacy samples overweight frequent classes; rare clause types still disappear.
Adapter sprawl without governance — twelve orphaned LoRAs nobody owns; schedule quarterly merge-or-delete reviews.
Router trained on stale intents — new doc types mis-route to the wrong adapter; monitor routing confidence and fall back to the base model.
Confusing forgetting with RAG staleness — model weights fine but retrieval corpus missing new policy; bisect before retraining.
Over-regularizing — heavy KL-to-base prevents learning genuinely new reasoning patterns; tune penalty on validation for both new and old tasks.
Ignoring prompt drift — system prompt changes between training and inference mimic forgetting in evals.
Full SFT when LoRA would suffice — unnecessary weight movement increases forgetting surface area.

Production checklist

Build a frozen retention eval set for every task before the first fine-tune ships.
Block training promotion if any legacy task F1 drops by more than the agreed threshold.
Mix 10–30% stratified replay from prior tasks into sequential SFT batches.
Prefer LoRA or QLoRA over full-weight updates when adding domains incrementally.
Consider per-task adapters with explicit routing when tasks are separable by metadata.
Log training data version, adapter ID, and base checkpoint hash on every inference.
Run monthly retention eval on production sample; alert on slice-level drift.
Quarterly review: merge adapters, full retrain, or prune unused task heads.
Document plasticity–stability choice in model card; include known retention limits.
When knowledge is factual only, evaluate RAG corpus update before fine-tuning.
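
For the logging item in the checklist, one possible shape for a per-inference provenance record is sketched below. The field names and example values are hypothetical; the point is only that training data version, adapter ID, and base checkpoint hash travel together with every request so a regression can be traced to a specific training pass.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class InferenceProvenance:
    request_id: str
    base_checkpoint_hash: str
    adapter_id: str
    training_data_version: str


def provenance_log_line(record):
    """Serialize one record as a single JSON line with stable key order."""
    return json.dumps(asdict(record), sort_keys=True)


# Placeholder values for illustration only.
example = InferenceProvenance(
    request_id="req-0001",
    base_checkpoint_hash="sha256:placeholder",
    adapter_id="saas_lora",
    training_data_version="v2",
)
print(provenance_log_line(example))
```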

Key takeaways

Catastrophic forgetting is the default outcome of naive sequential fine-tuning — new task gains often hide silent losses on legacy domains.
Replay buffers, EWC, adapter routing, and full merged retrains are the practical mitigation stack; most teams should start with stratified replay plus LoRA.
Retention eval suites in CI are as mandatory as tests on the new task — without them you will ship regressions.
Harbor Legal recovered NDA F1 from 71% to 89% by combining 20% replay, task-specific LoRA, and gated promotion, while SaaS performance held at 93%.
When updates are purely factual, RAG may avoid weight drift entirely; when behavior or format must change, plan for continual learning upfront.