LLM machine unlearning explained

Harbor Legal fine-tuned a 7B contract-QA model on 14,000 redacted client agreements. When one client exercised a contractual withdrawal right, the team deleted their files from the RAG corpus and blocked document IDs at retrieval time. A week later, a red-team prompt still extracted a verbatim indemnity clause unique to the withdrawn agreement — 41 tokens matching the source document at 94% overlap. The model had memorized the text during fine-tuning, not merely indexed it. Runtime retrieval controls could not erase what the weights already encoded.

Machine unlearning is the process of removing the influence of specific training examples from a model without retraining from scratch on the entire corpus. Regulators, customers, and copyright holders increasingly expect provable deletion: GDPR's “right to erasure” and similar privacy frameworks can reach models trained on personal data, and license or contract terms can impose comparable duties for licensed data. This guide explains exact vs. approximate unlearning, common LLM methods (gradient ascent, negative preference optimization, selective fine-tuning, knowledge editing), evaluation with membership inference and extraction probes, the Harbor Legal refactor, a decision table comparing unlearning techniques with full retraining and RAG-only deletion, pitfalls, and a production checklist.

What machine unlearning means for LLMs

In classical ML, exact unlearning means the updated model is statistically indistinguishable from one trained on the dataset minus the forget set. For billion-parameter LLMs, exact unlearning usually requires retraining from a checkpoint before the forget data was introduced — correct but expensive.

Approximate unlearning accepts a tolerance: after the procedure, the model should (a) not reproduce or reveal forget-set content on probing, and (b) preserve utility on a retain set of examples that must stay learned. The two sets must be defined explicitly before any algorithm runs.
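As a minimal sketch of fixing the two sets before any algorithm runs, the snippet below partitions a corpus by document ID and refuses to continue if a forget ID matches nothing or the retain set ends up empty. The row layout (`doc_id`, `client`, `text`), the sample IDs, and the function name `split_forget_retain` are assumptions for illustration, not part of any particular pipeline.

```python
from typing import Dict, List, Set, Tuple

Row = Dict[str, str]  # assumed row layout: {"doc_id": ..., "client": ..., "text": ...}


def split_forget_retain(rows: List[Row], forget_ids: Set[str]) -> Tuple[List[Row], List[Row]]:
    """Partition rows into (forget, retain) by document ID; the two lists never overlap."""
    known_ids = {row["doc_id"] for row in rows}
    missing = forget_ids - known_ids
    if missing:
        # A forget ID that matches nothing usually means the request was mis-scoped.
        raise ValueError(f"forget IDs not found in corpus: {sorted(missing)}")
    forget = [row for row in rows if row["doc_id"] in forget_ids]
    retain = [row for row in rows if row["doc_id"] not in forget_ids]
    if not retain:
        raise ValueError("retain set is empty; nothing would anchor the model's utility")
    return forget, retain


# Hypothetical three-row corpus: one client withdraws, one stays.
corpus = [
    {"doc_id": "a-001", "client": "acme", "text": "indemnity clause ..."},
    {"doc_id": "a-002", "client": "acme", "text": "termination clause ..."},
    {"doc_id": "b-001", "client": "birch", "text": "confidentiality clause ..."},
]
forget_rows, retain_rows = split_forget_retain(corpus, {"a-001", "a-002"})
print(len(forget_rows), "forget rows;", len(retain_rows), "retain rows")
```

Keying both sets on stable document IDs lets the same IDs be reused later for probes, manifests, and audit logs.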

Unlearning differs from:

RAG document deletion — removes retrieval access but not weight-level memorization.
Output filtering — blocks known strings at inference; does not stop paraphrased leakage or jailbreak extraction.
Model editing — targeted fact updates (e.g., “the CEO is now X”) rather than bulk example removal.

Teams need unlearning when contracts expire, users withdraw consent, licensed corpora are revoked, poisoned rows must be excised (see data poisoning), or safety incidents require excising harmful fine-tune pairs without discarding months of alignment work on the retain set.

Why weights remember what you deleted from the index

LLMs memorize training snippets, especially rare or high-salience sequences: unique contract clauses, API keys accidentally logged, customer names repeated in support transcripts. Memorization rises with model capacity and training epochs, and with lower dataset diversity. A single epoch of full fine-tuning on a 200-document subset can embed verbatim spans even when PII redaction ran on the inputs, because structural patterns and rare tokens still fingerprint the source.

Model collapse from repeated synthetic self-training is a different failure mode; unlearning addresses targeted removal, not corpus-wide quality decay. But aggressive unlearning that over-optimizes against the forget set can damage retain-set performance, and that is the central trade-off every method must navigate.

Unlearning methods for language models

Gold standard: retrain without the forget set

Retrain from the last checkpoint before forget data entered, or from the base model on retain-set-only data. Guarantees exact unlearning if the pipeline is deterministic and the forget set is fully excluded. Cost scales with model size and retain corpus volume; impractical for weekly deletion requests on 70B models but right for annual compliance audits or post-incident rebuilds.

Gradient ascent on the forget set

Maximize loss (gradient ascent) on forget examples so the model assigns low likelihood to those sequences. Simple to implement: flip the sign on the standard cross-entropy gradient for forget batches. Risk: catastrophic forgetting on adjacent retain concepts if forget and retain sets share vocabulary or topic overlap. Mitigate with mixed batches (ascent on forget, descent on retain) and early stopping when extraction probes pass.
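The sign flip and the mixed-batch mitigation can be illustrated on a deliberately tiny stand-in: a four-token categorical model whose only parameters are its logits. This is a toy sketch, not an LLM training loop. The function names, the learning rate, the `stop_loss` threshold (used here in place of a real extraction probe), and the token IDs are all assumptions chosen for illustration. Token 1 appears in both batches to mimic vocabulary overlap between forget and retain data.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mean_ce(logits: Sequence[float], targets: Sequence[int]) -> float:
    """Mean cross-entropy of the target token IDs under softmax(logits)."""
    probs = softmax(logits)
    return -sum(math.log(probs[t]) for t in targets) / len(targets)


def ce_grad(logits: Sequence[float], targets: Sequence[int]) -> List[float]:
    """Gradient of mean cross-entropy w.r.t. logits: softmax minus target frequencies."""
    grad = softmax(logits)
    for t in targets:
        grad[t] -= 1.0 / len(targets)
    return grad


def unlearn(logits: Sequence[float], forget: Sequence[int], retain: Sequence[int],
            retain_weight: float, lr: float = 0.5, stop_loss: float = 4.0,
            max_steps: int = 100) -> List[float]:
    """Ascent on the forget batch, weighted descent on the retain batch, with early stopping."""
    z = list(logits)
    for _ in range(max_steps):
        if mean_ce(z, forget) >= stop_loss:  # toy stand-in for "extraction probes pass"
            break
        g_forget = ce_grad(z, forget)
        g_retain = ce_grad(z, retain)
        # Descent on retain loss; the minus sign in front of g_forget is the ascent step.
        z = [zi - lr * (retain_weight * gr - gf) for zi, gr, gf in zip(z, g_retain, g_forget)]
    return z


start = [2.0, 1.0, 1.0, 0.0]
forget_batch = [0, 0, 1]      # token 1 is shared with the retain batch
retain_batch = [1, 2, 1, 2]
ascent_only = unlearn(start, forget_batch, retain_batch, retain_weight=0.0)
mixed = unlearn(start, forget_batch, retain_batch, retain_weight=1.0)
for name, z in [("start", start), ("ascent only", ascent_only), ("mixed", mixed)]:
    print(f"{name}: forget CE={mean_ce(z, forget_batch):.2f} retain CE={mean_ce(z, retain_batch):.2f}")
```

In this toy setup, both runs push the forget loss past the stopping threshold, but the ascent-only run ends with a higher retain loss than it started with, while the mixed-batch run does not. Real models behave far less cleanly, so treat this only as an intuition for why the retain term matters.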

Negative preference optimization (NPO) and relatives

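Recent alignment literature frames unlearning as preference learning: treat forget examples as “rejected” completions and, in variants that add a retain term, retain examples as “chosen.” NPO-style losses push down log-probability on forget text while anchoring to a reference model through a log-ratio or KL term, which is similar machinery to DPO but with deletion intent (plain NPO keeps only the rejected half of the DPO objective). This is often more stable than raw gradient ascent because the penalty on a forget example shrinks once its likelihood falls well below the reference, and the reference anchoring helps preserve retain behavior.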
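One common formulation of the NPO forget term is (2/β) · mean(log(1 + (π_θ/π_ref)^β)) over forget sequences. The sketch below computes it from sequence-level log-probabilities that the caller is assumed to have already obtained from the current model and a frozen reference. The function names and the default `beta` are illustrative, and any retain or KL term would be added separately.

```python
import math
from typing import Sequence


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def npo_forget_loss(policy_logps: Sequence[float], ref_logps: Sequence[float],
                    beta: float = 0.4) -> float:
    """NPO forget term: (2/beta) * mean(log(1 + (pi_theta / pi_ref) ** beta)).

    policy_logps / ref_logps are summed token log-probabilities of each forget
    sequence under the model being unlearned and under the frozen reference.
    """
    if len(policy_logps) != len(ref_logps) or not policy_logps:
        raise ValueError("need equally sized, non-empty log-prob lists")
    terms = [softplus(beta * (lp - lr)) for lp, lr in zip(policy_logps, ref_logps)]
    return (2.0 / beta) * sum(terms) / len(terms)


# Before unlearning, the policy equals the reference, so the log-ratio is zero.
at_start = npo_forget_loss([-12.0, -30.0], [-12.0, -30.0])
# After unlearning, the policy assigns much lower likelihood to the forget text.
after = npo_forget_loss([-40.0, -65.0], [-12.0, -30.0])
print(at_start, after)
```

Because the loss flattens toward zero as the policy's likelihood drops well below the reference, the gradient on already-forgotten examples fades instead of growing without bound as it does under plain ascent.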

Selective fine-tuning and task vectors

If the forget set entered via a known LoRA adapter, removing or negating that adapter (task arithmetic: subtract the fine-tune delta) can excise a bounded update without touching the base weights. Works when the forget scope matches the adapter scope; fails when memorization diffused into base weights through continued training.
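The task-arithmetic idea can be sketched with plain dictionaries standing in for weight tensors: compute the fine-tune delta, then add it back with a negative scale. The flat `name -> list of floats` layout, the parameter names, and the numbers are hypothetical. A real adapter would be handled through the training framework's own tooling.

```python
from typing import Dict, List

Weights = Dict[str, List[float]]  # assumed flat layout: parameter name -> values


def task_vector(base: Weights, finetuned: Weights) -> Weights:
    """Delta introduced by fine-tuning: finetuned minus base, per parameter."""
    return {name: [f - b for f, b in zip(finetuned[name], base[name])] for name in base}


def apply_task_vector(weights: Weights, delta: Weights, scale: float) -> Weights:
    """Return weights + scale * delta; scale=-1.0 subtracts the fine-tune update."""
    return {name: [w + scale * d for w, d in zip(weights[name], delta[name])]
            for name in weights}


base = {"layer0.q_proj": [0.10, -0.20], "layer0.v_proj": [0.05, 0.30]}
finetuned = {"layer0.q_proj": [0.16, -0.26], "layer0.v_proj": [0.05, 0.42]}

delta = task_vector(base, finetuned)
restored = apply_task_vector(finetuned, delta, scale=-1.0)
print(restored)
```

Subtracting the full delta only recovers the base weights when the delta captures everything the forget data changed, which is the scoping condition described above.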

Knowledge editing (limited scope)

Locate-and-edit methods (ROME, MEMIT) update specific factual associations in MLP layers. Useful for single-fact removal (“delete this person's direct quote”) but do not scale to thousands of documents. Pair with corpus-level methods for bulk unlearning.

Evaluation: did unlearning actually work?

Deletion is not proven by “the model refused to answer.” You need adversarial measurement on three axes:

Forget quality

Extraction probes — prefix the first 20–50 tokens of forget documents and measure continuation overlap (ROUGE, exact-match n-grams).
Membership inference attacks (MIA) — classifiers that guess whether a sample was in training; post-unlearning AUC should approach 0.5 (random) for forget-set members.
Closed-book QA — questions answerable only from forget documents should drop to base-model performance or abstention.
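Two of these forget-quality metrics are simple enough to sketch directly. The first function measures what fraction of a forget document's n-grams reappear in a model continuation. The second computes membership-inference AUC as the probability that a forget-set member receives a higher attack score than a non-member. The function names, whitespace tokenization, n-gram size, and sample scores are assumptions for illustration. A real probe would use the model's tokenizer and an actual attack classifier.

```python
from typing import Sequence


def ngram_overlap(reference: Sequence[str], generated: Sequence[str], n: int = 4) -> float:
    """Fraction of reference n-grams that also appear in the generated continuation."""
    ref = [tuple(reference[i:i + n]) for i in range(len(reference) - n + 1)]
    if not ref:
        return 0.0
    gen = {tuple(generated[i:i + n]) for i in range(len(generated) - n + 1)}
    return sum(1 for gram in ref if gram in gen) / len(ref)


def mia_auc(member_scores: Sequence[float], nonmember_scores: Sequence[float]) -> float:
    """AUC as P(member score > non-member score), counting ties as one half."""
    wins = 0.0
    for m in member_scores:
        for u in nonmember_scores:
            wins += 1.0 if m > u else 0.5 if m == u else 0.0
    return wins / (len(member_scores) * len(nonmember_scores))


reference = "the supplier shall indemnify the client against all third party claims".split()
verbatim = list(reference)
partial = reference[:6] + ["subject", "to", "limits"]
unrelated = "each party is responsible for its own losses".split()

print(ngram_overlap(reference, verbatim), ngram_overlap(reference, partial),
      ngram_overlap(reference, unrelated))
print(mia_auc([0.9, 0.4], [0.5, 0.3]))
```

Exact n-gram overlap says nothing about paraphrased leakage, so it is best read alongside the MIA and closed-book QA results and not as a replacement for them.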

Retain utility

Run held-out benchmarks on the retain set: contract-QA accuracy, summarization F1, safety refusal rates. A successful unlearn should move forget metrics while regressing retain tasks by no more than 2–3 points. Track perplexity on retain text; large spikes signal collateral damage.
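A retain-utility check along these lines can be sketched as follows. Perplexity is computed as the exponential of the mean negative token log-likelihood. The 2-point score budget mirrors the range mentioned above, while the 1.2x perplexity ratio, the function names, and the sample log-probabilities are assumptions picked only to make the example concrete.

```python
import math
from typing import Sequence


def perplexity(token_logps: Sequence[float]) -> float:
    """exp of the mean negative log-likelihood (natural-log token probabilities)."""
    return math.exp(-sum(token_logps) / len(token_logps))


def retain_ok(baseline_score: float, candidate_score: float,
              baseline_ppl: float, candidate_ppl: float,
              max_drop: float = 2.0, max_ppl_ratio: float = 1.2) -> bool:
    """True if the score drop (in points) and the perplexity increase stay within budget."""
    score_fine = (baseline_score - candidate_score) <= max_drop
    ppl_fine = candidate_ppl <= max_ppl_ratio * baseline_ppl
    return score_fine and ppl_fine


# Hypothetical per-token log-probs on the same retain text, before and after unlearning.
ppl_before = perplexity([-0.7, -0.5, -0.9, -0.6])
ppl_after = perplexity([-0.8, -0.6, -1.0, -0.7])

print(retain_ok(86.0, 84.5, ppl_before, ppl_after))   # small drop, mild perplexity rise
print(retain_ok(86.0, 71.0, ppl_before, ppl_after))   # large accuracy regression
```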

Side effects and re-learning

Monitor whether subsequent fine-tune rounds re-memorize deleted content from cached gradients or stale data lakes. Version control training corpora with cryptographic manifests so deleted IDs cannot re-enter silently.
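One way to make silent re-entry detectable is a hashed manifest plus a tombstone list checked at ingest, sketched below with the standard library. The function names, document IDs, and manifest layout are hypothetical. A production system would also need signing and storage for the manifest, which this sketch leaves out.

```python
import hashlib
import json
from typing import Dict, List, Set, Tuple


def filter_ingest(docs: Dict[str, str], tombstones: Set[str]) -> Tuple[Dict[str, str], List[str]]:
    """Drop any document whose ID was previously deleted; report what was rejected."""
    kept = {doc_id: text for doc_id, text in docs.items() if doc_id not in tombstones}
    rejected = sorted(doc_id for doc_id in docs if doc_id in tombstones)
    return kept, rejected


def build_manifest(docs: Dict[str, str]) -> Dict[str, object]:
    """Per-document SHA-256 hashes plus one digest over the sorted (id, hash) pairs."""
    hashes = {doc_id: hashlib.sha256(text.encode("utf-8")).hexdigest()
              for doc_id, text in docs.items()}
    canonical = json.dumps(sorted(hashes.items()), separators=(",", ":"))
    digest = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    return {"docs": hashes, "digest": digest}


# Hypothetical nightly batch in which a deleted document has crept back in.
tombstones = {"a-001"}
nightly_batch = {
    "a-001": "indemnity clause ...",
    "b-001": "confidentiality clause ...",
    "b-002": "governing law clause ...",
}
kept, rejected = filter_ingest(nightly_batch, tombstones)
manifest = build_manifest(kept)
print("rejected:", rejected)
print("manifest digest:", manifest["digest"])
```

Recording the digest with each training run gives auditors a single value to compare: if a deleted ID re-enters the corpus, the digest of the training set changes.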

Harbor Legal contract withdrawal refactor

Harbor Legal's incident involved 847 fine-tune rows derived from one client's agreements (about 6% of the fine-tune set). The retain set was 13,200 rows from other clients plus generic clause templates.

Attempt 1 — RAG deletion only: Removed 12 indexed PDFs and blocklisted source IDs. Extraction probe still recovered 41-token indemnity span (94% overlap). MIA AUC on forget members: 0.81 (still detectable as training data).

Attempt 2 — gradient ascent only: 400 ascent steps on forget batches. Extraction overlap dropped to 8%, but retain-set clause-classification accuracy fell from 86% to 71% — adjacent “indemnity” concepts in other contracts were damaged.

Attempt 3 — NPO with mixed batches: 200 steps with forget-as-rejected pairs, retain-as-chosen pairs, beta=0.4 KL to pre-incident checkpoint. Extraction overlap 3%; MIA AUC 0.54; retain accuracy 84%. Passed legal review threshold (<5% overlap, retain within 2 points of baseline).

Lesson: mixed-batch NPO with reference anchoring beat both RAG-only and naive ascent for small, topic-overlapping forget sets on mid-size fine-tunes.

Technique decision table

Approach	Forget fidelity	Retain preservation	Cost / latency	Best when
Full retrain (exact)	Highest	Highest if pipeline unchanged	Very high GPU-hours	Annual audits, large forget %, regulatory demand
RAG / index deletion only	Low (weights unchanged)	Perfect	Minutes	Data never fine-tuned in; retrieval-only products
Gradient ascent (forget only)	Medium–high	Risky on overlapping topics	Low–medium	Small forget set, low topic overlap with retain
NPO / preference unlearning	Medium–high	Better than ascent alone	Medium	Fine-tuned models, overlapping vocabulary
LoRA adapter removal	High if scoped to adapter	High for base model	Very low	Forget data only in identifiable adapter
Knowledge editing	High for single facts	Variable	Low per fact	One-off PII or credential removal

Common pitfalls

Assuming RAG deletion is enough — any fine-tune or SFT exposure can memorize; probe weights, not just the index.
No retain set defined — unlearning without retain constraints destroys unrelated capabilities.
Stopping at refusals — a model that says “I cannot share that” may still leak under prefix completion or logit lens analysis.
Over-ascent — driving forget loss too far creates gibberish basins that hurt fluency on retain topics sharing tokens.
Ignoring re-ingestion — deleted rows reappear from nightly ETL unless manifests block IDs at ingest.
Single-metric sign-off — extraction overlap alone misses MIA; MIA alone misses paraphrased leakage.
No reference checkpoint — KL anchoring needs a pre-forget snapshot; without it, NPO drifts globally.
Legal vs technical mismatch — counsel may require exact unlearning proof; approximate methods need explicit tolerance in contracts.

Production checklist

Define forget set and retain set with document IDs before any training or unlearning run.
Maintain checkpoint lineage: base, pre-forget fine-tune, post-unlearn artifact.
Run extraction probes and MIA on forget members; block deploy if AUC > 0.6 or overlap > 5%.
Benchmark retain tasks; fail if regression exceeds agreed threshold (e.g., 2 points).
Delete forget data from RAG, object storage, and training manifests in the same change ticket.
Log unlearning runs with hyperparameters, step count, and eval curves for audit.
Re-probe 30 days post-deploy to catch re-memorization from continued learning loops.
Document approximate vs exact unlearning guarantees in customer DPAs.
For LoRA products, isolate client data per adapter to enable cheap adapter-level deletion.
Pair unlearning with access controls and output monitoring; defense in depth.
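The numeric gates in this checklist can be combined into one deploy check, sketched below. The thresholds (AUC above 0.6, overlap above 5%, retain regression above 2 points) come from the checklist items. The `UnlearnEval` record, the `gate` function, and the failing example run are hypothetical. The passing example reuses the figures reported for Harbor Legal's third attempt.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class UnlearnEval:
    extraction_overlap: float  # fraction of forget text recovered, 0..1
    mia_auc: float             # membership-inference AUC on forget members
    retain_baseline: float     # retain-task score before unlearning, in points
    retain_after: float        # retain-task score after unlearning, in points


def gate(ev: UnlearnEval, max_overlap: float = 0.05, max_auc: float = 0.60,
         max_retain_drop: float = 2.0) -> List[str]:
    """Return the list of failed checks; an empty list means the artifact may deploy."""
    failures = []
    if ev.extraction_overlap > max_overlap:
        failures.append(f"extraction overlap {ev.extraction_overlap:.0%} exceeds {max_overlap:.0%}")
    if ev.mia_auc > max_auc:
        failures.append(f"MIA AUC {ev.mia_auc:.2f} exceeds {max_auc:.2f}")
    drop = ev.retain_baseline - ev.retain_after
    if drop > max_retain_drop:
        failures.append(f"retain regression {drop:.1f} points exceeds {max_retain_drop:.1f}")
    return failures


# Figures reported for Harbor Legal's NPO attempt: 3% overlap, AUC 0.54, retain 86 -> 84.
npo_run = UnlearnEval(extraction_overlap=0.03, mia_auc=0.54,
                      retain_baseline=86.0, retain_after=84.0)
# Made-up failing run, for contrast only.
bad_run = UnlearnEval(extraction_overlap=0.08, mia_auc=0.70,
                      retain_baseline=86.0, retain_after=71.0)

print(gate(npo_run))
print(gate(bad_run))
```

Returning the reasons instead of a bare boolean makes the result easy to attach to the audit log for each unlearning run.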

Key takeaways

Deleting documents from RAG does not unlearn weights — fine-tuned LLMs can memorize and leak text that no longer appears in any index.
Machine unlearning targets a forget set while preserving a retain set; approximate methods trade perfect guarantees for feasible GPU cost.
Evaluate with extraction probes and membership inference, not just refusal behavior — paraphrased and prefix-completion leaks are common.
NPO-style unlearning with KL anchoring usually beats naive gradient ascent when forget and retain topics overlap.
Harbor Legal cut verbatim clause recovery from 94% to 3% overlap while holding retain accuracy at 84% (from an 86% baseline) using mixed-batch NPO, not RAG deletion alone.