LLM distillation explained

Your support queue routes every ticket through a frontier API at $0.003 per call. Eighty percent of messages fall into one of six intents: password reset, billing, shipping, refund, technical bug, or “other.” A small student model (around 3B parameters) trained with knowledge distillation can come close to the teacher on those intents at a fraction of the latency and cost; the worked example in this guide reports 94.1% agreement with the teacher, p50 latency down from 1.8s to 65ms, and a 76% cut in routing cost. That is the practical promise of LLM distillation: transfer capability from a large teacher model into a smaller student without replicating the teacher’s full training budget.

Distillation sits alongside fine-tuning, quantization, and small language models in a compression stack, but it is not interchangeable with any of them. This guide covers teacher-student training, logits and hidden-state transfer, synthetic data distillation, evaluation methodology, a worked example (the Harbor Support intent classifier), a decision table for when to distill versus prompt or use RAG, common pitfalls, and a production checklist.

What LLM distillation is

Knowledge distillation trains a compact student network to mimic a larger teacher’s behavior on a target task. Geoffrey Hinton, Oriol Vinyals, and Jeff Dean popularized the idea in 2015, building on earlier model-compression work, with experiments on image and speech recognition models. Modern LLM distillation applies the same principle at sequence scale: the student learns not only from ground-truth labels but also from the teacher’s soft probability distribution over tokens, the “dark knowledge” in logits that hard labels discard.

In production LLM stacks, distillation typically means one of:

  • Logits distillation: minimize the KL divergence between teacher and student output distributions on the same prompts, so the student learns how the teacher ranks plausible next tokens. This requires access to teacher logits and, in practice, a shared or aligned vocabulary.
  • Sequence-level distillation: train on teacher-generated completions as supervised targets (often called “data distillation” or “synthetic SFT”); simpler to implement, but it loses the fine-grained probability signal.
  • Hidden-state / feature distillation: align intermediate layer representations; used in research (for example DistilBERT and TinyBERT) and in some proprietary training recipes, but heavier to operationalize because it needs access to teacher internals.
  • Self-distillation: a model distills from its own earlier checkpoint or an ensemble of its checkpoints; this can improve calibration without an external teacher.

Distillation compresses behavior, not architecture. You still choose student size, context length, and tokenizer. A distilled 1.5B model cannot magically reason like a 70B teacher on out-of-distribution prompts — it reproduces teacher behavior on the distillation distribution, which is only as good as your prompt coverage and data quality.

How distillation differs from fine-tuning, pruning, and quantization

Teams conflate these compression techniques; each solves a different bottleneck:

  • Fine-tuning (SFT / LoRA) adapts weights to labeled task data. Without a teacher, the student only sees hard targets. Distillation adds soft targets from a stronger model, often improving sample efficiency on narrow tasks.
  • Quantization reduces the numeric precision of an existing model (for example, FP16 to INT4). It cuts memory and speeds inference but does not teach new behavior. “Distill first, quantize second” is a common pipeline.
  • Pruning removes weights or attention heads. Structural pruning can shrink models but usually needs retraining; distillation is the standard recovery step after aggressive pruning.
  • RAG retrieves external documents at inference time. Distillation internalizes patterns the teacher encodes in answers; RAG supplies fresh facts the student never saw. They combine well: distill routing and formatting, RAG for knowledge updates.

Think of the stack as: distill capability into a smaller model, then quantize for deployment, then route hard cases to a frontier teacher via a cascade. Skipping distillation and only quantizing a general 7B model rarely matches a 3B model distilled on your exact intent taxonomy.

Building a distillation pipeline

Step 1: Define the task boundary

Distillation wins on bounded tasks with measurable accuracy: classification, extraction, structured JSON generation, short-form drafting. Open-ended chat distillation tends to plateau — the student inherits teacher hallucinations and still cannot match breadth. Write a task spec: input schema, output schema, refusal rules, and escalation triggers before generating data.

Step 2: Curate or synthesize training prompts

Prompt diversity determines generalization. Sources include:

  • Production logs (redacted, consent-checked) representing real user phrasing.
  • Template expansion: paraphrase seeds with the teacher model across registers (formal, terse, multilingual).
  • Hard negatives: near-miss examples that confuse similar intents (refund vs. cancellation).

For sequence-level distillation, run the teacher on each prompt with temperature 0.7–1.0, keep top completions, and filter with programmatic checks (valid JSON, label consistency, length bounds). Reject teacher outputs that violate policy — distilling bad behavior cements it.
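The programmatic checks above can be sketched in a few lines of Python. This is a minimal illustration rather than a production filter: the JSON shape (a single "intent" field), the snake_case intent names, the 500-character bound, and the function name keep_completion are all assumptions made for the example.

```python
import json

# Hypothetical label set, based on the six intents named in this guide.
ALLOWED_INTENTS = frozenset(
    {"password_reset", "billing", "shipping", "refund", "technical_bug", "other"}
)

def keep_completion(raw, expected_intent=None, max_chars=500):
    """Return the parsed record if a teacher completion passes every check, else None."""
    if len(raw) > max_chars:  # length bound
        return None
    try:
        record = json.loads(raw)  # valid-JSON check
    except json.JSONDecodeError:
        return None
    intent = record.get("intent") if isinstance(record, dict) else None
    if not isinstance(intent, str) or intent not in ALLOWED_INTENTS:
        return None
    # Label consistency: if the prompt was generated for a known intent,
    # drop completions where the teacher disagrees with it.
    if expected_intent is not None and intent != expected_intent:
        return None
    return record

print(keep_completion('{"intent": "refund"}'))                             # {'intent': 'refund'}
print(keep_completion('{"intent": "refund"}', expected_intent="billing"))  # None
```

Policy checks are not shown here; they usually need a separate classifier or human review rather than a schema test.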

Step 3: Choose loss functions

Logits distillation combines a standard cross-entropy loss on ground truth with a distillation loss on teacher logits:

L = α · CE(student, labels) + (1 − α) · KL(softmax(teacher/T), softmax(student/T))

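Temperature T softens both distributions so the student learns relative rankings among plausible tokens; T = 1 recovers the ordinary softmax. Implementations commonly multiply the KL term by T² so its gradient magnitude stays comparable to the cross-entropy term as T changes. Typical α ranges from 0.3 to 0.7 depending on label quality: when labels are noisy but the teacher is strong, lower α to weight KL more heavily; when you have gold human annotations, raise α to weight CE more heavily.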
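To make the combined loss concrete, here is a minimal pure-Python sketch for a single example. It assumes teacher and student logits share the same vocabulary order, and it multiplies the KL term by T², which is common practice but is not part of the formula as written above. The function names and default values are illustrative; a real training loop would use a tensor library and batch the computation.

```python
import math

def softmax(logits, temperature=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, teacher_logits, label, alpha=0.5, temperature=2.0):
    """alpha * CE(student, label) + (1 - alpha) * T^2 * KL(teacher_T || student_T)."""
    # Hard-label cross-entropy uses the unsoftened student distribution.
    ce = -math.log(softmax(student_logits)[label])
    # Soft-target term compares the two temperature-softened distributions.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    kl = sum(p * math.log(p / q) for p, q in zip(p_teacher, p_student) if p > 0)
    return alpha * ce + (1 - alpha) * temperature ** 2 * kl

# The KL term vanishes when the student reproduces the teacher's logits exactly.
print(distillation_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0], label=2, alpha=0.0))  # 0.0
```

Setting alpha to 1.0 reduces this to ordinary supervised fine-tuning, which is a useful sanity check when debugging a pipeline.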

Step 4: Select and train the student

Start from an instruct-tuned base in the target size class (1B–8B). Full fine-tuning is expensive; LoRA adapters on attention layers often suffice for distillation SFT. Train for 1–3 epochs with early stopping on a held-out eval set — overfitting to teacher quirks is the main risk. Monitor not only task accuracy but calibration (expected calibration error) if you threshold confidence for escalation.
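Expected calibration error can be computed directly from held-out predictions. The sketch below assumes equal-width confidence bins and a default of 10 bins; both are conventional choices rather than anything this guide prescribes, and the function name is illustrative.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin-size-weighted mean of |accuracy - mean confidence| over equal-width bins."""
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 falls in the last bin
        bins[idx].append((conf, ok))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece

# A model that is always 100% confident but right half the time is badly calibrated.
print(expected_calibration_error([1.0, 1.0], [True, False]))  # 0.5
```

A low value here matters most near the escalation threshold, so it can be worth inspecting the bins around that threshold individually.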

Step 5: Evaluate against the teacher and production baselines

Report: accuracy / F1 on human-labeled eval, agreement rate with teacher, latency p50/p99, cost per 1K requests, and escalation rate when routing low-confidence cases to the teacher. A student that matches teacher accuracy on synthetic data but fails on live slang is a common failure mode — hold out real production slices.
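A report along these lines can be assembled from aligned per-ticket lists. The following is a sketch under simple assumptions: labels are plain strings, percentiles use the nearest-rank method, and the escalation rate is the share of tickets whose student confidence falls below a threshold. F1 and cost per 1K requests are left out for brevity, and all names are illustrative.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list; pct is in (0, 100]."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def eval_report(gold, student, teacher, latencies_ms, confidences, threshold):
    """Summarize one eval run; all lists are aligned by ticket."""
    n = len(gold)
    return {
        "student_accuracy": sum(s == g for s, g in zip(student, gold)) / n,
        "teacher_accuracy": sum(t == g for t, g in zip(teacher, gold)) / n,
        "teacher_agreement": sum(s == t for s, t in zip(student, teacher)) / n,
        "latency_p50_ms": percentile(latencies_ms, 50),
        "latency_p99_ms": percentile(latencies_ms, 99),
        "escalation_rate": sum(c < threshold for c in confidences) / n,
    }

report = eval_report(
    gold=["a", "b", "a", "b"],
    student=["a", "b", "b", "b"],
    teacher=["a", "b", "a", "a"],
    latencies_ms=[10, 20, 30, 40],
    confidences=[0.9, 0.5, 0.95, 0.7],
    threshold=0.82,
)
print(report["student_accuracy"], report["teacher_agreement"])  # 0.75 0.5
```

Running the same function separately on a synthetic slice and a real production slice makes the gap described above visible in a single comparison.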

Worked example: Harbor Support intent classifier

Harbor Support receives 12,000 tickets per day across six intents. The team runs GPT-4-class routing for three months, logging (prompt, teacher_label, teacher_confidence). They build a distillation set:

  1. Sample 50K historical tickets with PII scrubbed; stratify by intent and language.
  2. Augment with 20K synthetic paraphrases generated by the teacher from templates (for example, “user cannot log in” expanded into billing and password variants).
  3. Train a 2.4B instruct student with sequence-level SFT on teacher labels, then a short logits-distillation pass on 10K examples where logits are available via a local teacher deployment.
  4. Quantize to INT4 for CPU inference at the edge; target p99 < 80ms.
  5. Cascade rule: if student confidence < 0.82 or intent is technical_bug, escalate to teacher with full thread context.
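The cascade rule in step 5 is small enough to write out. This sketch uses the 0.82 threshold and the technical_bug intent from the list above; the function name, the return values "student" and "teacher", and the other intent names are assumptions for illustration.

```python
def route(intent, confidence, threshold=0.82, always_escalate=("technical_bug",)):
    """Return "teacher" when the ticket should be escalated, else "student"."""
    if confidence < threshold or intent in always_escalate:
        return "teacher"
    return "student"

print(route("billing", 0.91))        # student
print(route("refund", 0.60))         # teacher (low confidence)
print(route("technical_bug", 0.99))  # teacher (always escalated)
```

Keeping the rule as a pure function makes it easy to replay logged traffic against a new threshold before changing it in production.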

Results on a 2,000-ticket human-labeled eval: the student agrees with the teacher’s intent on 94.1% of tickets (the teacher agrees with human gold labels on 95.3%), escalates 18% of traffic, cuts routing cost by 76%, and reduces p50 latency from 1.8s to 65ms. Monthly re-distillation on new ticket phrasing limits drift. This pattern mirrors how many SLM families (Phi, small Gemma models) are reported to be trained at scale: strong teachers, curated synthetic mixtures, and aggressive size targets.
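The cost figure can be sanity-checked with a simple blended-cost model. The sketch assumes every ticket is scored by the student and escalated tickets additionally pay for one teacher call. The $0.003 teacher price and the 18% escalation rate come from this guide; the student cost of $0.00018 per ticket is a hypothetical value chosen because it reproduces the reported 76% saving under this model, not a number the example states.

```python
def blended_cost_per_ticket(teacher_cost, student_cost, escalation_rate):
    """Every ticket hits the student; escalated tickets also pay for a teacher call."""
    return student_cost + escalation_rate * teacher_cost

def savings_fraction(teacher_cost, student_cost, escalation_rate):
    """Share of the teacher-only cost that the cascade removes."""
    blended = blended_cost_per_ticket(teacher_cost, student_cost, escalation_rate)
    return 1 - blended / teacher_cost

# student_cost is hypothetical; see the note above.
print(round(savings_fraction(teacher_cost=0.003, student_cost=0.00018, escalation_rate=0.18), 2))  # 0.76
```

The model also shows why the escalation rate dominates: at 18% escalation, teacher calls still account for most of the remaining spend.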

Decision table: distillation vs alternatives

Scenario | Recommended approach | Why
High-volume narrow classifier / extractor | Distill to 1–4B + quantize | Teacher quality at student economics; latency matters
Low volume, task still evolving weekly | Frontier API + prompt engineering | Distillation data goes stale before ROI
Answers need fresh docs (policies, inventory) | RAG + small model for query rewrite | Facts change; distillation cannot update knowledge daily
500+ high-quality human labels, no teacher budget | LoRA SFT on mid-size base | Labels sufficient; distillation adds little
Teacher API cost > $2K/month on one task | Distillation pays back in 1–2 months | Amortize training cost against inference savings
Open-ended creative writing | Keep frontier model | Students rarely match teacher nuance; poor ROI
On-device privacy requirement | Distill + quantize + local cascade | No PII leaves device; teacher runs offline in batch jobs
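The payback row in the table can be checked with one division. In this sketch the $2,000 monthly teacher spend comes from the table and the 76% savings fraction from the Harbor example; the $3,000 one-off training cost is a purely hypothetical input, so substitute your own estimate.

```python
def payback_months(training_cost, monthly_teacher_spend, savings_fraction):
    """Months until cumulative inference savings cover the one-off training cost."""
    monthly_savings = monthly_teacher_spend * savings_fraction
    if monthly_savings <= 0:
        return float("inf")  # the project never pays back
    return training_cost / monthly_savings

# training_cost=3000 is a hypothetical figure used only for illustration.
print(round(payback_months(training_cost=3000, monthly_teacher_spend=2000, savings_fraction=0.76), 1))  # 2.0
```

Recurring costs such as periodic re-distillation are not modeled here and would lengthen the payback period.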

Synthetic data distillation at scale

Modern SLM families lean heavily on synthetic textbooks and instruction mixtures generated by frontier teachers. The recipe is not “prompt GPT once” — it is industrial data engineering:

  • Skill tagging — each synthetic example targets a competency (chain-of-thought math, code repair, safety refusal).
  • Quality filters — deduplication (MinHash), perplexity gates, unit-test execution for code samples, toxicity classifiers.
  • Mixture ratios — balance synthetic vs. human-curated web data; too much synthetic yields brittle, over-formal prose.
  • Iterative rounds — train a mid-checkpoint student, find failure clusters, generate targeted synthetic packs, retrain.

If you do not have frontier-scale budget, narrow synthetic generation to your task domain only. Ten thousand high-diversity intent examples beat a million generic chat transcripts for a router model.
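For a task-sized dataset, near-duplicate filtering does not need MinHash. The sketch below swaps in exact Jaccard similarity over word trigrams with a greedy pass, which is quadratic in the number of examples and therefore only suitable for tens of thousands of items; MinHash with locality-sensitive hashing is the usual replacement at larger scale. The 0.8 threshold and all names are assumptions.

```python
def shingles(text, n=3):
    """Set of word n-grams from lowercased text."""
    words = text.lower().split()
    if len(words) < n:
        return {tuple(words)}
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def jaccard(a, b):
    """Jaccard similarity of two sets; two empty sets count as identical."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0

def dedupe(examples, threshold=0.8):
    """Greedy filter: keep an example only if it is not too similar to any kept one."""
    kept, kept_shingles = [], []
    for text in examples:
        sh = shingles(text)
        if all(jaccard(sh, other) < threshold for other in kept_shingles):
            kept.append(text)
            kept_shingles.append(sh)
    return kept

print(dedupe([
    "I cannot log in to my account",
    "i cannot log in to my account",
    "where is my package",
]))
# ['I cannot log in to my account', 'where is my package']
```

Because the pass is greedy, the order of the input decides which member of a duplicate cluster survives, so sort by quality score first if you have one.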

Common pitfalls

  • Distilling teacher errors: uncorrected wrong labels become student ground truth; audit teacher outputs before training.
  • Distribution mismatch: synthetic prompts are too formal, so the student fails on slang and typos from real users.
  • Overfitting to teacher phrasing: the student copies hedge words and boilerplate, which flattens brand voice.
  • Ignoring the capability ceiling: a 1B student is unlikely to absorb the teacher’s multi-hop reasoning; escalate those cases instead of pretending.
  • Skipping confidence calibration: raw softmax scores mislead routing; calibrate on a validation set.
  • No regression suite: new distillation runs can silently regress rare intents; maintain per-class metrics.
  • Legal / license blind spots: some teacher APIs restrict using outputs to train competing models; read the terms of service.
  • Confusing distillation with model extraction attacks: distillation for internal cost savings is legitimate; probing external APIs to clone proprietary models crosses policy and legal lines.
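A per-class regression check, as suggested in the pitfalls above, can be as simple as comparing recall per intent between the current production student and a new candidate. This is a minimal sketch; the 0.02 tolerance and the function names are assumptions, and a real suite would also track precision and run on a fixed, versioned eval set.

```python
from collections import Counter

def per_class_recall(gold, predicted):
    """Recall for each label that appears in the gold set."""
    totals, hits = Counter(gold), Counter()
    for g, p in zip(gold, predicted):
        if g == p:
            hits[g] += 1
    return {label: hits[label] / totals[label] for label in totals}

def regressions(baseline, candidate, tolerance=0.02):
    """Labels whose recall dropped by more than `tolerance` versus the baseline run."""
    return sorted(
        label for label, base in baseline.items()
        if candidate.get(label, 0.0) < base - tolerance
    )

gold = ["refund", "refund", "billing", "billing"]
old_run = per_class_recall(gold, ["refund", "refund", "billing", "billing"])
new_run = per_class_recall(gold, ["refund", "billing", "billing", "billing"])
print(regressions(old_run, new_run))  # ['refund']
```

Aggregate accuracy in this toy case only falls from 100% to 75%, while refund recall halves, which is the kind of drop a single headline metric can hide on rare intents.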

Practitioner checklist

  • Document task boundary, schemas, and escalation rules before generating data.
  • Build a human-labeled eval set (500+ examples) independent of training data.
  • Audit teacher outputs; filter policy violations and low-confidence samples.
  • Balance prompt diversity: real logs, paraphrases, hard negatives, multilingual coverage.
  • Train student from a strong instruct base; prefer LoRA for first iterations.
  • Combine sequence-level SFT with logits KL when teacher logits are accessible.
  • Early-stop on eval loss; track per-class recall, not just aggregate accuracy.
  • Quantize after distillation; benchmark INT4 latency on target hardware.
  • Deploy with confidence-based cascade to teacher on edge cases.
  • Schedule recurring re-distillation on fresh production samples to counter drift: monthly where phrasing shifts quickly, as in the Harbor example, and at least quarterly otherwise.

Key takeaways

  • LLM distillation transfers teacher behavior into a smaller student via soft logits, synthetic completions, or hidden-state alignment.
  • It complements, rather than replaces, fine-tuning, quantization, and RAG.
  • Distillation ROI is highest on high-volume, bounded tasks with measurable accuracy.
  • Data quality and prompt diversity matter more than student parameter count alone.
  • Production systems need calibrated cascades so students escalate what they cannot handle.
