
LLM model distillation explained

Frontier language models are powerful and expensive. A 70-billion-parameter teacher can reason through multi-step problems, but you cannot afford to run it on every user request at the edge or inside a latency-sensitive API. Knowledge distillation transfers capability from a large teacher model into a smaller student by training the student to mimic the teacher's outputs — or, more richly, the teacher's probability distribution over tokens. Done well, a 3B or 7B student can outperform a same-size model trained from scratch on the same data. This guide explains how LLM distillation works in practice, how it differs from fine-tuning and quantization, where synthetic data pipelines fit, and the evaluation mistakes that make distilled models look better on paper than in production.

What distillation is (and is not)

Classical knowledge distillation, introduced by Hinton and colleagues for image classifiers, trains a compact student network to match a teacher's soft labels (the full probability vector over classes) rather than only the single correct class. Soft labels encode relational information: "this image is probably a cat, but it also looks a bit like a dog." A student that learns from this "dark knowledge" in the non-top probabilities generalizes better than one trained on hard one-hot targets alone.

For LLMs, the same idea applies at the token level. The teacher assigns a distribution over the vocabulary at each decoding step; the student is penalized when its distribution diverges from the teacher's. That is logits distillation (or distribution matching). A lighter variant is response distillation: the teacher generates full text completions, and the student is fine-tuned to reproduce those strings with standard next-token cross-entropy. Response distillation is easier to pipeline but throws away the rich uncertainty signal in intermediate logits.
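To make the lost signal concrete, here is a minimal sketch for a single token position on a three-word toy vocabulary. The probabilities are made-up illustrative values, and the teacher is assumed to decode greedily. Two hypothetical students that differ only in how they rank the non-top tokens get the same response-distillation loss but different distribution-matching losses.

```python
import math

# Toy vocabulary and teacher distribution (illustrative values, not real model output).
vocab = ["cat", "dog", "car"]
teacher_probs = [0.70, 0.25, 0.05]

# Two hypothetical students that agree on "cat" but rank the rest differently.
student_a = [0.60, 0.30, 0.10]  # agrees with the teacher that "dog" is the runner-up
student_b = [0.60, 0.10, 0.30]  # thinks "car" is the runner-up

def response_loss(student_probs):
    """Cross-entropy against the single token the teacher emitted (greedy decoding assumed)."""
    emitted = teacher_probs.index(max(teacher_probs))
    return -math.log(student_probs[emitted])

def distribution_loss(student_probs):
    """Cross-entropy against the teacher's full distribution over the vocabulary."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))

for name, probs in (("student_a", student_a), ("student_b", student_b)):
    print(name, round(response_loss(probs), 4), round(distribution_loss(probs), 4))
```

Response distillation cannot tell the two students apart, while the full-distribution loss penalizes student_b for its wrong runner-up.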

Distillation is not the same as fine-tuning on human labels, though the pipelines overlap. Fine-tuning adapts weights to a task distribution; distillation specifically transfers knowledge from another model. It is also not the same as post-training quantization, which shrinks weight precision without changing what the model learned, though production stacks often apply distillation first and quantization second for maximum savings.

Teacher-student architecture choices

The teacher is almost always frozen during distillation. You run inference (or cache logits) on a frontier or mid-size model — GPT-4-class APIs, open weights like Llama 70B, or a domain-specialized model you already trust. The student is typically 5–20x smaller by parameter count: a 7B student from a 70B teacher, or a 1.5B student from an 8B teacher.

Capacity mismatch matters. A 125M student cannot absorb everything a 70B teacher knows; distillation works best when the student is large enough to represent the subset of behavior you care about. Microsoft's Phi series and similar "small but capable" models illustrate the pattern: curate high-quality synthetic data from strong teachers, train compact architectures aggressively, and accept that broad world knowledge will be thinner than the teacher's.

Architecture need not be identical. Early distillation research matched CNN layer shapes; with transformers, students often use fewer layers, narrower hidden dimensions, or mixture-of-experts routing to pack capacity efficiently. For logits distillation, what must align is the tokenizer and vocabulary, because the two distributions are compared token by token. Distillation across incompatible tokenizers requires decoding teacher outputs to text and falling back to response distillation.
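The tokenizer constraint can be expressed as a small decision rule. The sketch below assumes each vocabulary is available as a plain token-to-id dictionary; the function names and the toy vocabularies are hypothetical, chosen only for illustration.

```python
def vocab_compatible(teacher_vocab, student_vocab):
    """True when both vocabularies map exactly the same tokens to the same ids."""
    return teacher_vocab == student_vocab

def choose_distillation_mode(teacher_vocab, student_vocab, teacher_logits_available):
    """Pick logits distillation only when distributions can be compared token by token."""
    if teacher_logits_available and vocab_compatible(teacher_vocab, student_vocab):
        return "logits"
    # Otherwise decode teacher outputs to text and train on the strings.
    return "response"

# Toy vocabularies (illustrative only).
shared = {"<s>": 0, "the": 1, "cat": 2}
different = {"<s>": 0, "th": 1, "e": 2, "cat": 3}

print(choose_distillation_mode(shared, dict(shared), teacher_logits_available=True))
print(choose_distillation_mode(shared, different, teacher_logits_available=True))
print(choose_distillation_mode(shared, dict(shared), teacher_logits_available=False))
```

The third call covers API teachers that do not expose logits at all: matching vocabularies do not help if there is no distribution to match.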

Soft labels, temperature, and the loss function

At each token position, the teacher produces logits $z$ over the vocabulary, with $z_i$ the logit for token $i$. Applying temperature $T$ softens the distribution:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$

Higher T spreads probability mass across more tokens, surfacing the teacher's secondary preferences, which is exactly the "dark knowledge" signal students need. Training typically blends two losses: a distillation loss (KL divergence between the student and softened teacher distributions, with the student's logits usually softened by the same T) and a hard-label loss (cross-entropy against ground-truth tokens when human labels exist). The blend coefficient trades off "imitate the teacher" vs "match verified facts."
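A minimal sketch of the blended loss for one token position, in plain Python so the arithmetic is visible. The name alpha for the blend coefficient, its default of 0.5, and the default T = 2 are assumptions for illustration. The T squared factor follows the convention from the original knowledge distillation paper, which keeps the soft-target gradients on a comparable scale as T changes.

```python
import math

def softmax_t(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(logits)  # subtract the max for numerical stability
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, target_index, T=2.0, alpha=0.5):
    """Blend of soft-target KL and hard-label cross-entropy for one token position.

    alpha (blend coefficient) and the default T are illustrative assumptions.
    """
    p_teacher = softmax_t(teacher_logits, T)
    p_student_soft = softmax_t(student_logits, T)
    soft = kl_divergence(p_teacher, p_student_soft) * T * T  # "imitate the teacher"
    p_student = softmax_t(student_logits, 1.0)
    hard = -math.log(p_student[target_index])  # "match verified facts"
    return alpha * soft + (1 - alpha) * hard

teacher = [4.0, 2.0, 0.5]          # made-up logits
untrained_student = [0.0, 0.0, 0.0]
print(distillation_loss(untrained_student, teacher, target_index=0))
print(distillation_loss(teacher, teacher, target_index=0))
```

Setting alpha to 1.0 gives pure imitation, and alpha to 0.0 recovers ordinary supervised training on the ground-truth token.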

Temperature choice is empirical. T = 1 is the usual setting when the teacher serves requests in deployment, while T = 2 to 5 is common during distillation for extracting softer targets. Too high and the distribution becomes nearly uniform, teaching noise; too low and you recover hard-label training with little distillation benefit.
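The effect of temperature can be checked numerically by measuring the entropy of the softened distribution. The sketch below uses made-up teacher logits over a five-token vocabulary and sweeps a few temperatures, including extreme values that are only there to show the two failure modes.

```python
import math

def softmax_t(logits, T):
    """Temperature-scaled softmax (max subtracted for numerical stability)."""
    m = max(logits)
    exps = [math.exp((z - m) / T) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(p):
    """Shannon entropy in nats."""
    return -sum(x * math.log(x) for x in p if x > 0)

teacher_logits = [6.0, 3.0, 1.0, 0.0, -2.0]  # illustrative values only
max_entropy = math.log(len(teacher_logits))  # entropy of the uniform distribution

entropies = {T: entropy(softmax_t(teacher_logits, T)) for T in (0.5, 1.0, 2.0, 5.0, 50.0)}
for T, h in entropies.items():
    print(f"T={T:>4}: entropy={h:.3f} (uniform would be {max_entropy:.3f})")
```

Entropy rises with T: at the low end the target is close to a one-hot label, and at the high end it is close to uniform and carries little information about the teacher's preferences.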

Response distillation and synthetic data pipelines

Most production LLM distillation today is response distillation via synthetic data, because storing teacher logits for billions of tokens is storage-heavy and API teachers do not expose logits at all. The pipeline looks like this:

  1. Prompt curation — collect or generate diverse prompts covering your target tasks (coding, support, summarization, tool use).
  2. Teacher generation — run the teacher with consistent decoding settings; optionally sample multiple completions per prompt for diversity.
  3. Filtering — reject low-quality outputs with heuristics, smaller judge models, or human review; deduplicate near-copies.
  4. Student training — supervised fine-tuning on (prompt, teacher_response) pairs, often with LoRA for efficiency.

Alpaca, Vicuna, and countless enterprise "internal GPT" projects follow this pattern. The quality of the synthetic corpus dominates the outcome: a small set of excellent teacher traces (say, 50k) can beat a far larger set of mediocre ones (say, 5M). Filtering is where teams under-invest: a student trained on hallucinated teacher outputs learns to hallucinate confidently.
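The generation, filtering, and deduplication steps of the pipeline can be sketched as follows. Everything here is a placeholder under stated assumptions: teacher_generate stands in for whatever teacher call you use, the length threshold and banned phrase are arbitrary examples of heuristics, and the deduplication only catches copies that are identical after whitespace and case normalization, which is weaker than true near-duplicate detection.

```python
import hashlib
import re

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return re.sub(r"\s+", " ", text.lower()).strip()

def passes_filters(response, min_chars=20, banned=("as an ai language model",)):
    """Cheap heuristic filter; thresholds and phrases are illustrative assumptions."""
    text = normalize(response)
    if len(text) < min_chars:
        return False
    return not any(phrase in text for phrase in banned)

def build_sft_pairs(prompts, teacher_generate):
    """Run the teacher, filter, deduplicate, and return (prompt, response) records."""
    seen = set()
    pairs = []
    for prompt in prompts:
        response = teacher_generate(prompt)
        if not passes_filters(response):
            continue
        key = hashlib.sha256(normalize(response).encode("utf-8")).hexdigest()
        if key in seen:  # duplicate after normalization
            continue
        seen.add(key)
        pairs.append({"prompt": prompt, "response": response})
    return pairs

# Stub teacher for demonstration; a real pipeline would call a model here.
canned = {
    "p1": "Use a binary search over the sorted list.",
    "p2": "use a  binary search over the sorted LIST.",
    "p3": "ok",
    "p4": "As an AI language model, I cannot help with that request.",
}
pairs = build_sft_pairs(list(canned), canned.get)
print(len(pairs), "of", len(canned), "responses kept")
```

In practice the heuristic filter is usually the first of several stages, followed by judge models or human review for the cases heuristics cannot catch.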

Chain-of-thought distillation extends the idea: the teacher emits reasoning steps before the final answer, and the student learns to reproduce both. That transfers procedural knowledge (how to decompose a problem) that bare answers omit. The trade-off is longer training sequences and inference cost if you keep chain-of-thought at runtime.
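One way to see the sequence-length trade-off is in how a training record is assembled. This is a sketch only: the "Reasoning:" and "Answer:" markers and the function name are hypothetical formatting choices, not a standard.

```python
def format_cot_example(prompt, reasoning, answer, keep_reasoning=True):
    """Build one supervised training record from a teacher trace.

    With keep_reasoning=True the student is trained to emit the steps and the answer;
    with False it only sees the bare answer.
    """
    if keep_reasoning:
        target = f"Reasoning: {reasoning}\nAnswer: {answer}"
    else:
        target = f"Answer: {answer}"
    return {"prompt": prompt, "target": target}

# Illustrative teacher trace.
prompt = "A box holds 12 pens. How many pens are in 7 boxes?"
reasoning = "Each box has 12 pens, so 7 boxes have 12 * 7 = 84 pens."
answer = "84"

with_cot = format_cot_example(prompt, reasoning, answer)
without_cot = format_cot_example(prompt, reasoning, answer, keep_reasoning=False)
print(len(with_cot["target"]), "characters with reasoning")
print(len(without_cot["target"]), "characters without")
```

The longer target is what the student pays for at training time, and again at inference time if it is allowed to generate the reasoning before answering.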

Distillation vs fine-tuning vs quantization

Use this decision lens before committing GPU weeks:

  • Prompting + RAG — cheapest if the teacher API cost is acceptable and latency tolerable; no training.
  • Fine-tuning on human data — best when you have verified labels and no suitable teacher; teaches task format and tone directly.
  • Distillation — best when a stronger model exists (API or open weights), you need a smaller deployable model, and you can afford synthetic data generation.
  • Quantization (GPTQ, AWQ, INT4) — apply after distillation or fine-tuning to cut VRAM and bandwidth; does not add new knowledge.

Stacks compose: distill a 70B teacher's knowledge into a 7B student, then quantize the student to INT4 for on-device inference at the edge. Quantizing an undistilled 7B base model alone often leaves more capability on the table than the combined path.
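A back-of-the-envelope calculation shows why the stages compose. The sketch below counts weight storage only; it ignores the KV cache, activations, and the small overhead of quantization scales, so treat the numbers as lower bounds.

```python
def weight_memory_gb(params_billions, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes); weights only."""
    return params_billions * bits_per_weight / 8

teacher_fp16 = weight_memory_gb(70, 16)   # 70B teacher at 16 bits per weight
student_fp16 = weight_memory_gb(7, 16)    # distilled 7B student, unquantized
student_int4 = weight_memory_gb(7, 4)     # same student quantized to 4 bits

print(f"70B FP16 teacher: {teacher_fp16:.1f} GB")
print(f"7B FP16 student:  {student_fp16:.1f} GB")
print(f"7B INT4 student:  {student_int4:.1f} GB")
print(f"combined reduction: {teacher_fp16 / student_int4:.0f}x")
```

Distillation accounts for the 10x drop in parameter count and quantization for a further 4x, for roughly 40x fewer bytes of weights overall.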

Evaluation pitfalls

Distilled models are easy to over-score. Common traps:

  • Train-test contamination — evaluating on prompts the teacher already saw during synthetic data generation inflates metrics.
  • Teacher-aligned benchmarks — high agreement with the teacher on style does not mean factual accuracy improved.
  • Capability collapse — the student nails the distilled task distribution but forgets general knowledge; run broad suites (MMLU subsets, BBH, domain-specific gold sets).
  • Length and format hacking — students learn verbose teacher mannerisms without substance; use reference-based and LLM-judge metrics cautiously per our evaluation guide.

Hold out a human-verified evaluation set that never touched the synthetic pipeline. Measure latency, tokens per second, and dollar cost per 1k requests alongside quality — distillation wins only when the Pareto frontier actually moves.
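Two of these checks are easy to automate. The sketch below flags evaluation prompts that also appear in the synthetic corpus (exact match after case and whitespace normalization, which misses paraphrases) and converts throughput into dollar cost per 1k requests. The function names, the GPU price, and the throughput figures are illustrative assumptions.

```python
def contaminated_prompts(eval_prompts, synthetic_prompts):
    """Return eval prompts that also appear in the synthetic training prompts."""
    def norm(s):
        return " ".join(s.lower().split())
    seen = {norm(p) for p in synthetic_prompts}
    return [p for p in eval_prompts if norm(p) in seen]

def cost_per_1k_requests(gpu_dollars_per_hour, requests_per_second):
    """Serving cost per 1,000 requests for one GPU at steady throughput."""
    requests_per_hour = requests_per_second * 3600
    return gpu_dollars_per_hour / requests_per_hour * 1000

# Illustrative data only.
synthetic = ["Summarize this ticket.", "Write a SQL query for monthly revenue."]
evaluation = ["summarize  this ticket.", "Explain the refund policy."]
leaks = contaminated_prompts(evaluation, synthetic)
print("contaminated eval prompts:", leaks)

# Hypothetical numbers: a $2/hour GPU serving 10 requests per second.
print(f"${cost_per_1k_requests(2.0, 10):.4f} per 1k requests")
```

Any prompt that shows up in the leak list should be dropped from the evaluation set before quality numbers are reported.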

Production deployment checklist

  • Document teacher version, decoding parameters, and synthetic data filters — students go stale when teachers update.
  • Re-run safety evals; distillation can amplify teacher biases or jailbreak susceptibilities compressed into smaller weights.
  • Monitor output drift in production; fall back to the teacher API for low-confidence requests if economics allow.
  • Pair with quantization and batching optimizations; a distilled 7B INT4 model on a single GPU often replaces a 70B FP16 deployment for narrow tasks.
  • Track regression on tasks outside the distillation corpus — students narrow faster than teachers.
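The teacher fallback in the checklist can be sketched as a simple router. The confidence proxy used here, the geometric-mean token probability of the student's output, is one possible choice and not something the checklist prescribes. The 0.6 threshold and both generate functions are hypothetical placeholders.

```python
import math

def sequence_confidence(token_logprobs):
    """Geometric-mean token probability of a generated sequence (0.0 if empty)."""
    if not token_logprobs:
        return 0.0
    return math.exp(sum(token_logprobs) / len(token_logprobs))

def answer_with_fallback(prompt, student_generate, teacher_generate, threshold=0.6):
    """Serve the student's answer unless its confidence falls below the threshold.

    student_generate must return (text, token_logprobs); teacher_generate returns text.
    """
    text, token_logprobs = student_generate(prompt)
    if sequence_confidence(token_logprobs) >= threshold:
        return text, "student"
    return teacher_generate(prompt), "teacher"

# Stub models for demonstration only.
def confident_student(prompt):
    return "student answer", [-0.05, -0.10]

def unsure_student(prompt):
    return "student guess", [-1.2, -0.9]

def teacher(prompt):
    return "teacher answer"

print(answer_with_fallback("q", confident_student, teacher))
print(answer_with_fallback("q", unsure_student, teacher))
```

Logging which path served each request also gives a direct measure of how often the student is trusted, which feeds the drift monitoring mentioned above.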

Key takeaways

  • Distillation transfers knowledge from a large teacher LLM to a smaller student via soft-label (logits) or response matching.
  • Response distillation through filtered synthetic data is the dominant practical pipeline when teacher logits are unavailable.
  • Temperature-scaled soft labels encode richer supervision than hard tokens alone; blend distillation loss with verified labels when you have them.
  • Distillation complements, rather than replaces, fine-tuning and post-training quantization; combined stacks maximize quality per dollar and per watt.
  • Evaluate on held-out human-verified data and monitor capability collapse outside the distilled task distribution.
