
Knowledge distillation explained

Your team fine-tuned a 70-billion-parameter language model that answers support tickets with 94% accuracy, but each inference costs $0.08 and takes 1.2 seconds. Product wants sub-200ms responses on a $0.002 budget. Reducing weight precision with quantization alone drops accuracy to 81%. Knowledge distillation offers a different path: train a smaller student network to imitate a larger teacher, absorbing not just the hard labels but the teacher’s full probability distribution over classes, the “dark knowledge” carried by soft labels. Popularized by Geoffrey Hinton and colleagues in 2015, distillation is how models like DistilBERT, TinyBERT, and many instruction-tuned small LLMs retain surprising capability at a fraction of the size. This guide covers the teacher-student paradigm, temperature scaling, distillation loss functions, feature-based variants, LLM-specific workflows, and how distillation fits alongside transfer learning, fine-tuning, and supervised learning.

What knowledge distillation is — and what it is not

Distillation is a training procedure, not an architecture. You start with a trained teacher model (large, accurate, slow) and train a smaller student model to match the teacher’s outputs — and optionally its internal representations — on a dataset of inputs. The student learns a compressed approximation of the teacher’s decision boundary.

It is not the same as:

  • Pruning — removing weights or neurons from an existing network without retraining a separate architecture.
  • Quantization — reducing numeric precision (FP32 to INT8) of weights and activations; often combined with distillation for best results.
  • Transfer learning — reusing a pre-trained backbone on a new task; distillation explicitly transfers knowledge from one model to another, which may have a different architecture entirely.

The core insight from Hinton et al. (2015): a teacher’s softmax outputs at high temperature reveal similarity structure between classes that one-hot labels hide. If the teacher assigns 0.7 probability to “cat” and 0.25 to “dog” for an ambiguous image, that relational signal helps the student generalize better than training on “cat” alone.

Soft labels and temperature scaling

A softmax with temperature T converts the logits z_i into class probabilities p_i; standard classification uses T = 1:

$$p_i = \frac{\exp(z_i / T)}{\sum_j \exp(z_j / T)}$$
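As a minimal sketch of the formula above, the function below applies a temperature-scaled softmax to a list of logits. The logits and class names are made up for illustration, and `softmax_with_temperature` is a hypothetical helper name, not a library function.

```python
import math

def softmax_with_temperature(logits, temperature=1.0):
    """Return softmax(logits / temperature) as a list of probabilities."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    # Subtract the maximum before exponentiating for numerical stability.
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical teacher logits for the classes (cat, dog, truck).
teacher_logits = [4.0, 3.0, -2.0]
sharp = softmax_with_temperature(teacher_logits, temperature=1.0)
soft = softmax_with_temperature(teacher_logits, temperature=4.0)
print([round(p, 3) for p in sharp])
print([round(p, 3) for p in soft])
```

The class ranking is the same in both outputs; only the spread of probability mass changes.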

Raising temperature T (typically 2–20) softens the distribution — peak probabilities flatten and secondary classes gain mass. The teacher generates soft targets at temperature T; the student is trained to match those soft targets, also at temperature T. A common combined loss is:

  • Distillation loss — KL divergence (or cross-entropy) between student and teacher soft outputs, weighted by α (often 0.5–0.9).
  • Student loss — cross-entropy against ground-truth hard labels, weighted by (1 − α).

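Hard labels anchor the student to real data; soft labels transfer the teacher’s generalization. Tuning α and T is task-dependent: too high a T flattens the distribution toward uniform and washes out the signal, while T close to 1 leaves a confident teacher’s output nearly one-hot and loses the benefit. Because soft-target gradients scale as 1/T², Hinton et al. (2015) multiply the distillation term by T² so that its relative weight stays roughly constant as T changes.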
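The combined loss can be sketched for a single example as follows. This is an illustrative pure-Python version under stated assumptions, not a reference implementation: the function names and default values (α = 0.7, T = 4) are arbitrary choices, and the T² factor on the soft term follows the scaling suggested in Hinton et al. (2015); drop it if you prefer the plain weighted sum.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(student_logits, teacher_logits, label, alpha=0.7, temperature=4.0):
    # Soft term: teacher and student are both softened at the same temperature.
    p_teacher = softmax(teacher_logits, temperature)
    p_student = softmax(student_logits, temperature)
    soft = kl_divergence(p_teacher, p_student)
    # Hard term: ordinary cross-entropy against the ground-truth class at T = 1.
    hard = -math.log(softmax(student_logits)[label] + 1e-12)
    # Assumed T^2 rescaling of the soft term (Hinton et al., 2015).
    return alpha * temperature ** 2 * soft + (1.0 - alpha) * hard

teacher = [4.0, 3.0, -2.0]          # hypothetical teacher logits
close_student = [3.5, 2.8, -1.5]    # roughly agrees with the teacher
far_student = [-1.0, 0.5, 3.0]      # disagrees with the teacher
print(distillation_loss(close_student, teacher, label=0))
print(distillation_loss(far_student, teacher, label=0))
```

In a real training loop the same quantity would be averaged over a batch and differentiated by the framework.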

Types of distillation

Response-based (logit) distillation

The student matches the teacher’s final output layer — class logits for classifiers, next-token logits for language models. This is the classic Hinton formulation and the easiest to implement: run the teacher in inference mode, cache soft targets, train the student with the combined loss above.

Feature-based distillation

Intermediate hidden states carry rich structure. Methods like FitNets, AT (attention transfer), and PKD (patient knowledge distillation) add losses that align student and teacher representations at chosen layers — often via 1×1 convolutions or linear projections when layer dimensions differ. Feature distillation helps when the student architecture diverges significantly from the teacher (e.g., fewer layers, different width).
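A rough sketch of the alignment idea, assuming a single layer pair and a mean-squared-error penalty: the student’s hidden vector is mapped into the teacher’s width by a linear projection before the two are compared. The vectors and the projection matrix are invented for illustration; in practice the projection would be learned jointly with the student, and the specific loss differs between FitNets, AT, and PKD.

```python
def project(vector, weight):
    """Apply a linear map; weight has shape (out_dim, in_dim)."""
    return [sum(w * x for w, x in zip(row, vector)) for row in weight]

def feature_loss(student_hidden, teacher_hidden, projection):
    """Mean squared error between projected student features and teacher features."""
    projected = project(student_hidden, projection)
    if len(projected) != len(teacher_hidden):
        raise ValueError("projection output must match the teacher width")
    squared = [(p - t) ** 2 for p, t in zip(projected, teacher_hidden)]
    return sum(squared) / len(squared)

# Hypothetical hidden states: student width 2, teacher width 3.
student_hidden = [0.5, -1.0]
teacher_hidden = [0.2, -0.4, 0.9]
# Illustrative 3x2 projection; a trained model would learn these values.
projection = [[1.0, 0.0], [0.0, 0.5], [0.5, -0.5]]
loss = feature_loss(student_hidden, teacher_hidden, projection)
print(loss)
```

This term is typically added to the output-level loss with its own weight rather than used alone.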

Relation-based distillation

Instead of matching individual activations, match relationships between samples or between neurons — distance matrices, correlation structures, or pairwise similarities in embedding space. Useful when absolute feature magnitudes are hard to align across architectures.
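One way to make this concrete, as a sketch under assumptions rather than any specific published loss: compute the pairwise distance matrix of a batch of embeddings for teacher and student, normalize each by its mean distance so absolute scale cancels out, and penalize the difference. The embeddings below are invented, and the student deliberately has a different width from the teacher.

```python
import math

def pairwise_distances(embeddings):
    n = len(embeddings)
    return [[math.dist(embeddings[i], embeddings[j]) for j in range(n)] for i in range(n)]

def normalize(matrix):
    """Divide by the mean off-diagonal distance so embedding scale cancels out."""
    n = len(matrix)
    off_diag = [matrix[i][j] for i in range(n) for j in range(n) if i != j]
    mean = sum(off_diag) / len(off_diag)
    if mean == 0:
        raise ValueError("all embeddings are identical")
    return [[d / mean for d in row] for row in matrix]

def relation_loss(student_emb, teacher_emb):
    """Mean squared difference between normalized distance matrices."""
    s = normalize(pairwise_distances(student_emb))
    t = normalize(pairwise_distances(teacher_emb))
    n = len(s)
    return sum((s[i][j] - t[i][j]) ** 2 for i in range(n) for j in range(n)) / (n * n)

# Hypothetical batch of three samples: teacher width 3, student width 2.
teacher_emb = [[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]]
same_geometry = [[0.0, 0.0], [1.5, 0.0], [0.0, 2.0]]   # same shape at half scale
other_geometry = [[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]]  # samples on a line
print(relation_loss(same_geometry, teacher_emb))
print(relation_loss(other_geometry, teacher_emb))
```

The first student preserves the teacher’s relative distances, so its loss is close to zero even though its widths and magnitudes differ.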

Self-distillation

A model teaches a copy of the same architecture (or a deeper variant teaches a shallower one), or an earlier checkpoint teaches a later one. This is surprisingly effective: teacher and student have similar capacity but follow different training trajectories, and the soft labels act as a regularizer on the student. Common in vision (ResNet self-distillation) and NLP (BERT-to-BERT compression pipelines).

Distillation for large language models

LLM distillation extends the same principles to autoregressive text generation, with extra practical constraints:

  • Teacher data generation — prompt a large teacher (GPT-4, Claude, Llama 70B) on a diverse prompt set; store completions as training targets for the student. This is data distillation or instruction distillation — the student learns to imitate teacher outputs token by token via standard language modeling loss on teacher-generated text.
  • Logit-level distillation — when you have white-box access to the teacher, match per-token distributions (expensive: teacher forward pass on every training step).
  • Chain-of-thought distillation — teacher produces reasoning traces; student learns both answer and intermediate steps, improving smaller model reasoning on math and logic tasks.
  • Speculative decoding synergy — a small draft model (often distilled) proposes tokens; a large verifier accepts or rejects. Distillation quality directly affects acceptance rate and end-to-end latency.
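The teacher data generation step can be sketched as below. Everything here is an assumption for illustration: `teacher_generate` stands in for whatever client or local model call you actually use, `fake_teacher` is a stub, and the cache is a plain dictionary keyed by a hash of the prompt and teacher version so that repeated prompts do not trigger repeated (possibly paid) teacher calls.

```python
import hashlib
import json

def prompt_key(prompt, teacher_version):
    """Stable cache key for a (prompt, teacher version) pair."""
    raw = json.dumps({"prompt": prompt, "teacher": teacher_version}, sort_keys=True)
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()

def build_distillation_set(prompts, teacher_generate, teacher_version, cache):
    """Return prompt/completion records, calling the teacher only on cache misses."""
    records = []
    for prompt in prompts:
        key = prompt_key(prompt, teacher_version)
        if key not in cache:
            cache[key] = teacher_generate(prompt)
        records.append(
            {"prompt": prompt, "completion": cache[key], "teacher": teacher_version}
        )
    return records

# Stub teacher for illustration only; a real one would call a model.
calls = []
def fake_teacher(prompt):
    calls.append(prompt)
    return "ANSWER: " + prompt.upper()

cache = {}
prompts = ["reset my password", "cancel my order", "reset my password"]
dataset = build_distillation_set(prompts, fake_teacher, "teacher-v1", cache)
print(len(dataset), "records from", len(calls), "teacher calls")
```

The student is then trained on these records with an ordinary language modeling loss; recording the teacher version with each record makes later re-distillation easier to audit.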

Models like DistilBERT (40% smaller, 60% faster, ~97% of BERT performance on GLUE) and many “small” instruction models (Phi, Gemma 2B variants, distilled Llama derivatives) demonstrate that aggressive compression is feasible when distillation data is high quality and diverse.

The distillation training pipeline

  1. Train or obtain a strong teacher — distillation cannot transfer knowledge the teacher lacks. Garbage teacher, garbage student.
  2. Choose student architecture — balance capacity gap (student too small cannot fit teacher knowledge) vs deployment constraints (latency, memory, edge hardware).
  3. Build a distillation dataset — training set inputs with teacher soft/hard outputs precomputed (offline) or generated on-the-fly (online, GPU-heavy). Cover the production input distribution; out-of-distribution prompts expose student weaknesses early.
  4. Define the loss — logit KL + hard label CE for classifiers; token CE on teacher text for LLMs; optional feature alignment terms.
  5. Train with regularization — weight decay, dropout, early stopping on a held-out set evaluated against both teacher and ground truth.
  6. Evaluate holistically — accuracy/F1/BLEU on task metrics, latency, memory, and qualitative failure modes. Compare against quantized teacher and student trained without distillation.
  7. Deploy and monitor — track drift; consider periodic re-distillation when the teacher is updated.
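Steps 3 to 5 can be sketched end to end on a toy problem. The example below is a simplified illustration, not a production recipe: the “teacher” is a fixed hypothetical linear classifier, the student is a linear model trained with plain SGD, soft targets are precomputed offline, and the hyperparameters (α = 0.7, T = 2, learning rate 0.1) and the T² scaling of the soft term are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def linear_logits(weights, x):
    """Logits of a linear model with one weight row per class."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

def loss_and_grad(weights, x, label, teacher_probs, alpha, T):
    z = linear_logits(weights, x)
    q_soft, q_hard = softmax(z, T), softmax(z)
    soft = sum(p * math.log(p / q) for p, q in zip(teacher_probs, q_soft))
    hard = -math.log(q_hard[label])
    loss = alpha * T * T * soft + (1.0 - alpha) * hard
    # Gradient with respect to the student logits, then chain rule into weights.
    grad_z = [
        alpha * T * (q_soft[c] - teacher_probs[c])
        + (1.0 - alpha) * (q_hard[c] - (1.0 if c == label else 0.0))
        for c in range(len(z))
    ]
    return loss, [[g * xi for xi in x] for g in grad_z]

def distill(dataset, num_classes, num_features, alpha=0.7, T=2.0, lr=0.1, epochs=50):
    weights = [[0.0] * num_features for _ in range(num_classes)]
    history = []
    for _ in range(epochs):
        epoch_loss = 0.0
        for x, label, teacher_probs in dataset:
            loss, grad = loss_and_grad(weights, x, label, teacher_probs, alpha, T)
            epoch_loss += loss
            for c in range(num_classes):
                for d in range(num_features):
                    weights[c][d] -= lr * grad[c][d]
        history.append(epoch_loss / len(dataset))
    return weights, history

T = 2.0
teacher_weights = [[2.0, 0.0], [0.0, 2.0], [-1.0, -1.0]]  # hypothetical teacher
labelled = [([1.0, 0.0], 0), ([0.0, 1.0], 1), ([-1.0, -1.0], 2), ([0.8, 0.6], 0)]
# Offline distillation set: inputs, hard labels, and cached teacher soft targets.
dataset = [(x, y, softmax(linear_logits(teacher_weights, x), T)) for x, y in labelled]
student_weights, history = distill(dataset, num_classes=3, num_features=2, T=T)
print(f"mean loss: first epoch {history[0]:.3f}, last epoch {history[-1]:.3f}")
```

A real pipeline would add a held-out split, early stopping, and the comparison baselines from step 6; the loop structure stays the same.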

Distillation vs other compression techniques

Technique | What it does | Typical gain | Best paired with
Knowledge distillation | Train smaller model to mimic teacher | 2–10× size reduction with modest accuracy loss | Quantization, pruning
Quantization (PTQ/QAT) | Lower bit-width weights/activations | 2–4× speed/memory; some accuracy loss | Distillation (QAT + KD is SOTA for edge)
Pruning | Remove redundant parameters | Variable; often needs fine-tuning recovery | Distillation or retraining
Architecture search (NAS) | Find efficient architecture | Hardware-aware gains | Distillation to fill capacity

Production stacks often apply distillation first (change architecture and weights), then quantization (compress representation), then optional pruning — each stage validated independently.
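To illustrate the “compress representation” stage that follows distillation, here is a minimal sketch of symmetric per-tensor INT8 quantization applied to a handful of made-up student weights. It is a simplified assumption of how post-training quantization works; real toolchains add calibration, per-channel scales, and activation handling.

```python
def quantize_int8(weights):
    """Symmetric per-tensor quantization: map floats to integers in [-127, 127]."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from integers and the shared scale."""
    return [q * scale for q in quantized]

# Hypothetical weights from a distilled student layer.
student_weights = [0.42, -1.27, 0.03, 0.9]
q, scale = quantize_int8(student_weights)
restored = dequantize(q, scale)
max_error = max(abs(a - b) for a, b in zip(student_weights, restored))
print(q, scale, max_error)
```

Validating this stage independently means re-running the same task metrics on the quantized student, since rounding error of at most half a scale step per weight can still compound across layers.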

Capacity gap and failure modes

The student must be large enough to absorb the teacher’s knowledge. Shrinking a 12-layer transformer to 2 layers may cap performance regardless of distillation quality, because the capacity gap is too wide. As a rough rule of thumb, aim for students with at least 25–50% of the teacher’s parameter count for logit distillation; wider gaps (smaller students) call for feature-based or multi-stage distillation (train a medium student from the large teacher, then a small student from the medium one).
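The multi-stage idea can be turned into simple arithmetic. The helper below is hypothetical: given teacher and student parameter counts and a minimum per-stage size ratio (0.25 here, taken from the low end of the rule of thumb above), it inserts geometrically spaced intermediate sizes so that no single stage shrinks the model by more than that ratio. It only plans sizes; it says nothing about whether a given architecture at that size trains well.

```python
import math

def plan_stages(teacher_params, student_params, min_ratio=0.25):
    """Return a list of model sizes from teacher to student, one per stage boundary."""
    if not 0 < min_ratio < 1:
        raise ValueError("min_ratio must be between 0 and 1")
    if student_params <= 0 or student_params > teacher_params:
        raise ValueError("student must be positive and no larger than the teacher")
    ratio = student_params / teacher_params
    if ratio >= min_ratio:
        return [teacher_params, student_params]  # a single stage is enough
    # Fewest hops such that each hop keeps at least min_ratio of the previous size.
    hops = math.ceil(math.log(ratio) / math.log(min_ratio))
    step = ratio ** (1.0 / hops)
    sizes = [round(teacher_params * step ** i) for i in range(hops)]
    return sizes + [student_params]

# Hypothetical sizes: a 110M-parameter teacher and a 10M-parameter target student.
plan = plan_stages(110_000_000, 10_000_000)
print(plan)
```

For these example numbers the plan contains one intermediate model, so distillation would run twice.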

Other failure modes:

  • Teacher-student domain mismatch — teacher trained on web text, student deployed on legal documents; distill on in-domain data.
  • Biased teacher outputs — student faithfully inherits hallucinations, toxicity, and demographic skew; audit teacher before distilling.
  • Overfitting to teacher errors — the student copies mistakes the teacher makes with high confidence; keeping a ground-truth loss term (tuning α) lets the student correct them and sometimes outperform the teacher on hard labels.
  • Stale teacher — production teacher updated without re-distilling student; silent quality regression.

Common anti-patterns

  • Distilling a weak teacher — compressing mediocrity saves cost but does not create capability.
  • Narrow distillation data — teacher outputs on 500 prompts; student fails on production long tail.
  • Ignoring hard labels — pure soft-target training without ground truth anchors drifts on edge cases.
  • Skipping ablation — no baseline comparing student-without-distillation, quantized teacher, and full teacher on the same eval harness.
  • Black-box API cost explosion — generating millions of teacher completions from a paid API without caching or curriculum sampling.
  • Evaluating only perplexity — low perplexity does not guarantee task win rate, safety, or instruction-following quality.

Production checklist

  • Validate teacher quality on target task before investing in distillation.
  • Size student for deployment SLOs (latency, memory, batch size) then check capacity gap.
  • Build distillation dataset matching production input distribution and edge cases.
  • Tune temperature T and loss weight α on a validation set.
  • Compare student vs quantized teacher vs undistilled student on task metrics.
  • Audit distilled outputs for inherited bias, hallucination, and safety regressions.
  • Document teacher version, distillation data snapshot, and training hyperparameters.
  • Plan re-distillation when teacher or domain shifts; monitor production KPIs.
  • Stack quantization after distillation when edge deployment requires it.
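The documentation and re-distillation items can be supported by a small manifest stored next to the student checkpoint. This is one possible sketch: the class, its field names, and the version strings are all hypothetical, and a real setup would likely persist it as JSON or in an experiment tracker.

```python
from dataclasses import dataclass, field, asdict

@dataclass(frozen=True)
class DistillationManifest:
    """Record of what a student was distilled from and how."""
    teacher_version: str
    data_snapshot: str
    temperature: float
    alpha: float
    extra: dict = field(default_factory=dict)

def needs_redistillation(manifest, current_teacher_version):
    """Flag a stale student when the production teacher has moved on."""
    return manifest.teacher_version != current_teacher_version

# Hypothetical identifiers for illustration.
manifest = DistillationManifest(
    teacher_version="teacher-v1",
    data_snapshot="tickets-2024-01",
    temperature=4.0,
    alpha=0.7,
    extra={"epochs": 3},
)
print(asdict(manifest))
print(needs_redistillation(manifest, "teacher-v2"))
```

Checking the manifest in a deployment pipeline turns the stale-teacher failure mode from a silent regression into an explicit signal.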

Key takeaways

  • Knowledge distillation trains a compact student to mimic a capable teacher — transferring soft label structure, not just hard classes.
  • Temperature scaling exposes inter-class relationships that improve student generalization.
  • Feature and relation distillation help when student and teacher architectures differ significantly.
  • LLM distillation often uses teacher-generated text as training data; quality and diversity of that data dominate outcomes.
  • Combine distillation with quantization for maximum inference efficiency; neither replaces a strong teacher or honest evaluation.
