LLM knowledge distillation explained
Harbor Analytics ran a 70B-parameter teacher model for internal support triage and achieved 91% resolution accuracy on a held-out ticket set — but median time-to-first-token (TTFT) was 2.4 seconds and GPU cost per 1,000 tickets exceeded budget. Quantizing the same 70B checkpoint to FP8 cut latency to 1.1 seconds yet accuracy fell to 84%. Distilling the teacher into an 8B student on 40,000 labeled ticket-response pairs recovered 89% accuracy at 180 ms TTFT — close to teacher quality at roughly one-ninth the inference cost.
Knowledge distillation trains a smaller student model to mimic a larger or more capable teacher. Instead of learning only from hard ground-truth labels, the student matches the teacher’s probability distribution over tokens (soft labels), and optionally its hidden states or intermediate reasoning traces. For production LLM stacks, distillation is how teams ship fast, cheap models that retain most of a frontier model’s behavior on a narrow domain. This guide covers objectives, temperature scaling, distillation variants, a Harbor Analytics refactor walkthrough, a technique decision table versus quantization-only and fine-tuning-only approaches, pitfalls, and a deployment checklist.
What knowledge distillation is (and is not)
Classic supervised fine-tuning minimizes cross-entropy against a single correct token at each position. That is sparse supervision: the model learns that “refund” is correct but receives little signal about plausible alternatives like “reimbursement” or “credit.” A teacher LLM outputs a full vocabulary distribution; distillation treats that distribution as a rich training target.
Distillation is not the same as:
- Post-training quantization — shrinking weights and activations without changing what the model learned.
- Pruning — removing parameters or heads from an existing architecture.
- Speculative decoding — using a small draft model at inference time while a large model verifies tokens; no weight update required.
Distillation does change student weights through additional training. It is closest to fine-tuning, but the supervision signal comes primarily from the teacher rather than (or in addition to) human labels.
Core objective: soft labels and temperature
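Given input $x$ and next-token position $t$, the teacher produces logits $z_T$ and the student produces $z_S$. With temperature $\tau$, the softened probabilities are:
$$p_T^{\tau} = \mathrm{softmax}(z_T / \tau), \qquad p_S^{\tau} = \mathrm{softmax}(z_S / \tau)$$
The distillation loss is typically the KL divergence between the teacher and student softened distributions, plus a weighted hard-label cross-entropy term when ground truth $y$ is available:
$$\mathcal{L} = \alpha \cdot \mathrm{KL}\left(p_T^{\tau} \,\|\, p_S^{\tau}\right) + (1 - \alpha) \cdot \mathrm{CE}(y, p_S)$$
Here $p_S$ is the student's unsoftened distribution ($\tau = 1$), and $\alpha$ weights the soft-label term against the hard-label term.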
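The combined objective can be sketched for a single token position in plain Python. This is a minimal illustration under stated assumptions, not a training-ready implementation: the function names are hypothetical, the logits are plain lists, and the defaults (τ = 2, α = 0.7) simply mirror values used elsewhere in this guide. Many implementations also multiply the KL term by τ² to keep gradient magnitudes comparable across temperatures; the formula above does not include that factor, so the sketch omits it too.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )

def distillation_loss(teacher_logits, student_logits, label, tau=2.0, alpha=0.7):
    """alpha * KL(p_T^tau || p_S^tau) + (1 - alpha) * CE(y, p_S) for one position."""
    p_teacher = softmax(teacher_logits, tau)       # softened teacher distribution
    p_student_soft = softmax(student_logits, tau)  # softened student distribution
    p_student = softmax(student_logits, 1.0)       # unsoftened student distribution
    kl = kl_divergence(p_teacher, p_student_soft)
    ce = -math.log(p_student[label] + 1e-12)       # hard-label cross-entropy
    return alpha * kl + (1 - alpha) * ce
```

In a real pipeline the same computation would run over batched tensors in an autodiff framework and be averaged over all token positions; the list-based version only makes the individual terms visible.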
Temperature controls how much dark knowledge transfers. Low τ (e.g. 1) peaks the distribution on the top token; high τ (e.g. 2–5) spreads probability mass across synonyms and phrasing variants the teacher considers plausible. For LLM distillation, teams often use τ = 2 for response-level training and tune α between 0.5 and 0.9 depending on label quality.
Dark knowledge is the term Geoffrey Hinton used for the information in non-argmax logits — e.g. a teacher assigning 0.15 probability to “maybe” when the label is “yes” signals calibrated uncertainty the student can internalize.
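To see the effect of temperature on non-argmax probabilities, the following sketch softens one set of logits at several values of τ and reports the entropy of each result. The logits and the token names in the comment are invented for illustration; only the trend matters: as τ rises, the top token loses mass and the alternatives gain it.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def entropy(probs):
    """Shannon entropy in nats; higher means a flatter distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# Hypothetical teacher logits for: "refund", "reimbursement", "credit", "banana"
logits = [4.0, 2.5, 2.0, -1.0]

for tau in (1.0, 2.0, 5.0):
    probs = softmax(logits, tau)
    rounded = [round(p, 3) for p in probs]
    print(f"tau={tau}: probs={rounded} entropy={entropy(probs):.3f}")
```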
Distillation variants for LLMs
Response (sequence-level) distillation
The teacher generates full responses (or chain-of-thought traces) on a prompt corpus; the student is trained to match those token sequences via soft-label KL at each step. This is the most common production pattern because it needs no architectural alignment between teacher and student — a 70B decoder-only model can distill into a 3B model with a different layer count.
Hidden-state (feature) distillation
Intermediate representations from teacher layer L are matched to student layer L′ via an MSE or cosine loss, sometimes with a learned projection when dimensions differ. Feature distillation helps when the student is much smaller and needs structural hints beyond token probabilities. It requires aligned forward passes and adds memory overhead during training.
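A rough sketch of the feature-matching loss for one token position follows. The helper names are hypothetical, and the projection matrix is passed in as a fixed list of rows; in actual training it would be a learned parameter updated together with the student.

```python
import math

def project(vec, weight):
    """Linear projection; weight is a list of rows with shape (out_dim, in_dim)."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cosine_loss(a, b, eps=1e-12):
    """1 - cosine similarity; near 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b + eps)

def feature_mse_loss(student_hidden, teacher_hidden, weight):
    """MSE between the projected student state and the teacher state."""
    return mse(project(student_hidden, weight), teacher_hidden)

# Toy example: a 2-dim student state projected up to a 3-dim teacher state.
weight = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
loss = feature_mse_loss([1.0, 2.0], [1.0, 2.0, 5.0], weight)
```

The memory overhead mentioned above comes from keeping the teacher activations for layer L alive for every position in the batch while the student's backward pass runs.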
Attention transfer
A lighter variant matches attention maps or value vectors between teacher and student heads. Useful in vision and smaller language models; less common at 7B+ scale because head layouts rarely align one-to-one.
Self-distillation
The same model (or an earlier checkpoint) acts as teacher. Self-distillation with temperature on a model’s own logits can improve calibration and generalization without a larger external teacher — but it does not reduce model size. It is a quality boost, not a compression technique.
On-policy (online) distillation
The student generates rollouts during training; the teacher scores or relabels those trajectories. On-policy distillation reduces exposure bias (the mismatch that arises when the student trains only on teacher-written prefixes but must continue its own outputs at inference time), but it costs more because the teacher runs on student outputs at every step. It is common in RL-style alignment pipelines and tool-use fine-tuning.
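The rollout-then-score loop can be illustrated with toy stand-ins for both models. In this sketch `student_probs` and `teacher_probs` are hypothetical callables that map a token prefix to a next-token distribution; real systems would call model forward passes instead. The key point is that the teacher is queried on prefixes the student itself produced.

```python
import math
import random

def sample(probs, rng):
    """Draw one token id from a discrete distribution."""
    r = rng.random()
    acc = 0.0
    for token, p in enumerate(probs):
        acc += p
        if r < acc:
            return token
    return len(probs) - 1  # guard against floating-point rounding

def kl(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions."""
    return sum(pi * math.log((pi + eps) / (qi + eps)) for pi, qi in zip(p, q) if pi > 0)

def on_policy_records(prompt, student_probs, teacher_probs, max_new_tokens, rng):
    """Roll out the student and record the teacher's distribution at each step."""
    prefix = list(prompt)
    records = []
    for _ in range(max_new_tokens):
        p_student = student_probs(prefix)
        p_teacher = teacher_probs(prefix)  # teacher scores the student's own prefix
        token = sample(p_student, rng)     # next token comes from the student
        records.append({
            "prefix": tuple(prefix),
            "token": token,
            "teacher_probs": p_teacher,
            "loss": kl(p_teacher, p_student),
        })
        prefix.append(token)
    return records
```

Each record pairs a student-visited state with a teacher target, so the per-step losses can be averaged and backpropagated through the student. The extra cost is visible in the loop: one teacher call per generated token.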
Data pipeline: what to distill on
Distillation quality is bounded by prompt coverage and teacher correctness. A practical corpus mixes:
- Task-specific prompts — real user queries, support tickets, code repos, or domain documents the student must handle in production.
- Hard negatives — prompts where the base student fails but the teacher succeeds; oversampling these regions improves tail behavior.
- Diverse phrasings — paraphrased inputs so the student generalizes beyond exact prompt memorization.
- Filtered teacher outputs — discard responses that fail automated checks (format, citation, safety) before they become training targets.
For Harbor Analytics, the team exported 40,000 anonymized support tickets, ran the 70B teacher with a structured JSON output schema, and kept only responses that passed a validator and matched human agent resolution in a 500-ticket audit sample. That filtering step mattered more than increasing corpus size from 40k to 120k unfiltered examples.
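A filter of this kind might look like the sketch below. The schema fields (`category`, `resolution`, `reply`) and the `teacher_response` key are assumptions made for illustration, since the guide does not specify Harbor Analytics' actual schema or validator. The sketch covers only the automated format check, not the human audit.

```python
import json

# Hypothetical output schema: field name -> required type.
REQUIRED_FIELDS = {"category": str, "resolution": str, "reply": str}

def is_valid_response(raw):
    """True if raw parses as a JSON object with all required non-empty fields."""
    try:
        obj = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return False
    if not isinstance(obj, dict):
        return False
    for field, expected_type in REQUIRED_FIELDS.items():
        value = obj.get(field)
        if not isinstance(value, expected_type) or not value.strip():
            return False
    return True

def filter_corpus(examples):
    """Keep only examples whose teacher response passes the validator."""
    return [ex for ex in examples if is_valid_response(ex["teacher_response"])]
```

Logging the rejection rate alongside the kept examples is a cheap way to notice when a prompt category systematically breaks the teacher's output format.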
Harbor Analytics support-bot refactor (worked example)
Starting point: 70B instruction-tuned teacher on 8x H100, FP8 inference, 1.1 s median TTFT, $0.42 per 1,000 tickets at fleet utilization.
Goal: Sub-300 ms TTFT on a single L40S per replica while keeping resolution accuracy above 87%.
- Student selection — 8B base model in the same tokenizer family as the teacher to avoid vocabulary mismatch.
- Corpus generation — teacher responses on 40k tickets at sampling temperature 0.7 for phrasing diversity; store only the top-k teacher logits (k=50) per token to reduce storage versus the full vocabulary.
- Training — 3 epochs, τ = 2, α = 0.7, LoRA rank 64 on attention projections via distributed training on 4x A100; batch size 128 sequences of 2,048 tokens.
- Evaluation — resolution accuracy, schema validity, hallucinated policy citations, and TTFT on a 2,000-ticket holdout.
- Deployment — merged LoRA weights, INT8 weight-only quant for extra throughput, served through the existing inference gateway.
Outcome: 89% resolution accuracy (vs 91% teacher, 82% base 8B fine-tuned on hard labels only), 180 ms TTFT, $0.05 per 1,000 tickets. Hard-label-only fine-tuning on the same 40k pairs reached 86% — the soft-label signal recovered most of the gap to the teacher without running the 70B model at inference time.
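The top-k logit caching used in the corpus generation step can be sketched as follows, together with a back-of-the-envelope storage comparison. The vocabulary size (128,000), the 2-byte logit values, and the 4-byte token indices are assumptions for illustration; the guide states only k = 50.

```python
import math

def top_k_logits(logits, k):
    """Return (indices, values) of the k largest logits, in descending order."""
    idx = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return idx, [logits[i] for i in idx]

def softmax_over_top_k(values, tau=1.0):
    """Teacher distribution renormalized over the cached top-k tokens only."""
    scaled = [v / tau for v in values]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def storage_bytes(num_tokens, entries_per_token, bytes_per_entry):
    """Raw storage for cached teacher outputs, ignoring compression."""
    return num_tokens * entries_per_token * bytes_per_entry

# Assumed sizes: 128,000-token vocab at 2 bytes per logit, versus
# 50 cached entries at 6 bytes each (2-byte value + 4-byte token index).
full_per_token = storage_bytes(1, 128_000, 2)
topk_per_token = storage_bytes(1, 50, 6)
```

Under those assumptions the cache shrinks from 256,000 bytes to 300 bytes per token position. The trade-off is that renormalizing over the top-k entries discards whatever probability mass the teacher placed on the remaining tokens, so the cached target is an approximation of the full distribution.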
Technique decision table
| Goal | Prefer | Why not the alternative |
|---|---|---|
| Cut inference cost 5–10x on a fixed task | Distill into smaller student | Quantization alone rarely closes a full size-class gap |
| Keep same model, speed up matmuls | FP8 / INT4 quantization | Distillation requires retraining and a teacher |
| Teach new facts from a document corpus | RAG or fine-tuning on sources | Distillation copies teacher behavior, not raw documents |
| Match frontier reasoning on broad open domain | Larger teacher or API model at runtime | Small students lose out-of-distribution breadth |
| Improve calibration without shrinking | Self-distillation | Does not reduce memory or latency |
| Draft-verify speedup at inference | Speculative decoding | Still loads the large model; no training pass |
| Domain style and tone from examples | Fine-tuning + optional distillation | Hard labels alone miss synonym structure |
Common pitfalls
- Distilling teacher mistakes — unfiltered teacher outputs encode hallucinations; audit before training.
- Prompt distribution shift — student trained on synthetic prompts fails on messy real user text.
- Overfitting soft labels — too many epochs memorizes teacher phrasing verbatim; monitor held-out paraphrases.
- Tokenizer mismatch — different vocabularies between teacher and student break logit alignment.
- Ignoring safety alignment — distilling an aligned teacher into an unaligned base can surface suppressed behaviors; include safety evals.
- Expecting general intelligence transfer — students inherit task-specific skill, not the teacher’s full world knowledge.
- Storing full teacher logits — vocab-sized tensors explode storage; top-k logit caching is usually enough.
- Skipping hard-label mixing — pure KL on noisy teacher outputs drifts; keep a CE term on verified labels.
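The tokenizer-mismatch pitfall is cheap to check before any training run. The sketch below assumes each vocabulary is available as a plain token-to-id dictionary (many tokenizer libraries can export one); the function name is hypothetical.

```python
def vocab_mismatches(teacher_vocab, student_vocab):
    """Return tokens whose ids differ or that appear in only one vocabulary.

    Values are (teacher_id, student_id) pairs; None marks a missing token.
    """
    mismatched = {}
    for token in teacher_vocab.keys() | student_vocab.keys():
        teacher_id = teacher_vocab.get(token)
        student_id = student_vocab.get(token)
        if teacher_id != student_id:
            mismatched[token] = (teacher_id, student_id)
    return mismatched

# Toy vocabularies for illustration only.
teacher_vocab = {"refund": 0, "credit": 1, "maybe": 2}
student_vocab = {"refund": 0, "credit": 2, "yes": 3}
problems = vocab_mismatches(teacher_vocab, student_vocab)
```

An empty result means teacher logits at index i and student logits at index i refer to the same token, which is what the per-token KL term assumes.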
Production checklist
- Define task-specific success metrics (accuracy, format validity, latency, cost per 1k requests).
- Select student architecture with compatible tokenizer and sufficient capacity for the task.
- Build prompt corpus from production logs with PII scrubbing and consent policy review.
- Generate teacher responses; filter with validators and human spot-checks on a sample.
- Cache top-k teacher logits or run teacher forward pass online during training.
- Train with tuned τ and α; compare against a hard-label-only baseline.
- Evaluate on held-out real prompts, including adversarial and out-of-vocabulary cases.
- Run safety regression suite before replacing teacher in production path.
- Deploy student with appropriate quantization; keep teacher available for shadow evaluation.
- Monitor drift weekly; schedule re-distillation when teacher or policy updates.
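Part of this checklist can be automated as a release gate. The sketch below uses the Harbor Analytics targets from the worked example (resolution accuracy above 87%, TTFT under 300 ms) as thresholds. Treating the latency target as a median, and the function and key names, are assumptions made for illustration.

```python
from statistics import median

# Thresholds taken from the worked example's stated goal.
THRESHOLDS = {"min_accuracy": 0.87, "max_median_ttft_ms": 300.0}

def passes_gate(accuracy, ttft_samples_ms, thresholds=THRESHOLDS):
    """Return (ok, checks), where checks maps each metric to pass/fail."""
    median_ttft = median(ttft_samples_ms)
    checks = {
        "accuracy": accuracy > thresholds["min_accuracy"],
        "ttft": median_ttft < thresholds["max_median_ttft_ms"],
    }
    return all(checks.values()), checks

# Distilled student from the worked example: 89% accuracy, about 180 ms TTFT.
ok, checks = passes_gate(0.89, [170.0, 180.0, 190.0])
```

A fuller gate would add the other checklist metrics, such as schema validity and the safety regression suite, as further entries in `checks`.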
Key takeaways
- Knowledge distillation transfers a teacher LLM’s soft probability structure to a smaller student, not just argmax labels.
- Temperature scaling exposes dark knowledge in non-top tokens; mix KL loss with hard-label cross-entropy for stability.
- Response-level distillation is the default production pattern; feature distillation helps when architectures align.
- Corpus quality and teacher output filtering matter more than raw example count.
- Harbor Analytics reached 89% resolution accuracy (versus 91% for the teacher) at roughly one-ninth the inference cost by distilling the 70B teacher into an 8B student on filtered support tickets.