LLM knowledge distillation explained

Harbor Analytics ran a 70B-parameter teacher model for internal support triage and achieved 91% resolution accuracy on a held-out ticket set — but median time-to-first-token (TTFT) was 2.4 seconds and GPU cost per 1,000 tickets exceeded budget. Quantizing the same 70B checkpoint to FP8 cut latency to 1.1 seconds yet accuracy fell to 84%. Distilling the teacher into an 8B student on 40,000 labeled ticket-response pairs recovered 89% accuracy at 180 ms TTFT — close to teacher quality at roughly one-ninth the inference cost.

Knowledge distillation trains a smaller student model to mimic a larger or more capable teacher. Instead of learning only from hard ground-truth labels, the student matches the teacher’s probability distribution over tokens (soft labels), and optionally its hidden states or intermediate reasoning traces. For production LLM stacks, distillation is how teams ship fast, cheap models that retain most of a frontier model’s behavior on a narrow domain. This guide covers objectives, temperature scaling, distillation variants, a Harbor Analytics refactor walkthrough, a technique decision table versus quantization-only and fine-tuning-only approaches, pitfalls, and a deployment checklist.

What knowledge distillation is (and is not)

Classic supervised fine-tuning minimizes cross-entropy against a single correct token at each position. That is sparse supervision: the model learns that “refund” is correct but receives little signal about plausible alternatives like “reimbursement” or “credit.” A teacher LLM outputs a full vocabulary distribution; distillation treats that distribution as a rich training target.

Distillation is not the same as:

  • Post-training quantization — shrinking weights and activations without changing what the model learned.
  • Pruning — removing parameters or heads from an existing architecture.
  • Speculative decoding — using a small draft model at inference time while a large model verifies tokens; no weight update required.

Distillation does change student weights through additional training. It is closest to fine-tuning, but the supervision signal comes primarily from the teacher rather than (or in addition to) human labels.

Core objective: soft labels and temperature

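Given input x and next-token position t, the teacher produces logits z_T and the student produces logits z_S. With temperature τ, the softened probabilities are:

p_T^τ = softmax(z_T / τ),   p_S^τ = softmax(z_S / τ)

The distillation loss is typically the KL divergence between the teacher's and the student's softened distributions, plus a weighted hard-label cross-entropy term when a ground-truth label y is available:

L = α · KL(p_T^τ || p_S^τ) + (1 − α) · CE(y, p_S)

Here p_S is the student's unsoftened distribution (τ = 1). Many implementations also multiply the KL term by τ² so that its gradient magnitude stays comparable as τ changes.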

Temperature controls how much dark knowledge transfers. At τ = 1 the distribution is the model's ordinary softmax, which is usually peaked on the top token; higher τ (e.g. 2–5) spreads probability mass across synonyms and phrasing variants the teacher considers plausible. For LLM distillation, teams often use τ = 2 for response-level training and tune α between 0.5 and 0.9 depending on label quality.
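The objective can be sketched in plain Python on a toy vocabulary. This is a minimal illustration, not a training implementation: the four-token vocabulary and the logit values are made up, and the function names are hypothetical.

```python
import math

def softmax(logits, tau=1.0):
    """Temperature-scaled softmax over a list of logits."""
    scaled = [z / tau for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def distillation_loss(teacher_logits, student_logits, label_index,
                      tau=2.0, alpha=0.7):
    """alpha * KL(p_T^tau || p_S^tau) + (1 - alpha) * CE(y, p_S)."""
    p_t = softmax(teacher_logits, tau)
    p_s_soft = softmax(student_logits, tau)
    p_s = softmax(student_logits, 1.0)  # hard-label term uses tau = 1
    kl = kl_divergence(p_t, p_s_soft)
    ce = -math.log(p_s[label_index])
    return alpha * kl + (1 - alpha) * ce

# Toy vocabulary (illustrative): refund, reimbursement, credit, banana
teacher = [4.0, 3.0, 2.5, -2.0]
student = [2.0, 0.5, 0.2, 0.1]

sharp = softmax(teacher, tau=1.0)  # peaked on "refund"
soft = softmax(teacher, tau=2.0)   # more mass on the near-synonyms
loss = distillation_loss(teacher, student, label_index=0)
```

Comparing `sharp` and `soft` shows the effect of temperature: the top token loses probability and the plausible alternatives gain it, which is the extra signal the student trains on.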

Dark knowledge is the term Geoffrey Hinton used for the information in non-argmax logits — e.g. a teacher assigning 0.15 probability to “maybe” when the label is “yes” signals calibrated uncertainty the student can internalize.

Distillation variants for LLMs

Response (sequence-level) distillation

The teacher generates full responses (or chain-of-thought traces) on a prompt corpus, and the student is trained on those sequences, either with cross-entropy on the generated tokens or with soft-label KL at each step when teacher logits are stored. This is the most common production pattern because it needs no layer-level alignment between teacher and student: a 70B decoder-only model can distill into a 3B model with a different layer count. Token-level KL does still require a shared vocabulary.

Hidden-state (feature) distillation

Intermediate representations from teacher layer L are matched to student layer L′ via an MSE or cosine loss, sometimes with a learned projection when dimensions differ. Feature distillation helps when the student is much smaller and needs structural hints beyond token probabilities. It requires aligned forward passes and adds memory overhead during training.
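As a rough sketch of the feature-matching loss, assuming a student hidden size of 2 and a teacher hidden size of 3: the projection matrix `W` below is a hypothetical fixed stand-in for what would be a learned parameter, and all values are illustrative.

```python
def project(hidden, weight):
    """Linear projection; weight has shape [teacher_dim][student_dim]."""
    return [sum(w * h for w, h in zip(row, hidden)) for row in weight]

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def feature_loss(student_hidden, teacher_hidden, weight):
    """MSE between the projected student state and the teacher state."""
    return mse(project(student_hidden, weight), teacher_hidden)

student_hidden = [0.5, -1.0]        # student layer L' output (dim 2)
teacher_hidden = [1.0, 0.0, -0.5]   # teacher layer L output (dim 3)
W = [[1.0, 0.0],                    # hypothetical projection, normally learned
     [0.0, 1.0],
     [0.5, 0.5]]

loss = feature_loss(student_hidden, teacher_hidden, W)
```

In practice this term is added to the token-level loss with its own weight, and the projection is trained jointly with the student.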

Attention transfer

A lighter variant matches attention maps or value vectors between teacher and student heads. Useful in vision and smaller language models; less common at 7B+ scale because head layouts rarely align one-to-one.

Self-distillation

The same model (or an earlier checkpoint) acts as teacher. Self-distillation with temperature on a model’s own logits can improve calibration and generalization without a larger external teacher — but it does not reduce model size. It is a quality boost, not a compression technique.

On-policy (online) distillation

The student generates rollouts during training; the teacher scores or relabels those trajectories. On-policy distillation reduces exposure bias (where the student only sees teacher prefixes) but costs more because the teacher runs on student outputs every step. Common in RL-style alignment pipelines and tool-use fine-tuning.
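A toy version of one on-policy step is sketched below. `student_probs` and `teacher_probs` are hypothetical stand-ins for real forward passes, the three-token vocabulary is invented, and the forward KL is used only to stay consistent with the objective described earlier; some on-policy methods use reverse KL or a Jensen-Shannon variant instead.

```python
import math
import random

VOCAB = ["ok", "refund", "<eos>"]

def student_probs(prefix):
    """Hypothetical student forward pass: next-token distribution."""
    return [0.6, 0.2, 0.2] if not prefix else [0.2, 0.2, 0.6]

def teacher_probs(prefix):
    """Hypothetical teacher forward pass on the same prefix."""
    return [0.2, 0.7, 0.1] if not prefix else [0.1, 0.1, 0.8]

def kl(p, q):
    """KL(p || q) for strictly positive q."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def on_policy_step(rng, max_len=4):
    """Sample a rollout from the student; score every step with the teacher."""
    prefix, losses = [], []
    for _ in range(max_len):
        p_s = student_probs(prefix)
        p_t = teacher_probs(prefix)  # teacher sees the student's own prefix
        losses.append(kl(p_t, p_s))
        token = rng.choices(VOCAB, weights=p_s, k=1)[0]
        prefix.append(token)
        if token == "<eos>":
            break
    return prefix, sum(losses) / len(losses)

rollout, mean_kl = on_policy_step(random.Random(0))
```

The extra cost mentioned above is visible in the loop: the teacher is called once per generated token on prefixes that cannot be precomputed.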

Data pipeline: what to distill on

Distillation quality is bounded by prompt coverage and teacher correctness. A practical corpus mixes:

  • Task-specific prompts — real user queries, support tickets, code repos, or domain documents the student must handle in production.
  • Hard negatives — prompts where the base student fails but the teacher succeeds; oversampling these regions improves tail behavior.
  • Diverse phrasings — paraphrased inputs so the student generalizes beyond exact prompt memorization.
  • Filtered teacher outputs — discard responses that fail automated checks (format, citation, safety) before they become training targets.

For Harbor Analytics, the team exported 40,000 anonymized support tickets, ran the 70B teacher with a structured JSON output schema, and kept only responses that passed a validator and matched human agent resolution in a 500-ticket audit sample. That filtering step mattered more than increasing corpus size from 40k to 120k unfiltered examples.
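A filtering step of this kind might look like the following sketch. The field names in `REQUIRED_FIELDS`, the record layout, and the sample tickets are assumptions for illustration; the source does not specify Harbor Analytics' actual schema or validator.

```python
import json

# Hypothetical schema: the fields a valid teacher response must contain.
REQUIRED_FIELDS = {"category", "resolution", "policy_citation"}

def is_valid(raw):
    """True if raw parses as a JSON object with all required non-empty strings."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return False
    if not isinstance(obj, dict) or not REQUIRED_FIELDS.issubset(obj):
        return False
    return all(isinstance(obj[k], str) and obj[k].strip() for k in REQUIRED_FIELDS)

def filter_teacher_outputs(records):
    """Keep only records whose teacher response passes the validator."""
    return [r for r in records if is_valid(r["teacher_response"])]

def audit_agreement(kept, human_resolutions):
    """Fraction of audited tickets where teacher and human resolutions match."""
    audited = [r for r in kept if r["ticket_id"] in human_resolutions]
    if not audited:
        return 0.0
    matches = sum(
        json.loads(r["teacher_response"])["resolution"] == human_resolutions[r["ticket_id"]]
        for r in audited
    )
    return matches / len(audited)

records = [
    {"ticket_id": 1, "teacher_response": '{"category": "billing", "resolution": "refund", "policy_citation": "P-12"}'},
    {"ticket_id": 2, "teacher_response": '{"category": "billing", "resolution": "refund"}'},
    {"ticket_id": 3, "teacher_response": "not json"},
    {"ticket_id": 4, "teacher_response": '{"category": "login", "resolution": "reset", "policy_citation": "P-3"}'},
]
human = {1: "refund", 4: "escalate"}  # illustrative audit sample

kept = filter_teacher_outputs(records)
agreement = audit_agreement(kept, human)
```

A low agreement rate on the audit sample is a signal to tighten the validator or fix the teacher prompt before generating the full corpus.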

Harbor Analytics support-bot refactor (worked example)

Starting point: 70B instruction-tuned teacher on 8x H100, FP8 inference, 1.1 s median TTFT, $0.42 per 1,000 tickets at fleet utilization.

Goal: Sub-300 ms TTFT on a single L40S per replica while keeping resolution accuracy above 87%.

  1. Student selection — 8B base model in the same tokenizer family as the teacher to avoid vocabulary mismatch.
  2. Corpus generation — teacher responses on 40k tickets with sampling temperature 0.7 for phrasing diversity; store only the top-k logits (k=50) per token rather than the full vocabulary to reduce storage.
  3. Training — 3 epochs, τ=2, α=0.7, LoRA rank 64 on attention projections, distributed across 4x A100; batch size 128 sequences of 2,048 tokens.
  4. Evaluation — resolution accuracy, schema validity, hallucinated policy citations, and TTFT on a 2,000-ticket holdout.
  5. Deployment — merged LoRA weights, INT8 weight-only quantization for extra throughput, served through the existing inference gateway.
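The top-k logit caching in step 2 can be sketched as follows. The vocabulary size of 128,000, the 2-byte logit and 4-byte token-id widths, and the example logits are assumptions chosen for illustration, not figures from the Harbor Analytics pipeline.

```python
import math

def top_k_logits(logits, k):
    """Return the k largest logits as (token_id, logit) pairs."""
    ranked = sorted(enumerate(logits), key=lambda pair: pair[1], reverse=True)
    return ranked[:k]

def softmax_over_cached(cached, tau=1.0):
    """Softmax restricted to cached tokens; mass outside top-k is treated as zero."""
    scaled = [logit / tau for _, logit in cached]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return {token_id: e / total for (token_id, _), e in zip(cached, exps)}

def bytes_per_token(vocab_size, k, logit_bytes=2, id_bytes=4):
    """Storage per token position: full-vocab logits vs. top-k (id, logit) pairs."""
    full = vocab_size * logit_bytes
    cached = k * (logit_bytes + id_bytes)
    return full, cached

logits = [0.1, 3.2, -1.0, 2.7, 0.0, 1.5]  # toy 6-token vocabulary
cached = top_k_logits(logits, k=3)
probs = softmax_over_cached(cached, tau=2.0)

# Assumed sizes: 128,000-token vocabulary, k=50 as in step 2.
full_bytes, cached_bytes = bytes_per_token(vocab_size=128_000, k=50)
```

Renormalizing over the cached tokens slightly overstates their probabilities because the tail mass is dropped; with k=50 that tail is usually small, but it is worth measuring on a sample.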

Outcome: 89% resolution accuracy (vs 91% for the teacher and 82% for the base 8B), 180 ms TTFT, $0.05 per 1,000 tickets. Hard-label-only fine-tuning on the same 40k pairs reached 86%, so the soft-label signal recovered most of the remaining gap to the teacher without running the 70B model at inference time.
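The outcome figures can be checked with simple arithmetic. The sketch below uses only numbers quoted in this walkthrough; the cost and latency ratios are taken against the FP8 starting point ($0.42 per 1,000 tickets, 1.1 s TTFT), which is an assumption about the intended baseline.

```python
# Accuracy in percentage points, from the walkthrough.
teacher_acc = 91
distilled_acc = 89
hard_label_acc = 86

# Share of the hard-label-to-teacher gap closed by adding soft labels.
gap_closed = (distilled_acc - hard_label_acc) / (teacher_acc - hard_label_acc)

# Cost per 1,000 tickets: FP8 teacher starting point vs. distilled student.
cost_ratio = 0.42 / 0.05

# Median TTFT in milliseconds: FP8 teacher starting point vs. student.
ttft_speedup = 1100 / 180

print(f"gap closed: {gap_closed:.0%}")
print(f"cost ratio: {cost_ratio:.1f}x")
print(f"TTFT speedup: {ttft_speedup:.1f}x")
```

On these numbers, soft labels close three of the five points separating hard-label fine-tuning from the teacher.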

Technique decision table

Goal | Prefer | Why not the alternative
Cut inference cost 5–10x on a fixed task | Distill into smaller student | Quantization alone rarely closes a full size-class gap
Keep same model, speed up matmuls | FP8 / INT4 quantization | Distillation requires retraining and a teacher
Teach new facts from a document corpus | RAG or fine-tuning on sources | Distillation copies teacher behavior, not raw documents
Match frontier reasoning on broad open domain | Larger teacher or API model at runtime | Small students lose out-of-distribution breadth
Improve calibration without shrinking | Self-distillation | Does not reduce memory or latency
Draft-verify speedup at inference | Speculative decoding | Still loads the large model; no training pass
Domain style and tone from examples | Fine-tuning + optional distillation | Hard labels alone miss synonym structure
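The first row can be made concrete with a back-of-the-envelope estimate of weight memory. This is a rough sketch that counts weights only (no KV cache, activations, or quantization scale overhead) and treats 1 GB as 10^9 bytes.

```python
def weight_gb(params_billion, bits_per_weight):
    """Approximate weight memory in GB: parameters x bits / 8."""
    return params_billion * bits_per_weight / 8

configs = {
    "70B FP16": (70, 16),
    "70B FP8": (70, 8),
    "70B INT4": (70, 4),
    "8B FP16": (8, 16),
    "8B INT8": (8, 8),
}

for name, (params, bits) in configs.items():
    print(f"{name}: ~{weight_gb(params, bits):.0f} GB")
```

Even at 4 bits, the 70B model needs more weight memory than the 8B model at 16 bits, which is why quantization and distillation are usually combined rather than treated as substitutes.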

Common pitfalls

  • Distilling teacher mistakes — unfiltered teacher outputs encode hallucinations; audit before training.
  • Prompt distribution shift — student trained on synthetic prompts fails on messy real user text.
  • Overfitting soft labels — too many epochs memorizes teacher phrasing verbatim; monitor held-out paraphrases.
  • Tokenizer mismatch — different vocabularies between teacher and student break logit alignment.
  • Ignoring safety alignment — distilling an aligned teacher into an unaligned base can surface suppressed behaviors; include safety evals.
  • Expecting general intelligence transfer — students inherit task-specific skill, not the teacher’s full world knowledge.
  • Storing full teacher logits — vocab-sized tensors explode storage; top-k logit caching is usually enough.
  • Skipping hard-label mixing — pure KL on noisy teacher outputs drifts; keep a CE term on verified labels.
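The tokenizer-mismatch pitfall is cheap to test before any training run. The sketch below compares two token-to-id mappings represented as plain dictionaries; the vocabularies shown are toy examples, and loading the real mappings from a tokenizer library is left out.

```python
def vocab_mismatches(teacher_vocab, student_vocab):
    """Return tokens that are missing from one vocabulary or mapped to different ids."""
    tokens = teacher_vocab.keys() | student_vocab.keys()
    return sorted(t for t in tokens if teacher_vocab.get(t) != student_vocab.get(t))

# Toy vocabularies (illustrative only).
teacher_vocab = {"refund": 0, "credit": 1, "<eos>": 2}
student_vocab = {"refund": 0, "credit": 2, "<eos>": 1, "maybe": 3}

mismatches = vocab_mismatches(teacher_vocab, student_vocab)
compatible = not mismatches
```

Any non-empty result means cached teacher logits cannot be aligned index-for-index with student logits, so token-level KL would compare the wrong entries.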

Production checklist

  • Define task-specific success metrics (accuracy, format validity, latency, cost per 1k requests).
  • Select student architecture with compatible tokenizer and sufficient capacity for the task.
  • Build prompt corpus from production logs with PII scrubbing and consent policy review.
  • Generate teacher responses; filter with validators and human spot-checks on a sample.
  • Cache top-k teacher logits or run teacher forward pass online during training.
  • Train with tuned τ and α; compare against hard-label-only baseline.
  • Evaluate on held-out real prompts including adversarial and out-of-distribution cases.
  • Run safety regression suite before replacing teacher in production path.
  • Deploy student with appropriate quantization; keep teacher available for shadow evaluation.
  • Monitor drift weekly; schedule re-distillation when teacher or policy updates.
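Several checklist items can be turned into an automated release gate. The sketch below is one possible shape: the thresholds mirror the Harbor Analytics goals (accuracy above 87%, TTFT under 300 ms), while the metric names and the `safety_failures` field are hypothetical.

```python
# Thresholds taken from the walkthrough goals; key names are illustrative.
THRESHOLDS = {"min_accuracy": 0.87, "max_ttft_ms": 300}

def release_gate(metrics, thresholds=THRESHOLDS):
    """Return the list of failed checks; an empty list means the student may ship."""
    failures = []
    if metrics["accuracy"] <= thresholds["min_accuracy"]:
        failures.append("accuracy below target")
    if metrics["ttft_ms"] >= thresholds["max_ttft_ms"]:
        failures.append("TTFT above target")
    if metrics["accuracy"] <= metrics["hard_label_baseline_accuracy"]:
        failures.append("no gain over hard-label baseline")
    if metrics["safety_failures"] > 0:
        failures.append("safety regression")
    return failures

student_metrics = {
    "accuracy": 0.89,                      # from the walkthrough
    "ttft_ms": 180,                        # from the walkthrough
    "hard_label_baseline_accuracy": 0.86,  # from the walkthrough
    "safety_failures": 0,                  # hypothetical value
}

failures = release_gate(student_metrics)
```

Returning the list of failures rather than a boolean makes the gate's output usable directly in a deployment log or a CI message.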

Key takeaways

  • Knowledge distillation transfers a teacher LLM’s soft probability structure to a smaller student, not just argmax labels.
  • Temperature scaling exposes dark knowledge in non-top tokens; mix KL loss with hard-label cross-entropy for stability.
  • Response-level distillation is the default production pattern; feature distillation helps when architectures align.
  • Corpus quality and teacher output filtering matter more than raw example count.
  • Harbor Analytics reached 89% resolution accuracy, against the teacher's 91%, at roughly one-ninth the inference cost by distilling 70B into 8B on filtered support tickets.
