Speculative decoding explained

Harbor Support's tier-1 chat gateway served a 70B instruction model at 14 tokens per second on a single A100: fine for short replies, painful for multi-paragraph troubleshooting. Engineering tried wider GQA batches and FlashAttention kernels first, but decode still dominated because each new token required a full forward pass through all 80 layers. The breakthrough was speculative decoding: a small 7B draft model proposes several candidate tokens per step, and the 70B target verifies them in one batched forward pass. Median time-to-first-token stayed flat, while sustained throughput rose to 22 tok/s (−38% latency on 200-token answers), with an identical output distribution when acceptance sampling is implemented correctly.

Speculative decoding is an inference technique that uses a fast draft model to guess upcoming tokens and a larger target model to accept or reject those guesses, amortizing expensive target forward passes across multiple output tokens.

This guide covers the draft-and-verify loop, acceptance sampling math, draft model selection, Medusa and EAGLE lookahead variants, integration with vLLM continuous batching, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist. It is meant to be read alongside our KV cache guide.

Why decode is the bottleneck

LLM inference splits into two phases. Prefill processes the entire prompt in parallel — matrix dimensions are large, GPUs stay busy. Decode generates one token at a time: each step attends over the growing sequence, updates the KV cache, and runs a skinny matmul through every layer. Batch size is often 1 for interactive chat, so memory bandwidth — not FLOPs — caps throughput. Speculative decoding attacks decode directly: instead of N target forward passes for N tokens, you hope to verify k draft tokens in a single target pass.
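
The bandwidth argument can be made concrete with a back-of-envelope bound: if every decode step has to stream all model weights from GPU memory once, the decode rate cannot exceed memory bandwidth divided by weight size. The sketch below is only that rough bound. The function name and the example figures (70e9 parameters, 2 bytes per parameter, 2.0e12 bytes/s) are illustrative assumptions, not Harbor Support measurements.

```python
def decode_rate_upper_bound(n_params: float, bytes_per_param: float,
                            mem_bandwidth_bytes_per_s: float) -> float:
    """Upper bound on batch-1 decode rate in tokens per second.

    Assumes every decode step streams all weights from GPU memory once and
    ignores KV-cache reads, activations, and kernel launch overhead.
    """
    weight_bytes = n_params * bytes_per_param
    return mem_bandwidth_bytes_per_s / weight_bytes


if __name__ == "__main__":
    # Illustrative assumptions: 70e9 parameters, 2 bytes each (FP16),
    # 2.0e12 bytes/s of memory bandwidth.
    print(f"{decode_rate_upper_bound(70e9, 2.0, 2.0e12):.1f} tok/s")
```

Under this model, halving the bytes per parameter doubles the bound, which is why quantization helps decode. Speculative decoding takes the other route and spreads one weight read over several output tokens.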

The speedup formula

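Let α be the average number of draft tokens accepted per verification step, k the draft lookahead length, and c the cost ratio (time for one target forward pass divided by time for one draft forward pass). Each verification step emits the α accepted tokens plus one token from the target itself (the resampled token, or a bonus token when every draft token is accepted), and costs one target pass plus k draft passes, which is 1 + k / c in units of a target pass. Rough expected speedup versus naive autoregressive decode:

speedup ≈ (α + 1) / (1 + k / c)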

When draft and target share vocabulary and the draft is well-aligned, acceptance rates of 2–3 tokens per step are common — enough to justify the extra GPU memory for a second model.
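
A small sketch of a speedup estimate under an explicit cost model, which is an assumption of this example: each verification step costs one target pass plus k draft passes, and emits the accepted draft tokens plus one token from the target. The function name and the sample value c = 10 are hypothetical; real cost ratios should be measured, not inferred from parameter counts.

```python
def estimated_speedup(accepted_per_step: float, k: int, c: float) -> float:
    """Rough speedup over plain autoregressive decode.

    accepted_per_step: average draft tokens accepted per verification step.
    k: draft tokens proposed per step.
    c: time of one target forward pass divided by time of one draft pass.
    """
    if k < 1 or c <= 0:
        raise ValueError("k must be >= 1 and c must be > 0")
    tokens_per_step = accepted_per_step + 1.0  # plus the target's own token
    step_cost_in_target_passes = 1.0 + k / c   # one verify pass + k draft passes
    return tokens_per_step / step_cost_in_target_passes


if __name__ == "__main__":
    for k in (4, 8):
        print(k, round(estimated_speedup(2.4, k, 10.0), 2))
```

Holding the acceptance count fixed while raising k only adds draft cost, so the estimate falls. In practice acceptance also changes with k, which is why both are worth measuring together.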

Draft-and-verify loop

At each iteration the system runs the following cycle:

Draft phase. The small model autoregressively generates k candidate tokens from the current prefix (often k = 4–8).
Verify phase. The target model runs one forward pass on the concatenated prefix + all k draft tokens, producing logits at each position.
Acceptance. Test each draft token against the target distribution at its position, in order; keep the draft tokens up to the first rejection; resample the rejected position from the corrected distribution.
Cache update. Append the accepted tokens, plus the resampled token (or a bonus token sampled from the target when all k are accepted), to the KV caches of both models, and discard cache entries for rejected positions.

Crucially, when implemented with the standard rejection sampling scheme, the output token sequence is distributed exactly as if the target model had generated it alone. Speculative decoding is not an approximation unless you deliberately relax the acceptance rules.
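
The cycle above can be sketched with toy models. This example uses the greedy (temperature 0) special case, where accepting a draft token whenever it equals the target's argmax is exact. Everything here is a stand-in: `target_next` and `draft_next` are made-up deterministic functions rather than real models, there are no KV caches, and the "single batched verify pass" is simulated by counting one step per verification round.

```python
from typing import Callable, List, Sequence, Tuple

VOCAB = 50
NextToken = Callable[[Sequence[int]], int]


def target_next(prefix: Sequence[int]) -> int:
    """Toy stand-in for the target model's greedy next token."""
    return (7 * sum(prefix) + 3 * len(prefix)) % VOCAB


def draft_next(prefix: Sequence[int]) -> int:
    """Toy draft: agrees with the target except when len(prefix) % 3 == 0."""
    t = target_next(prefix)
    return t if len(prefix) % 3 else (t + 1) % VOCAB


def greedy_decode(model: NextToken, prompt: Sequence[int], n_new: int) -> List[int]:
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_greedy_decode(target: NextToken, draft: NextToken,
                              prompt: Sequence[int], n_new: int,
                              k: int = 4) -> Tuple[List[int], int]:
    out = list(prompt)
    goal = len(out) + n_new
    verify_steps = 0
    while len(out) < goal:
        # Draft phase: propose k tokens autoregressively.
        proposal: List[int] = []
        for _ in range(k):
            proposal.append(draft(out + proposal))
        # Verify phase: a real system scores all k + 1 positions in one
        # batched target pass; here we just count it as one step.
        verify_steps += 1
        accepted: List[int] = []
        for i in range(k + 1):
            expected = target(out + accepted)
            accepted.append(expected)  # accepted draft token, correction, or bonus
            if i == k or proposal[i] != expected:
                break
        out.extend(accepted)
    return out[:goal], verify_steps


if __name__ == "__main__":
    prompt = [1, 2]
    spec, steps = speculative_greedy_decode(target_next, draft_next, prompt, 20)
    assert spec == greedy_decode(target_next, prompt, 20)
    print(f"20 tokens in {steps} verify steps")
```

The assertion is the point of the sketch: the speculative output matches target-only decoding token for token, while the number of target verification steps is smaller than the number of tokens produced.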

Acceptance sampling

For draft token x at position t, let p(x) be the target probability and q(x) the draft probability. Accept with probability min(1, p(x)/q(x)). If rejected, sample a replacement token for that position from the adjusted distribution norm(max(0, p(·) − q(·))) and discard the remaining draft tokens. This preserves the target's distribution at every position and allows variable-length accepts per step: sometimes all k draft tokens pass, sometimes only one or none.
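
A minimal sketch of the accept-or-resample rule for a single position, with an empirical check that the emitted tokens follow p even though the proposals come from q. The distributions, function names, and sample count are illustrative assumptions. Real implementations work on batched logits tensors, not Python lists.

```python
import random
from typing import List, Sequence, Tuple


def residual(p: Sequence[float], q: Sequence[float]) -> List[float]:
    """Normalized max(0, p - q), the distribution used after a rejection."""
    diff = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    total = sum(diff)
    if total <= 0.0:  # p == q, so a rejection cannot actually happen
        return list(p)
    return [d / total for d in diff]


def verify_token(x: int, p: Sequence[float], q: Sequence[float],
                 rng: random.Random) -> Tuple[int, bool]:
    """Apply the acceptance test to draft token x; return (token, accepted)."""
    if rng.random() < min(1.0, p[x] / q[x]):
        return x, True
    replacement = rng.choices(range(len(p)), weights=residual(p, q), k=1)[0]
    return replacement, False


def empirical_check(p: Sequence[float], q: Sequence[float], n: int,
                    seed: int = 0) -> Tuple[List[float], float]:
    """Draw n draft tokens from q, verify each, return (frequencies, accept rate)."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    accepted = 0
    for _ in range(n):
        x = rng.choices(range(len(q)), weights=q, k=1)[0]
        token, ok = verify_token(x, p, q, rng)
        counts[token] += 1
        accepted += ok
    return [c / n for c in counts], accepted / n


if __name__ == "__main__":
    p = [0.5, 0.3, 0.2]  # target distribution (made up)
    q = [0.2, 0.2, 0.6]  # draft distribution (made up)
    freqs, rate = empirical_check(p, q, 100_000)
    print([round(f, 3) for f in freqs], round(rate, 3))
```

The per-token acceptance probability equals the overlap between the two distributions, the sum of min(p(x), q(x)) over the vocabulary, which is why draft alignment drives the speedup.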

Choosing a draft model

The draft does not need to match the target architecture, but alignment matters more than raw speed:

Same tokenizer and vocabulary — mismatched vocabs make acceptance sampling undefined; always pair models trained on the same tokenizer.
Family alignment — a Llama-3-8B draft paired with Llama-3-70B outperforms a random small model because conditional distributions overlap on typical continuations.
Size ratio — drafts 5–15× smaller than the target balance acceptance rate against drafting cost; Harbor Support uses 7B → 70B (10×).
Quantization — INT8 or FP8 draft models free VRAM for the target; see our quantization guide for calibration tradeoffs.
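
The tokenizer requirement is easy to check mechanically before pairing two checkpoints. The sketch below compares two token-to-id mappings. The helper name and its arguments are hypothetical, and the mappings are assumed to be plain dictionaries (with Hugging Face tokenizers, `get_vocab()` returns one).

```python
from typing import Dict, Iterable, List


def vocab_problems(draft_vocab: Dict[str, int], target_vocab: Dict[str, int],
                   special_tokens: Iterable[str] = (),
                   max_reports: int = 10) -> List[str]:
    """Return human-readable mismatches; an empty list means the vocabs agree."""
    problems: List[str] = []
    if len(draft_vocab) != len(target_vocab):
        problems.append(
            f"vocab size: draft {len(draft_vocab)} vs target {len(target_vocab)}")
    for token in special_tokens:
        if token not in draft_vocab or token not in target_vocab:
            problems.append(f"special token {token!r} missing from one vocab")
    for token, target_id in target_vocab.items():
        if len(problems) >= max_reports:  # cap the report length
            break
        draft_id = draft_vocab.get(token)
        if draft_id != target_id:
            problems.append(
                f"{token!r}: target id {target_id}, draft id {draft_id}")
    return problems


if __name__ == "__main__":
    target = {"<s>": 0, "</s>": 1, "hello": 2, "world": 3}
    swapped = dict(target, hello=3, world=2)
    print(vocab_problems(swapped, target, ["<s>", "</s>"]))
```

Matching sizes alone are not enough: two vocabularies of equal size with permuted ids would pass a size check and still make p(x)/q(x) compare unrelated tokens.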

Learned draft heads (Medusa, EAGLE)

Instead of a separate small model, Medusa attaches multiple lightweight prediction heads to the target's hidden states, proposing several future tokens without an extra full model. EAGLE trains a small autoregressive head on target features for higher acceptance rates. These approaches save memory (no second weights file) but require training or fine-tuning; off-the-shelf draft checkpoints are faster to ship.

Harbor Support refactor

Before speculative decoding, Harbor Support's vLLM deployment ran the 70B model alone with continuous batching and PagedAttention. P99 latency on 150-token answers was 11.2 seconds. The refactor added:

A quantized 7B draft on the same GPU, using spare capacity on tensor-parallel shard 0.
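num_speculative_tokens set to 8 draft tokens per step (in vLLM, speculative_max_model_len is a different setting: the sequence length beyond which speculation is skipped).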
Per-request acceptance telemetry logged to Prometheus (spec_accepted_tokens_total, spec_draft_steps_total).
Fallback to non-speculative decode when draft VRAM is contended during traffic spikes.

Results after one week: median latency −38%, P99 −22%, acceptance rate 2.4 tokens/step on support tickets (procedural text with predictable phrasing). Creative writing endpoints stayed on target-only decode where acceptance dropped below 1.3 tokens/step.
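
The tokens/step figure can be derived from the two counters named above. Because counters are cumulative, the rate over an interval comes from the difference between two scrapes. The sketch below assumes spec_accepted_tokens_total and spec_draft_steps_total are monotonically increasing counters that reset to zero on process restart; the function name and sample values are made up for illustration.

```python
from typing import Optional


def accepted_per_step(accepted_prev: float, accepted_now: float,
                      steps_prev: float, steps_now: float) -> Optional[float]:
    """Accepted draft tokens per draft step between two counter scrapes.

    Returns None when the interval is unusable: a counter went backwards
    (process restart) or no draft steps ran.
    """
    d_accepted = accepted_now - accepted_prev
    d_steps = steps_now - steps_prev
    if d_accepted < 0 or d_steps <= 0:
        return None
    return d_accepted / d_steps


if __name__ == "__main__":
    # Made-up scrape values: 2400 accepted tokens over 1000 draft steps.
    print(accepted_per_step(1000, 3400, 500, 1500))
```

Computing this per endpoint, not globally, is what makes it possible to see that one traffic class accepts well while another does not.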

Technique decision table

Technique	Strength	Weakness	Best for
Separate-model speculative decoding	Exact target distribution; no retraining	Extra VRAM for draft weights	Production chat on large targets with aligned small checkpoints
Medusa / multi-head lookahead	No second model file	Requires head training; lower accept on OOD text	Single-model deployments with fine-tune budget
EAGLE feature draft	Higher acceptance than vanilla draft	Training pipeline complexity	High-QPS APIs where even a 0.5 tok/s gain matters
Larger batch only	Simple; no algorithm change	Does not help batch-1 interactive latency	Offline batch inference
Quantized target (INT4/FP8)	Lower memory; faster matmuls	Quality regression risk	VRAM-constrained single-GPU setups
Prompt caching / prefix reuse	Speeds prefill on long system prompts	Does not accelerate decode phase	RAG with static context blocks

Common pitfalls

Tokenizer mismatch — draft and target must share identical vocab; mixing model families without verification silently corrupts outputs.
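Skipping exact sampling — greedy acceptance (“take draft if argmax matches”) changes the output distribution whenever the target samples with temperature above zero; it is exact only for greedy decoding. Implement proper rejection sampling for sampled outputs.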
Ignoring draft KV cache sync — after partial acceptance, rewind draft cache to the last agreed prefix or regenerate from scratch.
Fixed k regardless of acceptance — adaptive lookahead (shorter k when acceptance drops) saves wasted verify FLOPs on creative prompts.
VRAM oversubscription — loading draft + target can OOM; quantize the draft or offload draft to a second GPU with NVLink.
No telemetry — without acceptance-rate dashboards you cannot tell whether speculative decode is helping or burning GPU cycles.
Spec decode on prefill — speculative decoding applies to decode only; do not run draft loops during prompt ingestion.

Production checklist

Confirm draft and target share tokenizer, vocab size, and special tokens.
Implement rejection sampling; add unit tests that compare speculative vs naive output distributions on a fixed seed set.
Log acceptance rate, draft steps per output token, and verify latency separately.
Benchmark with production prompt length distribution — not just synthetic short prompts.
Set adaptive k or disable speculative decode when acceptance < 1.5 tokens/step for 5 minutes.
Reserve VRAM headroom; define graceful fallback to target-only on OOM.
Pair with continuous batching in vLLM or TGI; verify speculative paths work under concurrent requests.
Version draft checkpoints alongside target; block deploy if draft revision drifts without A/B acceptance test.
Document latency SLO impact in runbooks; speculative decode changes P99 variance characteristics.
Re-evaluate quarterly as target model updates — draft alignment degrades when target is fine-tuned without draft refresh.
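
The "disable when acceptance stays low" item can be sketched as a small gate that tracks how long the acceptance rate has been under the threshold. The class name, its defaults (1.5 tokens/step, 300 seconds), and the immediate re-enable on recovery are assumptions for illustration.

```python
from typing import Optional


class SpeculationGate:
    """Disable speculation after acceptance stays below a threshold for hold_s."""

    def __init__(self, threshold: float = 1.5, hold_s: float = 300.0) -> None:
        self.threshold = threshold
        self.hold_s = hold_s
        self._below_since: Optional[float] = None

    def observe(self, now_s: float, accepted_per_step: float) -> bool:
        """Record one sample; return True if speculation should stay enabled."""
        if accepted_per_step >= self.threshold:
            self._below_since = None  # recovered: reset the timer
            return True
        if self._below_since is None:
            self._below_since = now_s  # first sample under the threshold
        return (now_s - self._below_since) < self.hold_s


if __name__ == "__main__":
    gate = SpeculationGate()
    for t, rate in [(0, 2.4), (60, 1.2), (200, 1.3), (360, 1.1), (420, 2.0)]:
        print(t, rate, gate.observe(t, rate))
```

This version re-enables speculation on the first healthy sample. Adding a separate, higher re-enable threshold would avoid flapping on endpoints that hover near the cutoff.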

Key takeaways

Decode is memory-bandwidth bound at batch size 1; speculative decoding amortizes target forward passes across multiple tokens.
Proper acceptance sampling preserves the target model's exact output distribution — it is a speed trick, not a quality compromise.
Draft-target alignment (same family, same tokenizer) matters more than draft raw speed.
Medusa and EAGLE trade training complexity for memory savings versus a separate draft model.
Measure acceptance rate per endpoint; disable speculative decode where creative variance kills acceptance.