Guide
LLM FP8 inference explained
Harbor Analytics' internal chat gateway served a 70B model in BF16 on four H100s. Throughput plateaued at 1,200 tokens per second with batch size 32 — enough for dashboards but not for the overnight batch summarization jobs that product wanted. Engineers first tried aggressive INT4 weight quantization via GPTQ. Memory halved, but finance and SQL prompts regressed: wrong decimal places, dropped GROUP BY clauses, and brittle JSON tool outputs. Switching to FP8 inference on Hopper tensor cores recovered most of the speedup while keeping MMLU-style accuracy within 0.3 points of BF16. FP8 is not a magic “free half precision” button — it is a narrow, hardware-coupled format that only pays off when scaling, kernel fusion, and eval gates are done correctly.
FP8 (8-bit floating point) keeps the sign, exponent, and mantissa structure of FP16 but packs it into 8 bits, enabling up to roughly 2× GEMM throughput on NVIDIA Hopper and Blackwell tensor cores when both operands are FP8. Unlike integer post-training quantization, FP8 preserves floating-point semantics (dynamic range via exponents), which tends to hurt reasoning tasks less than INT4. This guide covers E4M3 vs E5M2 formats, per-tensor scaling, weight-only vs weight-and-activation FP8, interaction with the KV cache and FlashAttention kernels, the Harbor Analytics gateway refactor, a technique decision table, pitfalls, and a production checklist.
What FP8 is and why it exists
Standard LLM serving runs matrix multiplies in FP16 or BF16. Each multiply-accumulate loads 16 bits per weight and activation. Hopper-class GPUs expose dedicated FP8 tensor core instructions that perform the same GEMMs at 8 bits per operand, roughly doubling math throughput when memory bandwidth is not the sole bottleneck.
Two FP8 encodings dominate LLM work:
- FP8 E4M3 — 4 exponent bits, 3 mantissa bits. Higher precision, smaller dynamic range. Typically used for weights and sometimes forward activations where values are well-scaled.
- FP8 E5M2 — 5 exponent bits, 2 mantissa bits. Wider range, lower precision. Often used for gradients in training and for activation tensors with heavy outliers during inference.
Inference stacks commonly store weights in E4M3, cast activations to E4M3 or E5M2 per layer, and accumulate partial sums in FP16 or FP32. Both formats are implemented in NVIDIA's Transformer Engine and supported by vLLM, TensorRT-LLM, and PyTorch 2.x inductor paths. Without Hopper-or-newer hardware, FP8 kernels either fall back to slower emulation or are unavailable entirely.
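To make the precision-versus-range trade concrete, the sketch below derives the limits of each encoding from its bit layout. It assumes the E4M3 variant commonly used for inference (no infinities, only the all-ones pattern reserved for NaN) and an IEEE-style E5M2; the function name `fp8_limits` is illustrative, not a library API.

```python
def fp8_limits(exp_bits: int, man_bits: int, ieee_like: bool) -> dict:
    """Largest finite value, smallest normal/subnormal, and relative step.

    ieee_like=True  reserves the top exponent code for inf/NaN (E5M2).
    ieee_like=False keeps the top exponent code for finite values and
    gives up only its all-ones mantissa as NaN (assumed E4M3 variant).
    """
    bias = 2 ** (exp_bits - 1) - 1
    top_code = 2 ** exp_bits - 1
    if ieee_like:
        max_exp = (top_code - 1) - bias
        max_frac = 1 + (2 ** man_bits - 1) / 2 ** man_bits
    else:
        max_exp = top_code - bias
        max_frac = 1 + (2 ** man_bits - 2) / 2 ** man_bits
    min_normal = 2.0 ** (1 - bias)
    return {
        "max": max_frac * 2.0 ** max_exp,
        "min_normal": min_normal,
        "min_subnormal": min_normal / 2 ** man_bits,
        "relative_step": 2.0 ** -man_bits,  # spacing between neighbours, relative to value
    }


E4M3 = fp8_limits(4, 3, ieee_like=False)
E5M2 = fp8_limits(5, 2, ieee_like=True)
print(E4M3["max"], E5M2["max"])  # 448.0 57344.0
```

E4M3 resolves values in steps of 1/8 of their magnitude but saturates at 448; E5M2 reaches 57,344 at the cost of 1/4 steps. That gap is why scaling matters more for E4M3 tensors.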
Per-tensor scaling: the piece teams skip
An unscaled FP8 tensor cannot cover the dynamic range of a 70B transformer layer: E4M3 saturates at 448 and E5M2 at 57,344, and both lose small values to underflow. Production FP8 therefore uses per-tensor (or per-block) scaling factors: before quantizing a tensor to FP8, divide by a scale s, typically the tensor's absolute maximum (or a high percentile, to ignore outliers) divided by the format's largest finite value. After the GEMM, multiply the result by the product of the input and weight scales.
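The scale-quantize-rescale flow can be sketched in pure Python. This is a simulation under stated assumptions, not a kernel: `round_to_e4m3` emulates E4M3 rounding with saturation at 448, the scale is taken from the absolute maximum, and a dot product stands in for the GEMM. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0      # largest finite E4M3 value
E4M3_MIN_EXP = -6     # exponent of the smallest normal E4M3 value
MANTISSA_BITS = 3


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    exp = max(math.frexp(mag)[1] - 1, E4M3_MIN_EXP)
    step = 2.0 ** (exp - MANTISSA_BITS)
    return sign * round(mag / step) * step  # round() rounds half to even


def quantize(values):
    """Per-tensor quantization: one scale so that amax maps to E4M3_MAX."""
    amax = max(abs(v) for v in values)
    scale = amax / E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def scaled_dot(x, w):
    """Dot product on quantized operands, rescaled by both scales afterwards."""
    qx, sx = quantize(x)
    qw, sw = quantize(w)
    acc = sum(a * b for a, b in zip(qx, qw))  # accumulate in higher precision
    return acc * sx * sw


x = [0.02, -1.5, 3.0, 0.4]
w = [0.5, 0.25, -0.125, 1.0]
exact = sum(a * b for a, b in zip(x, w))
print(f"exact={exact:.4f} fp8={scaled_dot(x, w):.4f}")
```

Swapping the absolute maximum for a percentile only changes how `scale` is chosen; values above the chosen bound then saturate at 448 instead of stretching the scale for every other element.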
Scaling can be:
- Static — computed once offline from calibration batches and baked into the checkpoint. Low runtime overhead; risky if production prompts differ from calibration data.
- Dynamic — recomputed each forward pass from live activation statistics. Slightly more compute; better accuracy on diverse workloads.
Bad scaling is the main reason FP8 outputs diverge from BF16. A single layer with an underestimated scale clips values to FP8 max and injects structured errors downstream. Harbor's eval suite now blocks deploys when any layer's dynamic scale drift exceeds 8% week-over-week — an early signal of data distribution shift or bad calibration shards.
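A minimal sketch of both signals follows: the clipping rate that results from an underestimated static scale, and a week-over-week drift check with the 8% threshold mentioned above. The layer names and scale values are made up for illustration and do not come from Harbor's telemetry.

```python
E4M3_MAX = 448.0


def clipping_rate(values, scale):
    """Fraction of values whose scaled magnitude exceeds the E4M3 maximum."""
    clipped = sum(1 for v in values if abs(v) / scale > E4M3_MAX)
    return clipped / len(values)


def drifted_layers(prev_scales, curr_scales, threshold=0.08):
    """Layers whose scale moved by more than `threshold` relative to last week."""
    flagged = {}
    for layer, prev in prev_scales.items():
        drift = abs(curr_scales[layer] - prev) / prev
        if drift > threshold:
            flagged[layer] = drift
    return flagged


# Static scale calibrated when the largest activation seen was 4.0 ...
static_scale = 4.0 / E4M3_MAX
# ... but production traffic now produces larger values.
rate = clipping_rate([0.5, 3.9, 6.0, -9.0], static_scale)

# Hypothetical per-layer dynamic scales from two consecutive weeks.
last_week = {"layers.0.mlp": 0.0100, "layers.1.mlp": 0.0200}
this_week = {"layers.0.mlp": 0.0105, "layers.1.mlp": 0.0230}
flagged = drifted_layers(last_week, this_week)
print(rate, sorted(flagged))
```

Here half of the production values clip under the stale static scale, and only the second layer (about 15% drift) would trip the gate.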
Weight-only vs weight-and-activation FP8
Weight-only FP8 (WFP8)
Weights are stored and loaded as FP8; activations stay in BF16/FP16. Kernels typically dequantize the weight tensor to the activation dtype at compute time, because FP8 tensor core instructions need both operands in FP8. This is the safest first step: roughly 2× weight memory savings and a meaningful speedup on memory-bound decode, where weight loads dominate, with modest accuracy risk.
Weight-and-activation FP8 (W/A FP8)
Both operands to GEMMs are FP8. This gives maximum tensor core utilization and throughput, but activations with outliers (common in attention logits and in the residual stream around layer norm) need careful E5M2 casting or delayed quantization. Pair with fused attention kernels that keep sensitive ops in higher precision inside the fusion region.
KV cache precision
The KV cache often remains BF16 even when weights are FP8. FP8 KV blocks exist in research and some serving builds but can amplify position-dependent drift on 32K+ contexts. Treat FP8 KV as a separate experiment with needle-in-haystack and long-dialog evals before enabling fleet-wide.
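To see what is at stake in memory terms, the sketch below estimates KV cache size per sequence. The layer count, KV head count, and head dimension are assumptions for a typical 70B model with grouped-query attention, not figures from this guide.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem, batch=1):
    """Bytes needed to cache keys and values for `batch` sequences."""
    # Factor 2 covers one key tensor plus one value tensor per layer.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem * seq_len * batch


# Assumed model shape (hypothetical 70B with grouped-query attention).
SHAPE = dict(n_layers=80, n_kv_heads=8, head_dim=128)

bf16 = kv_cache_bytes(32_768, bytes_per_elem=2, **SHAPE)
fp8 = kv_cache_bytes(32_768, bytes_per_elem=1, **SHAPE)
print(bf16 / 2**30, fp8 / 2**30)  # 10.0 5.0 (GiB per 32K-token sequence)
```

Under these assumptions FP8 KV blocks save about 5 GiB per 32K-token sequence, which is the capacity gain to weigh against the long-context drift risk.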
FP8 vs BF16, FP16, and INT4
| Format | Typical use | Memory vs FP16 | Accuracy profile |
|---|---|---|---|
| BF16 / FP16 | Baseline serving, broad GPU support | 1× | Reference quality |
| FP8 (W/A) | Hopper+ high-throughput serving | ~0.5× weights; faster GEMMs | Usually within 0.1–0.5 pts on broad benchmarks; watch math/SQL |
| INT8 | CPU/older GPU, some edge | ~0.5× | Good for chat; can hurt rare tokens |
| INT4 (GPTQ/AWQ) | Single-GPU 70B, cost minimization | ~0.25× | Larger task variance; strong on 7B–13B |
FP8 occupies a middle ground: more headroom than INT4, more speed than BF16 on supported hardware. It does not replace INT4 when the goal is squeezing a 70B model onto the smallest possible GPU: FP8 weights for 70B still need roughly 70 GB, versus roughly 35 GB at INT4 (which itself exceeds a 24 GB consumer card without offloading or a smaller model). FP8 wins when you already have H100s and want higher tokens/sec per dollar, not when you are trying to eliminate a GPU from the bill.
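The weight-memory figures follow from simple arithmetic, sketched here. The estimate counts weights only and ignores scale factors, quantization group metadata, the KV cache, and activation memory.

```python
def weight_gb(n_params: float, bits_per_weight: int) -> float:
    """Weight storage in GB (10**9 bytes), ignoring scales and metadata."""
    return n_params * bits_per_weight / 8 / 1e9


for name, bits in [("BF16", 16), ("FP8", 8), ("INT4", 4)]:
    print(f"{name}: {weight_gb(70e9, bits):.0f} GB")
# BF16: 140 GB
# FP8: 70 GB
# INT4: 35 GB
```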
Serving stack integration
End-to-end FP8 inference requires alignment across the stack:
- Checkpoint — FP8 weights with embedded scales, or BF16 checkpoint plus on-the-fly quantize at load (slower cold start).
- Kernel library — Transformer Engine, cuBLASLt FP8 paths, or framework-specific fused modules. Mixed precision inside LayerNorm and softmax is normal.
- Runtime — vLLM --quantization fp8, TensorRT-LLM FP8 builders, or custom Triton. Continuous batching and PagedAttention work unchanged; FP8 affects GEMM and weight load paths only.
- Observability — log effective TFLOPs, scale histograms, and per-request quality scores. FP8 regressions often appear as increased refusal rate or JSON parse failures before benchmark scores move.
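As a minimal sketch of the runtime item above, the helper below assembles a vLLM launch command with FP8 weights and a KV cache left at the model's dtype. The model name is a placeholder and the helper is hypothetical; the `--quantization` and `--kv-cache-dtype` flags are written as I understand the vLLM CLI, so confirm them against `vllm serve --help` for your installed version.

```python
import shlex


def vllm_serve_command(model: str, fp8_weights: bool = True,
                       fp8_kv_cache: bool = False) -> list[str]:
    """Build an argv list for launching vLLM; nothing is executed here."""
    args = ["vllm", "serve", model]
    if fp8_weights:
        args += ["--quantization", "fp8"]
    # "auto" keeps the KV cache in the model dtype (BF16 for a BF16 checkpoint).
    args += ["--kv-cache-dtype", "fp8" if fp8_kv_cache else "auto"]
    return args


# Placeholder model identifier, not a real checkpoint.
cmd = vllm_serve_command("your-org/your-70b-model")
print(shlex.join(cmd))
```

Keeping the KV cache switch separate from the weight switch mirrors the advice to treat FP8 KV as its own experiment.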
Pair FP8 with speculative decoding carefully: draft and target models should use compatible precision. A BF16 draft whose tokens are verified by an FP8 target can introduce hidden numerical mismatch that lowers acceptance rates.
Harbor Analytics gateway refactor
Harbor's migration from BF16 to FP8 followed a staged rollout:
- Calibration pass — 50k production prompts (PII-redacted) run through the model; per-layer static scales computed as fallback bounds.
- WFP8 canary — 5% traffic on weight-only FP8; dynamic activation scales enabled. SQL accuracy held within 0.4% of BF16 on an internal 200-query suite.
- W/A FP8 expansion — attention and MLP GEMMs fully FP8 on tensor cores; LayerNorm and softmax kept in BF16 inside fused blocks.
- Eval gates — block deploy if GSM8K subset, JSON tool-call success rate, or Harbor-specific finance prompts regress more than 1% absolute.
- Cost accounting — tokens/sec per GPU rose from 300 to 580 on identical batching policy; effective $/1M tokens dropped 42% at fixed hardware.
They explicitly did not quantize the KV cache to FP8 in v1 — long-context nightly jobs still use BF16 cache blocks. INT4 remains in use for a separate 8B edge summarizer where extreme compression matters more than numeric fidelity.
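The eval gate from the rollout can be sketched as a comparison of candidate scores against the BF16 baseline with a 1% absolute tolerance. The metric names and scores below are invented for illustration and are not Harbor's results.

```python
def failing_gates(baseline, candidate, max_drop=0.01):
    """Metrics where the candidate falls more than `max_drop` (absolute) below baseline."""
    failures = {}
    for metric, base in baseline.items():
        drop = base - candidate[metric]
        if drop > max_drop:
            failures[metric] = round(drop, 4)
    return failures


# Illustrative scores only (fractions in [0, 1]).
bf16_scores = {"gsm8k_subset": 0.812, "json_tool_success": 0.974, "finance_prompts": 0.903}
fp8_scores = {"gsm8k_subset": 0.809, "json_tool_success": 0.951, "finance_prompts": 0.899}

failures = failing_gates(bf16_scores, fp8_scores)
if failures:
    print("block deploy:", failures)
else:
    print("gates passed")
```

In this made-up run the aggregate math and finance scores barely move while tool-call success drops by more than two points, which is the kind of task-specific regression the gate exists to catch.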
Technique decision table
| Your situation | Prefer | Avoid |
|---|---|---|
| H100/H200 fleet, latency-sensitive chat API | Dynamic W/A FP8 with BF16 KV cache | Staying on BF16 without measuring tensor core utilization |
| First FP8 experiment on production model | Weight-only FP8 + dynamic scales + task evals | Jumping straight to FP8 KV and INT4 weights combined |
| Math, SQL, or finance-heavy workloads | FP8 with conservative scaling + golden-set gates | INT4 without per-domain calibration |
| Single 24 GB GPU, must run 70B | INT4/AWQ or smaller model | FP8 (insufficient memory reduction) |
| A100 or older GPUs | BF16 + INT8/INT4 where supported | FP8 paths with emulation fallback |
| Multi-tenant API with mixed prompt lengths | Dynamic scaling + per-tenant outlier monitoring | Static scales from a single calibration domain |
Common pitfalls
- Assuming FP8 is interchangeable with INT4. Different hardware paths, different failure modes; FP8 needs Hopper+ for real wins.
- Static scales from unrepresentative calibration. Code-heavy production traffic needs code in the calibration set or dynamic scaling.
- Ignoring softmax and LayerNorm precision. Forcing full FP8 through every op increases drift; keep sensitive ops in BF16 inside fusions.
- FP8 KV cache without long-context eval. Needle tests and multi-hour chats catch drift that perplexity misses.
- Mixed-precision speculative decoding. Align draft/target dtypes, or validate that acceptance rates do not drop.
- Skipping task-specific evals. Aggregate MMLU can hide 5% JSON tool breakage.
- No scale telemetry. Clipping shows up in scales before users report quality issues.
Production checklist
- Confirm GPU generation supports native FP8 tensor cores (Hopper or newer).
- Choose E4M3 for weights; document activation format per layer.
- Implement dynamic per-tensor scaling with static calibration bounds.
- Start with weight-only FP8; promote to W/A after eval gates pass.
- Keep KV cache in BF16 until long-context evals approve FP8 blocks.
- Retain BF16/FP16 inside LayerNorm, softmax, and residual adds in fused kernels.
- Build golden sets for SQL, JSON tools, and domain-specific numeric tasks.
- Log scale factors and clipping rates per layer in staging.
- Canary 5–10% traffic; compare tokens/sec, P99 latency, and error rates.
- Document rollback path to BF16 checkpoint without redeploying weights.
- Pair with continuous batching in vLLM or equivalent runtime.
- Re-run evals after every base model fine-tune or LoRA merge.
Key takeaways
- FP8 inference runs 8-bit floating-point GEMMs on Hopper+ tensor cores to roughly double math throughput versus BF16 while usually preserving accuracy better than INT4.
- E4M3 and E5M2 trade precision for range; per-tensor dynamic scaling is mandatory for production quality.
- Weight-only FP8 is the safe first step; weight-and-activation FP8 maximizes speed but needs fused high-precision ops around LayerNorm and attention.
- FP8 solves throughput on hardware you already have; INT4 solves fitting oversized models onto small GPUs — different problems.
- Harbor Analytics cut $/token 42% with staged FP8 rollout, BF16 KV cache, and task-specific eval gates — not by toggling a single quant flag.
Related reading
- LLM quantization explained — INT4, GPTQ, AWQ and compression trade-offs
- LLM KV cache explained — prefill, decode, and memory scaling
- vLLM fundamentals explained — PagedAttention and continuous batching
- FlashAttention explained — IO-aware attention and fused kernels