Guide
LLM quantization explained
A 70B-parameter model in full FP16 needs roughly 140 GB of GPU memory just for weights — before the KV cache, activations, or batching overhead. Most teams cannot afford eight H100s per replica, so they quantize: represent weights (and sometimes activations) in lower-precision integers while keeping outputs close to the original model. LLM quantization is the set of techniques that map floating-point tensors to INT8, INT4, or mixed formats, then run hardware-accelerated matrix multiplies on the compressed representation. Done well, a 4-bit 7B model fits on a consumer GPU and still answers support tickets; done poorly, math reasoning collapses and legal disclaimers go missing. This guide explains precision ladders, post-training vs training-aware methods, popular formats (GPTQ, AWQ, GGUF, bitsandbytes), weight-only vs activation quantization, KV cache compression, a Harbor Analytics edge deployment worked example, a method decision table, common pitfalls, and a production checklist. For broader spend control see LLM cost optimization; for serving stacks see LLM inference serving.
What quantization does to a model
Neural networks store billions of learned coefficients. In full precision (FP32), each weight is a 32-bit float. Quantization replaces those floats with a small set of discrete levels — often 256 levels for INT8 or 16 levels for INT4 — plus per-tensor or per-channel scale and zero-point metadata that map integers back to approximate real values.
The basic mapping for asymmetric quantization is:
real ≈ scale × (quantized − zero_point)
During inference, linear layers become integer matrix multiplies followed by dequantization (or fused kernels that skip explicit dequant). The goal is not bit-exact reproduction of FP16 logits but perplexity and task accuracy within a tolerance band at a fraction of memory and bandwidth.
Precision ladder
- FP32 — training reference; rarely used for LLM inference at scale.
- FP16 / BF16 — default datacenter inference; BF16 has wider exponent range, common on TPUs and newer GPUs.
- INT8 — ~2× weight compression vs FP16; mature kernels on CPU and GPU.
- INT4 (and mixed 3/4/8-bit) — ~4× weight compression; dominant for local and edge LLMs; quality depends heavily on calibration data and algorithm.
- 1-bit / ternary experiments — research frontier; not production-default for general assistants.
Memory savings apply primarily to weights. Activation memory during long contexts still scales with sequence length unless you also quantize activations or compress the KV cache.
Post-training quantization vs quantization-aware training
Teams choose between compressing an already-trained checkpoint or baking low precision into training.
Post-training quantization (PTQ)
PTQ takes a finished FP16/BF16 model, runs calibration forward passes on a representative dataset (hundreds to thousands of samples), estimates per-layer scales, and writes quantized weights. No gradient updates. Fast to apply, works with open weights you did not train, and powers most local inference workflows. Risk: outlier channels in transformer MLP and attention projections can dominate error if calibration data does not match production prompts.
Quantization-aware training (QAT)
QAT simulates quantized arithmetic during fine-tuning so the model learns to tolerate rounding. Higher setup cost, needs GPU training budget, but often recovers accuracy at INT8 or aggressive INT4 when PTQ fails — especially for domain-specific models (medical coding, legal clause extraction). Most product teams QAT only after PTQ benchmarks miss SLOs on their eval set.
When PTQ is enough
General chat, summarization, and RAG over docs usually tolerate GPTQ/AWQ INT4 on 7B–13B bases. When PTQ struggles: heavy JSON schema adherence, multi-step arithmetic, rare-token languages, or tiny models where each bit matters more.
Popular quantization formats and tools
The ecosystem fragmented into weight formats tied to runtimes. Pick format + engine together, not in isolation.
GPTQ
GPTQ (Generalized Post-Training Quantization) quantizes weights
column-wise using approximate second-order information. Hugging Face
auto-gptq and community quants on TheBloke-style repos made GPTQ the
default for CUDA single-GPU inference for years. Strength: broad model coverage.
Weakness: sensitive to calibration set; INT4 quality varies by architecture
(Llama vs Mistral vs Qwen).
AWQ
AWQ (Activation-aware Weight Quantization) protects salient weight channels identified by activation magnitudes on calibration data. Often matches or beats GPTQ at 4-bit on reasoning benchmarks with the same memory footprint. Integrated in vLLM, TensorRT-LLM, and llama.cpp builds. Good default when your serving engine lists AWQ kernels for your GPU generation.
GGUF and llama.cpp
GGUF is a file format for mixed-bit weights consumed by
llama.cpp and bindings (Ollama, LM Studio). Supports Q4_K_M,
Q5_K_S, and other presets blending 4-bit and 5-bit blocks. Dominant for CPU +
Apple Silicon local inference where CUDA is unavailable. Trade-off: ecosystem
separate from PyTorch training stacks; re-quantize when upstream base weights
change.
bitsandbytes NF4
bitsandbytes 4-bit NormalFloat (NF4) targets
QLoRA fine-tuning — load base in 4-bit, train low-rank
adapters in FP16. Also used for inference via load_in_4bit=True in
Transformers. Excellent for adapter training on one GPU; inference throughput
may trail dedicated AWQ/GPTQ kernels unless you export merged weights to a
serving engine.
FP8 and vendor kernels
H100-class GPUs push FP8 tensor cores (weight + activation) inside TensorRT-LLM and vLLM. Not integer quantization, but the same economic story: lower bandwidth, higher FLOPs per watt. Requires hardware generation match and careful accuracy validation on your eval suite.
Weight-only vs activation quantization
Weight-only quantization (WOQ) stores parameters in INT4/INT8 but computes activations in FP16. Simpler kernels, most open-source local inference defaults. Memory win on weights; activation and KV cache remain FP16 unless separately optimized.
Weight + activation quantization (WAQ) quantizes intermediate tensors during the forward pass. Higher throughput on supported hardware when fused kernels avoid repeated FP16 round-trips. Harder to debug; accuracy cliffs show up on long contexts or unusual prompt distributions.
KV cache quantization
Long chats allocate KV tensors proportional to layers × heads × sequence length × head dim. KV cache INT8/FP8 cuts memory so you serve more concurrent users or longer contexts on the same GPU. vLLM and TensorRT-LLM expose FP8 KV on Hopper; validate retrieval-heavy RAG prompts where attention over thousands of tokens must stay sharp.
Accuracy, latency, and memory trade-offs
Quantization is a three-way optimization. Measure all three on your prompts, not leaderboard averages alone.
- Memory — weight bytes + KV + activations. INT4 WOQ roughly quarters weight RAM; a 7B model drops from ~14 GB to ~4–5 GB plus overhead.
- Latency — memory bandwidth often caps decode tokens/sec; smaller weights raise throughput until batching or CPU preprocessing bottlenecks. Prefill may not speed up proportionally.
- Quality — track perplexity on a held-out set plus task pass rates (JSON validity, citation accuracy, refusal behavior). A 2% benchmark drop may be unacceptable in finance, fine in draft email assist.
Aggressive 3-bit quants and aggressive grouping win size benchmarks but fail structured-output tasks. Keep an FP16 shadow route for escalation when confidence scores dip — same pattern as model routing.
Worked example: Harbor Analytics on-prem summarizer
Harbor Analytics runs a nightly batch job: summarize 2,000 warehouse incident PDFs on a single RTX 4090 (24 GB). Requirements: extract severity, root cause, and three bullet recommendations; output strict JSON; no cloud API (customer data residency).
Baseline failure
Llama-3.1 8B FP16 plus 8k context KV cache peaks at ~18 GB — tight for parallel batches. Throughput: 38 docs/hour. FP16 70B is impossible on one card.
Quantization path
- Download AWQ INT4 8B weights compatible with vLLM on Ada GPUs.
- Calibrate not needed (pre-quantized); run 200 held-out incidents against FP16 baseline; JSON schema pass rate 97.1% vs 97.8% — within SLO.
- Enable FP8 KV cache; max concurrent sequences rise from 2 to 5.
- Throughput: 112 docs/hour; GPU memory headroom for a small classifier routing obvious duplicates.
What they skipped
Q3_K quants failed JSON field enums (severity levels). QAT was deferred — PTQ AWQ met accuracy. They log quant format in manifest files so ops never mixes GGUF and AWQ checkpoints in the same vLLM deployment.
Quantization method decision table
| Scenario | Recommended approach | Watch out for |
|---|---|---|
| Single NVIDIA GPU, vLLM/TGI serve | AWQ or GPTQ INT4 WOQ | Kernel support per model arch and GPU gen. |
| Mac / CPU local assistant | GGUF Q4_K_M via llama.cpp or Ollama | Different quant presets change quality a lot. |
| QLoRA fine-tune on one 24 GB card | bitsandbytes NF4 base + FP16 LoRA | Merge adapters before AWQ export for prod serve. |
| H100 multi-tenant API | FP8 weights + FP8 KV (TensorRT-LLM / vLLM) | Requires Hopper; validate long-context RAG. |
| INT4 PTQ misses math eval | Try AWQ > GPTQ > INT8 > QAT fine-tune | Calibration data must match prod prompts. |
| Structured JSON / tool calls | Conservative Q4 or INT8; keep FP16 fallback | Aggressive 3-bit breaks enum fields. |
| Edge phone NPU | Vendor SDK (Core ML, NNAPI) with INT8 | Export path separate from CUDA quants. |
| Regulated audit trail | Document quant method + eval delta vs FP16 | “4-bit” alone is not a reproducible spec. |
Common pitfalls
- Benchmarking on MMLU only — aggregate MCQ scores hide JSON and citation regressions your product actually ships.
- Wrong calibration corpus — PTQ on Wikipedia when production prompts are JSON logs and stack traces.
- Mixing formats and engines — loading GGUF weights into vLLM or AWQ into llama.cpp wastes debug hours.
- Ignoring KV cache — INT4 weights still OOM on 32k context without KV compression or paging.
- Quantizing after bad merges — broken model merges stay broken; quantization cannot fix incoherent bases.
- Skipping rollback — no FP16 canary route when rolling out new quants to 100% traffic.
- Chasing smallest file — Q2 and Q3 win downloads, lose customers on subtle reasoning tasks.
- Forgetting tokenizer parity — re-quantizing after tokenizer or chat-template changes without re-eval.
Production checklist
- Define accuracy SLOs on real prompts (schema pass rate, human spot checks).
- Measure FP16 baseline memory, latency p50/p99, and quality metrics.
- Pick format matched to serving engine and hardware generation.
- Run PTQ (AWQ/GPTQ) with calibration data sampled from production logs (redacted).
- Compare INT8, INT4, and mixed-bit presets; do not default to lowest bit count.
- Load-test concurrent sessions including long-context RAG threads.
- Enable KV cache quantization only after per-layer attention sanity checks.
- Pin quant manifest: base model hash, method, bit width, eval scores.
- Deploy canary with automatic rollback to FP16 on error-rate spikes.
- Re-run eval when updating base weights, adapters, or chat templates.
Key takeaways
- Quantization maps FP weights to low-bit integers with scale metadata, shrinking memory and bandwidth.
- PTQ (GPTQ, AWQ, GGUF) is fast for teams serving open weights; QAT recovers edge cases when PTQ misses SLOs.
- Weight-only INT4 is the default local-GPU story; FP8 and KV quant matter at datacenter scale.
- Validate on your tasks — structured output and long context break before multiple-choice benchmarks do.
- Pair quantization with routing and cost controls so quality fallbacks stay affordable.
Related reading
- LLM cost optimization explained — token budgets and routing beyond compression
- Small language models explained — when a smaller base beats a quantized giant
- LLM inference serving explained — batching, vLLM, and SLO tuning
- LLM KV cache explained — memory scaling during long contexts