Guide
LLM model quantization and inference explained: shrinking weights without breaking answers
Training a large language model is expensive once. Serving it at scale is expensive every day: GPU memory, electricity, and milliseconds per token add up fast. A 70-billion-parameter model stored in full 32-bit precision needs roughly 280 GB just for weights (70 billion parameters times 4 bytes), before activations or the growing KV cache that remembers prior tokens during generation. Quantization stores those same weights in fewer bits per parameter (INT8, INT4, or mixed schemes) so a model fits on fewer or smaller GPUs, runs faster, or serves more concurrent users. The trade-off is subtle: aggressive compression can dull reasoning, increase errors on math, or cause failures on rare tokens. This guide explains how transformer inference actually spends time and memory, what post-training quantizers like GPTQ and AWQ do, how serving engines batch requests, and how to benchmark quality after you shrink a checkpoint, whether you are self-hosting Llama-class models or sizing an API budget next to RAG and fine-tuning.
Where inference time and memory go
Autoregressive LLMs generate one token at a time. Each new token requires a full forward pass through every layer of the network. Two costs dominate production deployments:
- Weight memory — static size proportional to parameter count times bytes per weight.
- KV cache memory — grows with batch size, number of layers, hidden dimension, head count, and sequence length (prompt plus everything generated so far). Long context windows can exceed weight memory on their own.
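Both terms reduce to simple arithmetic. The sketch below is illustrative only: the function names are invented for this guide, and the layer count, KV-head count, and head dimension describe an assumed 70B-class shape with grouped-query attention, not figures taken from a specific model card.

```python
def weight_bytes(num_params: float, bits_per_weight: float) -> float:
    """Static weight memory: parameter count times bytes per weight."""
    return num_params * bits_per_weight / 8


def kv_cache_bytes(batch, seq_len, num_layers, num_kv_heads, head_dim,
                   bytes_per_value=2):
    """KV cache size; the factor 2 covers one key and one value vector
    per token, per layer, per KV head."""
    return (2 * batch * seq_len * num_layers * num_kv_heads
            * head_dim * bytes_per_value)


GB = 1e9
for bits in (32, 16, 8, 4):
    print(f"{bits:>2}-bit weights: {weight_bytes(70e9, bits) / GB:.0f} GB")

# Assumed shape: 80 layers, 8 KV heads of dimension 128, FP16 cache entries.
cache = kv_cache_bytes(batch=16, seq_len=32_768, num_layers=80,
                       num_kv_heads=8, head_dim=128)
print(f"KV cache for 16 x 32K-token sequences: {cache / GB:.1f} GB")
```

Under these assumptions the cache for 16 concurrent 32K-token sequences comes to about 172 GB, more than the 140 GB of FP16 weights, which is the sense in which long contexts can outgrow the weights.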
On modern GPUs, decode-time matrix multiplications at small batch sizes are often memory-bandwidth bound once weights fit in VRAM: reading billions of bytes per token matters more than raw FLOPs. Halving weight precision roughly halves the bytes moved per layer, which is why INT8 and INT4 inference can be faster even though the arithmetic is less exact. CPU and Apple-silicon inference (GGUF via llama.cpp) follows the same logic with different kernels.
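A back-of-the-envelope bound makes the bandwidth argument concrete. This is a rough sketch under stated assumptions: batch size 1, every weight read exactly once per generated token, KV cache reads and kernel overhead ignored, and a placeholder bandwidth figure that is not a measurement of any particular device.

```python
def max_decode_tokens_per_s(num_params, bits_per_weight, bandwidth_gb_per_s):
    """Upper bound on batch-1 decode speed if every weight is read once per token."""
    bytes_per_token = num_params * bits_per_weight / 8
    return bandwidth_gb_per_s * 1e9 / bytes_per_token


ASSUMED_BANDWIDTH_GB_PER_S = 2000  # illustrative placeholder, not a measured value

for label, bits in (("FP16", 16), ("INT8", 8), ("INT4", 4)):
    bound = max_decode_tokens_per_s(70e9, bits, ASSUMED_BANDWIDTH_GB_PER_S)
    print(f"{label}: at most {bound:.1f} tokens/s per sequence")
```

The absolute numbers depend entirely on the assumed bandwidth; the useful part is the ratio, since each halving of bits per weight doubles the ceiling.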
Prefill vs decode
Serving stacks split work into prefill (processing the prompt in parallel) and decode (generating tokens one by one). Prefill is compute-heavy and benefits from large matrix tiles; decode is latency-sensitive and often runs small batches. Optimizations like continuous batching, paged attention, and speculative decoding target different phases. Quantization shrinks memory in both, but its speedup is largest in bandwidth-bound decode, and decode latency is what users feel in chat UIs.
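One way to see why the two phases behave differently is arithmetic intensity: FLOPs performed per byte moved. The sketch below models a single square matmul and is a simplification; `arithmetic_intensity` and the hidden size of 8192 are assumptions made for illustration.

```python
def arithmetic_intensity(tokens, d_model, bytes_per_weight, bytes_per_activation=2):
    """FLOPs per byte moved for one (tokens x d_model) @ (d_model x d_model) matmul."""
    flops = 2 * tokens * d_model * d_model
    weight_bytes = d_model * d_model * bytes_per_weight
    activation_bytes = 2 * tokens * d_model * bytes_per_activation  # input plus output
    return flops / (weight_bytes + activation_bytes)


D_MODEL = 8192  # assumed hidden size

prefill = arithmetic_intensity(tokens=2048, d_model=D_MODEL, bytes_per_weight=2)
decode = arithmetic_intensity(tokens=1, d_model=D_MODEL, bytes_per_weight=2)
decode_int4 = arithmetic_intensity(tokens=1, d_model=D_MODEL, bytes_per_weight=0.5)

print(f"prefill, FP16 weights: {prefill:.0f} FLOPs/byte")
print(f"decode,  FP16 weights: {decode:.2f} FLOPs/byte")
print(f"decode,  INT4 weights: {decode_int4:.2f} FLOPs/byte")
```

In this model a 2048-token prefill does over a thousand FLOPs per byte, while single-token decode does about one, so decode time tracks bytes moved. Moving to 4-bit weights leaves the FLOPs unchanged but cuts the bytes by roughly 4x.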
Precision formats: FP32 down to INT4
Think of precision as how many distinct values each weight can represent:
- FP32 — training reference; rarely needed for inference on modern stacks.
- FP16 / BF16 — default "full precision" inference on NVIDIA datacenter GPUs. BF16 shares exponent range with FP32 and is common in training; FP16 is ubiquitous in vLLM and TensorRT-LLM paths.
- INT8 — eight bits per weight with a per-tensor or per-channel scale (and sometimes zero-point). Roughly 2x smaller than FP16 with small perplexity drift on well-calibrated models.
- INT4 / NF4 — four bits per weight; 4x smaller than FP16. Quality loss shows up first on needle-in-haystack retrieval, multi-step arithmetic, and low-resource languages unless calibration data is representative.
Mixed-precision schemes keep sensitive layers (often the first and last, sometimes attention projections) in FP16 while quantizing MLP blocks harder. Separately, most frameworks run weight-only quantization as "4-bit weights, 16-bit compute": weights stream from VRAM in compressed form and are dequantized on the fly, while activations stay wider for numerical stability.
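The per-channel scale mentioned for INT8 can be sketched in a few lines. This is a minimal symmetric round-to-nearest quantizer (no zero-point) written for illustration; the function names and the tiny weight matrix are made up, and production kernels pack the integers and fuse dequantization into the matmul.

```python
def quantize_row(row, bits=8):
    """Symmetric quantization of one output channel with its own scale."""
    qmax = 2 ** (bits - 1) - 1            # 127 for INT8, 7 for INT4
    max_abs = max(abs(v) for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    ints = [max(-qmax, min(qmax, round(v / scale))) for v in row]
    return ints, scale


def dequantize_row(ints, scale):
    return [q * scale for q in ints]


def max_abs_error(row, bits):
    ints, scale = quantize_row(row, bits)
    restored = dequantize_row(ints, scale)
    return max(abs(a - b) for a, b in zip(row, restored))


weights = [[0.02, -0.5, 0.31], [1.5, -0.003, 0.75]]  # one row per output channel
for row in weights:
    print(f"INT8 error {max_abs_error(row, 8):.5f}, INT4 error {max_abs_error(row, 4):.5f}")
```

Because each row gets its own scale, the large values in the second channel do not coarsen the grid used for the first. The rounding error per weight is bounded by half a scale step, which is why the 4-bit grid is so much coarser than the 8-bit one.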
Post-training quantization: GPTQ, AWQ, and GGUF
You do not need to retrain from scratch to quantize. Post-training quantization (PTQ) starts from a finished checkpoint and searches for low-bit representations that minimize error on a calibration set of sample prompts.
GPTQ (post-training quantization for generative pre-trained transformers)
GPTQ quantizes each weight matrix a column at a time. After rounding a column, it adjusts the not-yet-quantized weights to compensate for the rounding error, using approximate second-order information (a Hessian estimated from calibration activations), so the layer's output stays close to the original. It produces checkpoints compatible with ExLlama, AutoGPTQ, and many Hugging Face pipelines. GPTQ shines on NVIDIA GPUs when you want INT4 weights with mature CUDA kernels.
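The idea of Hessian-guided error compensation can be shown on a toy weight row. This is a heavily simplified sketch in the spirit of GPTQ, not the published algorithm or any library's implementation: it handles one row, uses a fixed grid step, inverts the Hessian with plain Gauss-Jordan elimination, and omits the damping, blocking, and Cholesky steps that real implementations rely on. All names and numbers are illustrative.

```python
def invert(matrix):
    """Gauss-Jordan inverse of a small, well-conditioned square matrix."""
    n = len(matrix)
    aug = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [a - f * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def quantize(value, step):
    return round(value / step) * step


def compensated_quantize(w, calib_inputs, step):
    """Quantize weights left to right, pushing each rounding error onto later weights."""
    n = len(w)
    hessian = [[sum(x[i] * x[j] for x in calib_inputs) for j in range(n)] for i in range(n)]
    h_inv = invert(hessian)
    w = list(w)
    for q in range(n):
        w_q = quantize(w[q], step)
        err = (w[q] - w_q) / h_inv[q][q]
        for j in range(q + 1, n):
            w[j] -= err * h_inv[q][j]          # compensate with remaining weights
        w[q] = w_q
        d, row_q = h_inv[q][q], list(h_inv[q])
        for i in range(q + 1, n):              # drop index q from the inverse Hessian
            for j in range(q + 1, n):
                h_inv[i][j] -= h_inv[i][q] * row_q[j] / d
    return w


def output_error(w_ref, w_new, calib_inputs):
    return sum(sum(xi * (a - b) for xi, a, b in zip(x, w_ref, w_new)) ** 2
               for x in calib_inputs)


calib = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]   # toy calibration activations
w = [0.26, 0.26]
rtn = [quantize(v, 0.1) for v in w]            # plain round-to-nearest
comp = compensated_quantize(w, calib, 0.1)
print("round-to-nearest error:", output_error(w, rtn, calib))
print("compensated error:     ", output_error(w, comp, calib))
```

On this toy input, round-to-nearest rounds both weights up, while the compensated pass rounds the second weight down to offset the first, cutting the squared output error on the calibration inputs from about 0.0096 to about 0.0056. Compensation is greedy, so it is not guaranteed to win on every input.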
AWQ (Activation-Aware Weight Quantization)
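AWQ starts from the observation that a small fraction of weight channels, those multiplied by large activations, disproportionately affect a layer's output. It protects these "salient" channels by scaling them up before quantization (and scaling the matching activations down), then quantizes everything at the same low bit width. In practice AWQ often matches or beats GPTQ on quality at the same bit width, especially on instruction-tuned models where a few attention pathways carry format-following behavior.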
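The scaling trick can be shown on a four-weight toy example. This is an illustrative sketch only: the scale factor `s = 2.0` is hand-picked for one channel, whereas real AWQ derives per-channel scales from calibration activations, and the weights and activations here are invented.

```python
def quantize_group(values, bits=4):
    """Symmetric round-to-nearest with one shared scale for the whole group."""
    qmax = 2 ** (bits - 1) - 1
    scale = max(abs(v) for v in values) / qmax
    return [max(-qmax, min(qmax, round(v / scale))) * scale for v in values]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


weights = [0.14, -0.70, 0.33, 0.52]
activations = [8.0, 0.5, -0.3, 0.4]   # channel 0 is salient: its activation is large
exact = dot(weights, activations)

# Baseline: quantize the weights as they are.
rtn_error = abs(dot(quantize_group(weights), activations) - exact)

# Activation-aware variant: scale the salient weight up by s and its
# activation down by s. The product is unchanged before quantization.
s = 2.0
scaled_weights = [weights[0] * s] + weights[1:]
scaled_activations = [activations[0] / s] + activations[1:]
awq_error = abs(dot(quantize_group(scaled_weights), scaled_activations) - exact)

print(f"plain 4-bit error:  {rtn_error:.3f}")
print(f"scaled 4-bit error: {awq_error:.3f}")
```

The scaled weight still fits under the group maximum, so the grid step is unchanged; the rounding error on the salient channel is then multiplied by an activation half as large, which is where the improvement comes from.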
GGUF and llama.cpp
The GGUF format (successor to GGML) packages quantized tensors for efficient CPU and Metal inference via llama.cpp. Quantization types are labeled Q4_K_M, Q5_K_S, and similar: different mixes of 4-bit and 5-bit super-blocks with importance weighting. GGUF is the pragmatic choice for local chat on a laptop without a discrete GPU, not for maximum throughput on an A100 cluster.
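Block-wise formats trade a little storage for locality: each small block of weights carries its own scale. The sketch below is a simplified stand-in, not the actual GGUF tensor layout (the K-quant types add super-blocks and further structure); the function names and sample values are assumptions for illustration.

```python
def quantize_blocks(values, block_size, bits=4):
    """Round-to-nearest with a separate symmetric scale per block; returns dequantized values."""
    qmax = 2 ** (bits - 1) - 1
    out = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        max_abs = max(abs(v) for v in block)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        out.extend(max(-qmax, min(qmax, round(v / scale))) * scale for v in block)
    return out


def bits_per_weight(block_size, weight_bits=4, scale_bits=16):
    """Effective storage cost once the per-block scale is amortized."""
    return weight_bits + scale_bits / block_size


def total_abs_error(original, approx):
    return sum(abs(a - b) for a, b in zip(original, approx))


values = [0.11, -0.2, 0.05, 0.17, 8.0, -0.3, 0.21, 0.1]   # one outlier at index 4
one_scale = total_abs_error(values, quantize_blocks(values, block_size=len(values)))
per_block = total_abs_error(values, quantize_blocks(values, block_size=4))

print(f"single scale error: {one_scale:.3f}")
print(f"two blocks error:   {per_block:.3f}")
print(f"32-weight blocks with FP16 scales: {bits_per_weight(32)} bits per weight")
```

With a single scale the outlier flattens every small weight to zero; with two blocks only the outlier's own block suffers. The price is the scale itself, which is why a nominally 4-bit format with 32-weight blocks and a 16-bit scale costs 4.5 bits per weight.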
Quantization-aware training (QAT)
When PTQ quality is unacceptable (common for sub-7B models on niche domains), teams simulate low-bit arithmetic during fine-tuning so the network learns weights that survive rounding. QAT costs more than PTQ but less than training from scratch. It pairs naturally with QLoRA, which already trains adapters on top of a frozen 4-bit base model.
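"Simulating low-bit arithmetic" usually means fake quantization: the forward pass rounds the weight, while the optimizer keeps updating a full-precision copy. The sketch below uses a straight-through gradient, which is one common choice and an assumption here rather than something the text above prescribes. It shows the mechanics on a single weight; a one-weight toy cannot show the quality benefit, which comes from many weights adapting to each other's rounding.

```python
def fake_quant(w, step):
    """Forward-pass rounding that simulates low-bit storage of a weight."""
    return round(w / step) * step


def train_qat(xs, ys, step, lr=0.1, epochs=50):
    latent = 0.0  # full-precision weight that the optimizer actually updates
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            w_q = fake_quant(latent, step)       # forward pass sees the rounded weight
            grad_wq = 2 * (w_q * x - y) * x      # gradient of squared error w.r.t. w_q
            latent -= lr * grad_wq               # straight-through: d(w_q)/d(latent) taken as 1
    return latent, fake_quant(latent, step)


latent, deployed = train_qat(xs=[1.0], ys=[0.26], step=0.1)
print(f"latent weight {latent:.3f}, deployed weight {deployed:.1f}")
```

The latent weight settles near the rounding boundary between 0.2 and 0.3, the two grid points that bracket the target of 0.26, and the deployed weight is always one of them.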
Serving optimizations beyond quantization
Shrinking weights is necessary but not sufficient for cheap, fast LLM APIs:
- Continuous batching — vLLM-style schedulers add new requests to an in-flight batch as others finish, improving GPU utilization versus static batching.
- PagedAttention — stores KV cache in non-contiguous blocks like virtual memory, reducing fragmentation when sequences differ in length.
- Tensor parallelism / pipeline parallelism — splits one model across multiple GPUs when even INT4 weights do not fit on a single card.
- Speculative decoding — a small draft model proposes several tokens; the large model verifies them in parallel. Good draft models cut latency when acceptance rates stay high.
- Prefix caching — reuse KV blocks for identical system prompts or RAG document prefixes across users, slashing prefill cost in multi-tenant SaaS.
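The gain from continuous batching is easiest to see by counting decode steps. The simulation below is a toy model, not how any particular scheduler is implemented: it assumes every decode step costs the same regardless of how many slots are occupied, ignores prefill, and uses invented output lengths.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Each fixed batch runs until its longest request finishes."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        steps += max(output_lengths[start:start + batch_size])
    return steps


def continuous_batching_steps(output_lengths, batch_size):
    """Free slots are refilled from the queue before every decode step."""
    queue = deque(output_lengths)
    active = []   # remaining tokens for each in-flight request
    steps = 0
    while queue or active:
        while queue and len(active) < batch_size:
            active.append(queue.popleft())
        steps += 1
        # Requests with one token left finish on this step and free their slot.
        active = [remaining - 1 for remaining in active if remaining > 1]
    return steps


lengths = [100, 5, 5, 5, 100, 5, 5, 5]   # two long answers among short ones
print("static:    ", static_batching_steps(lengths, batch_size=4))
print("continuous:", continuous_batching_steps(lengths, batch_size=4))
```

With static batches, the short requests in each group wait behind a 100-token answer and the run takes 200 steps. With continuous batching, the second long request starts as soon as a slot frees up, and the same work finishes in 105 steps.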
Managed APIs (OpenAI, Anthropic, Google) hide these details but bill you per token — understanding prefill vs decode pricing explains why long retrieved contexts in RAG pipelines hurt latency and cost even when output length is short.
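A quick calculation shows how a long retrieved context dominates the bill. The prices below are hypothetical placeholders chosen only to make the arithmetic visible; they are not any provider's actual rates, and `request_cost` is a name invented for this sketch.

```python
def request_cost(prompt_tokens, output_tokens, price_in_per_million, price_out_per_million):
    """Cost of one request when input and output tokens are billed separately."""
    input_cost = prompt_tokens * price_in_per_million / 1e6
    output_cost = output_tokens * price_out_per_million / 1e6
    return input_cost, output_cost


# Hypothetical prices (currency units per million tokens), for illustration only.
PRICE_IN, PRICE_OUT = 3.0, 15.0

input_cost, output_cost = request_cost(
    prompt_tokens=12_000,   # system prompt plus retrieved documents
    output_tokens=200,      # short answer
    price_in_per_million=PRICE_IN,
    price_out_per_million=PRICE_OUT,
)
total = input_cost + output_cost
print(f"input share of cost: {input_cost / total:.0%}")
```

Even with output tokens priced five times higher in this example, the 12,000-token prompt accounts for most of the cost, and it is also the part that drives prefill latency.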
Quality checks after you quantize
Perplexity on a held-out corpus is a quick sanity check but a poor proxy for product quality. Before shipping a quantized checkpoint:
- Run your golden eval set — the same prompts you use for regression testing after prompt or model changes (see LLM evaluation).
- Stress structured output — JSON mode, tool calls, and regex-constrained fields fail first when logits compress.
- Test long-context retrieval if you rely on big windows; INT4 can blur distant attention more than FP16.
- Compare latency and throughput at realistic concurrency, not batch=1 micro-benchmarks alone.
- Log token-level refusals and hallucination rate in shadow traffic before full cutover.
If INT4 fails evals but INT8 passes, ship INT8. It still halves weight memory relative to FP16, and reliability beats the second halving that INT4 would add.
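That rule can be turned into a small release gate. The sketch below is one possible shape, assuming you already have model outputs for each precision on the same golden prompts; the helper names, the 2-point tolerance, and the sample outputs are all hypothetical.

```python
import json


def is_valid_json(text):
    """Structured-output check: does the model's answer parse as JSON?"""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


def pass_rate(flags):
    return sum(flags) / len(flags)


def pick_precision(results, baseline="fp16", candidates=("int4", "int8"), max_drop=0.02):
    """Return the first (cheapest) candidate whose pass rate stays within max_drop of baseline."""
    floor = pass_rate(results[baseline]) - max_drop
    for name in candidates:
        if pass_rate(results[name]) >= floor:
            return name
    return baseline


# Hypothetical outputs for three golden prompts that must return JSON.
outputs = {
    "fp16": ['{"total": 42}', '{"items": []}', '{"ok": true}'],
    "int8": ['{"total": 42}', '{"items": []}', '{"ok": true}'],
    "int4": ['{"total": 42}', '{"items": [}', '{"ok": true'],
}
results = {name: [is_valid_json(o) for o in outs] for name, outs in outputs.items()}
print("ship:", pick_precision(results))
```

Candidates are listed from smallest to largest, so the gate picks the most compressed checkpoint that still clears the bar and falls back to the baseline if none does. In practice the per-prompt check would be your own eval logic rather than JSON validity alone.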
Decision framework: when to quantize what
| Scenario | Typical choice |
|---|---|
| Prototype on a MacBook, no GPU | GGUF Q4_K_M or Q5 via llama.cpp |
| Single 80 GB A100, 70B instruct model | AWQ or GPTQ INT4 + vLLM |
| Quality-critical legal/medical assistant | FP16 or INT8 with full eval suite |
| High-QPS chat API, cost sensitive | INT4 weights + continuous batching + prefix cache |
| Domain fine-tune on small base model | QLoRA training, then AWQ PTQ for deploy |
Quantization changes how you run a model, not what it knows. Factual gaps still need RAG; tone and format still need fine-tuning or careful prompting. Treat compression as infrastructure — pick bit width after you know which checkpoint and eval bar matter for your product.
Key takeaways
- Inference cost is mostly memory bandwidth and KV cache — not just parameter count.
- INT8 and INT4 PTQ (GPTQ, AWQ, GGUF quant types) shrink weights 2–4x relative to FP16, with tunable quality loss.
- Serving engines add continuous batching, paged KV, and speculative decode on top of quantization.
- Always re-run product evals after changing precision; perplexity alone is not enough.
Related reading
- Transformer architecture explained — where attention and FFN layers spend compute during decode
- LLM fine-tuning explained — QLoRA trains on quantized bases; PTQ compresses the result for deploy
- LLM context windows explained — KV cache growth and why long prompts dominate VRAM
- LLM evaluation and benchmarking explained — golden sets to rerun after every quantization experiment