Guide

LLM embedding quantization explained

Harbor Support indexed 12.4 million help-desk chunks at 1024-dimensional float32, roughly 50 GB of raw vectors before metadata. The HNSW graph consumed another 18 GB of RAM on each replica, and p95 retrieval latency crept past 180 ms during peak hours. Swapping to a smaller embedding model would have hurt recall on technical tickets; re-sharding across more nodes would have doubled monthly infra cost. The team instead applied scalar INT8 quantization to the stored vectors, kept the HNSW index, and rescored the top hits at higher precision. Replica RAM fell 62%, p95 query latency dropped to 106 ms, and recall@10 on a held-out ticket set declined only 2.1 percentage points, well inside the reranker’s recovery band.

Embedding quantization compresses dense vectors used for semantic search, clustering, and RAG retrieval. Unlike weight quantization in LLM inference, embedding quantization targets the stored corpus and sometimes the live query vector. The goal is smaller indexes, faster distance math, and cheaper replication — without collapsing retrieval quality below what a reranker or hybrid BM25 blend can repair. This guide covers scalar INT8, product quantization, binary embeddings, Matryoshka truncation as an alternative, ANN index interactions, a Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.

Why quantize embeddings at all

Production RAG pipelines store one vector per chunk, passage, or document block. At scale the costs are predictable:

  • Memory — float32 vectors use 4 bytes per dimension; 1M chunks at dim 1536 = ~6 GB vectors alone, before graph overhead.
  • Replication — every read replica loads the full index; quantization directly cuts cloud RAM bills.
  • Bandwidth — shard transfers, backups, and cold starts move fewer bytes with compressed representations.
  • CPU/GPU distance math — INT8 dot products and Hamming popcount run faster than float32 cosine on large candidate sets.
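The memory bullet is plain arithmetic. A minimal sketch, assuming decimal gigabytes (10^9 bytes) and ignoring graph and metadata overhead; the helper name `raw_vector_gb` is chosen here for illustration only:

```python
def raw_vector_gb(num_vectors: int, dim: int, bytes_per_dim: float) -> float:
    """Raw vector storage in decimal GB, excluding index and metadata overhead."""
    return num_vectors * dim * bytes_per_dim / 1e9


# Same corpus (1M chunks, 1536 dims) at different storage widths.
for label, width in [("float32", 4), ("float16", 2), ("int8", 1), ("binary", 1 / 8)]:
    print(f"{label:8s} {raw_vector_gb(1_000_000, 1536, width):6.2f} GB")
```

The float32 row reproduces the ~6 GB figure above; the same function applied to 12.4 million chunks at 1024 dimensions gives the roughly 50 GB quoted for Harbor Support.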

Quantization is not free: each bit removed introduces approximation error. The engineering question is whether that error keeps retrieval above your quality floor or pushes it below. A common pattern is to quantize stored corpus vectors but keep query embeddings at float32 (or float16) until the final distance step, then re-rank top-k hits with full-precision vectors fetched from object storage or a sidecar cache.

Scalar quantization (INT8 / float16)

Scalar quantization maps each float dimension independently to a lower-precision type. INT8 is the most common storage format:

  1. Compute per-dimension (or per-vector) min and max on a calibration sample of embeddings.
  2. Scale each dimension into [-128, 127] with a learned or fixed scale factor s and zero-point z.
  3. At query time, quantize the query vector with the same calibration, compute approximate inner product or L2 distance in INT8, then optionally dequantize top candidates for exact rescoring.
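The three steps above can be sketched in a few lines. This is an illustrative version only, assuming per-dimension min/max calibration and an affine (scale plus zero-point) mapping; the function names are made up for this example, and production engines compute distances directly on the INT8 codes rather than dequantizing every vector.

```python
def fit_int8_params(sample):
    """Step 1: per-dimension (scale, zero_point) from calibration min/max."""
    params = []
    for column in zip(*sample):
        lo, hi = min(column), max(column)
        scale = (hi - lo) / 255.0 or 1.0  # constant dimension: avoid divide-by-zero
        zero_point = -128 - round(lo / scale)  # maps the calibrated minimum to -128
        params.append((scale, zero_point))
    return params


def quantize(vec, params):
    """Step 2: affine map into [-128, 127], clamping values outside the calibrated range."""
    return [max(-128, min(127, round(x / s) + z)) for x, (s, z) in zip(vec, params)]


def dequantize(codes, params):
    """Approximate floats recovered from INT8 codes, e.g. to rescore top candidates."""
    return [(q - z) * s for q, (s, z) in zip(codes, params)]


calibration = [[-0.8, 0.1], [0.6, 0.4], [0.1, -0.3]]
params = fit_int8_params(calibration)
codes = quantize([0.25, 0.0], params)
print(codes, dequantize(codes, params))
```

Values that fall outside the calibrated range are clamped, which is where a calibration sample that does not match production traffic starts to cost recall.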

Float16 (half precision) is simpler: it halves storage with minimal recall loss on modern embedding models trained with normalized outputs. Many vector databases (Qdrant, Milvus, pgvector with halfvec) support float16 natively. INT8 storage is 4× smaller than float32; float16 is 2× smaller.
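To see what half precision costs on normalized values, here is a small sketch that round-trips a vector through IEEE 754 float16 using only the standard library. The helper names are illustrative, and a vector database would do this conversion internally.

```python
import struct


def to_float16_bytes(vec):
    """Pack a vector as little-endian IEEE 754 half-precision values (2 bytes each).

    struct raises OverflowError for magnitudes above the float16 range (about 65504),
    which normalized embeddings never reach.
    """
    return struct.pack(f"<{len(vec)}e", *vec)


def from_float16_bytes(blob):
    """Unpack a float16 blob back into Python floats."""
    return list(struct.unpack(f"<{len(blob) // 2}e", blob))


vec = [0.5, -0.25, 0.1234, 0.0]
blob = to_float16_bytes(vec)
restored = from_float16_bytes(blob)
print(len(blob), max(abs(a - b) for a, b in zip(vec, restored)))
```

For components in [-1, 1] the round-trip error stays below one part in a thousand, which is why float16 is usually the lowest-risk first step.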

Scalar methods preserve the vector’s general direction in embedding space, which is why recall@10 often drops less than 3% when followed by a cross-encoder reranker. They work best when embeddings are L2-normalized and when calibration samples match production domain distribution — a mismatch between calibration (English marketing pages) and queries (CJK support tickets) can inflate error disproportionately.

Product quantization (PQ) and IVF-PQ

Product quantization

Product quantization splits each d-dimensional vector into m subvectors of length d/m. For each subspace, k-means trains 2^nbits centroids (often 256 for 8-bit codes). A vector becomes m small integer codes — one per subspace — and distance is approximated by table lookups instead of full dot products.

PQ achieves higher compression than scalar INT8 (often 8–32× versus raw float32) but with coarser approximation. It shines in billion-scale indexes where neighbor search is already approximate (ANN) rather than exact. Faiss IndexIVFPQ is the canonical pattern: an inverted file (IVF) clusters vectors into nlist buckets; PQ compresses vectors inside each bucket.
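The PQ mechanics described above (split, per-subspace k-means, integer codes, table-lookup distances) can be sketched in pure Python. This is a toy version for intuition, not how Faiss implements it: the k-means is a plain Lloyd's loop, there is no IVF stage, and all function names are illustrative.

```python
import random


def sq_dist(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


def split(vec, m):
    """Cut a vector into m contiguous subvectors of length d/m."""
    step = len(vec) // m
    return [vec[i * step:(i + 1) * step] for i in range(m)]


def kmeans(points, k, iters=10, seed=0):
    """Plain Lloyd's k-means; returns k centroids."""
    centroids = [list(p) for p in random.Random(seed).sample(points, k)]
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            buckets[nearest].append(p)
        for c, bucket in enumerate(buckets):
            if bucket:  # keep the old centroid if a cluster empties out
                centroids[c] = [sum(col) / len(bucket) for col in zip(*bucket)]
    return centroids


def train_pq(vectors, m, nbits):
    """One codebook of 2**nbits centroids per subspace."""
    subspaces = zip(*(split(v, m) for v in vectors))
    return [kmeans(list(sub), 2 ** nbits) for sub in subspaces]


def encode(vec, codebooks):
    """Replace each subvector with the index of its nearest centroid."""
    return [min(range(len(cb)), key=lambda c: sq_dist(sub, cb[c]))
            for sub, cb in zip(split(vec, len(codebooks)), codebooks)]


def build_tables(query, codebooks):
    """Once per query: squared distance from each query subvector to every centroid."""
    return [[sq_dist(sub, centroid) for centroid in cb]
            for sub, cb in zip(split(query, len(codebooks)), codebooks)]


def adc_distance(tables, codes):
    """Approximate squared L2 distance by summing m table lookups."""
    return sum(table[code] for table, code in zip(tables, codes))
```

The lookup tables are built once per query and reused for every coded vector, which is where the speedup over full dot products comes from.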

When PQ beats scalar INT8

  • Corpus exceeds available RAM even after INT8 — PQ plus disk-backed IVF is the next lever.
  • The quality budget tolerates a higher recall@10 loss because a reranker or cross-encoder stage recovers precision on 50–100 candidates.
  • Embedding dimension is high (1024–4096) where subspace factorization captures redundancy.

Tune m (number of subquantizers) and nbits against a fixed eval set. Doubling m usually improves recall at the cost of larger code storage and slower encoding.
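The storage side of that trade-off is easy to compute before running any sweep. A small sketch, assuming float32 source vectors and float32 codebook centroids; the helper names are illustrative:

```python
def pq_code_bytes(m: int, nbits: int) -> float:
    """Bytes per stored vector: m codes of nbits each."""
    return m * nbits / 8


def compression_ratio(dim: int, m: int, nbits: int) -> float:
    """Float32 vector size divided by PQ code size."""
    return dim * 4 / pq_code_bytes(m, nbits)


def codebook_bytes(dim: int, nbits: int) -> int:
    """Total codebook size: m books x 2**nbits centroids x (dim/m) floats, so m cancels out."""
    return (2 ** nbits) * dim * 4


for m in (128, 256, 512):
    print(m, pq_code_bytes(m, 8), compression_ratio(1024, m, 8))
```

At 1024 dimensions with 8-bit codes, m=128 gives 128-byte codes (32×), m=256 gives 16×, and m=512 gives 8×, which spans the 8–32× range quoted earlier. The codebooks themselves stay at 1 MiB regardless of m.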

Binary embeddings and one-bit quantization

Binary quantization maps each dimension (or each projected dimension after a random or learned rotation) to a single bit: 1 if positive, 0 otherwise. Distance becomes Hamming distance, computed as an XOR followed by a popcount, which is extremely fast on CPUs with SIMD and on FPGA/ASIC search accelerators.
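A minimal sketch of sign-bit binarization and Hamming distance, assuming no rotation step and using a Python integer as the bit container. The function names are illustrative; real engines pack bits into fixed-width words and use hardware popcount.

```python
def binarize(vec):
    """Pack sign bits into one integer: bit i is 1 when vec[i] is positive."""
    code = 0
    for i, x in enumerate(vec):
        if x > 0:
            code |= 1 << i
    return code


def hamming(a, b):
    """Number of differing bits: XOR, then count the ones."""
    return bin(a ^ b).count("1")


doc = binarize([0.3, -0.1, 0.0, 2.0])
query = binarize([0.7, 0.2, -0.4, -0.9])
print(bin(doc), bin(query), hamming(doc, query))
```

Note that magnitude is discarded entirely: 0.3 and 2.0 both become a 1 bit, which is why binary codes need a rescoring stage for anything beyond candidate generation.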

Binary codes compress 32× versus float32 but recall@10 can fall 10–25% without rescoring. Production patterns:

  • Two-stage retrieval — binary index returns top-1000; rescore with float32 stored in a compact side table.
  • Learned binarization — models like binarized Sentence-BERT or Matryoshka-trained models emit quantization-friendly layouts.
  • Hybrid pre-filter — binary ANN narrows candidates; BM25 or metadata filters run on the reduced set.
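The two-stage pattern from the list above can be sketched as follows. This is a brute-force illustration, assuming sign-bit codes for the first stage and exact inner product for the second; a real deployment would use an ANN index for stage one and precomputed codes, and the `corpus` dict here simply stands in for the float side table.

```python
def sign_bits(vec):
    """Sign-bit code packed into an integer (bit i set when vec[i] > 0)."""
    return sum(1 << i for i, x in enumerate(vec) if x > 0)


def two_stage_search(query, corpus, shortlist=1000, k=10):
    """corpus maps chunk_id -> float vector; returns the top-k chunk ids."""
    codes = {cid: sign_bits(v) for cid, v in corpus.items()}  # built offline in practice
    q_code = sign_bits(query)

    # Stage 1: cheap Hamming ranking over the binary codes.
    def hamming_to_query(cid):
        return bin(codes[cid] ^ q_code).count("1")

    candidates = sorted(codes, key=hamming_to_query)[:shortlist]

    # Stage 2: exact inner product, computed only for the shortlist.
    def exact_score(cid):
        return sum(a * b for a, b in zip(query, corpus[cid]))

    return sorted(candidates, key=exact_score, reverse=True)[:k]


corpus = {"a": [1.0, 0.1], "b": [0.9, 0.5], "c": [-1.0, -1.0], "d": [0.1, 1.0]}
print(two_stage_search([1.0, 0.2], corpus, shortlist=3, k=2))
```

The shortlist size is the main tuning knob: it bounds how many float vectors are fetched per query and sets the ceiling on what rescoring can recover.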

Binary is appropriate for first-pass candidate generation at very large scale (100M+ vectors) where float32 HNSW is economically infeasible, not as the sole retrieval signal in high-stakes RAG.

Matryoshka truncation as an alternative

Some modern embedding models support Matryoshka representation learning (MRL): early dimensions carry most semantic signal, so truncating to 256 or 512 dims retains strong recall without post-hoc quantization noise.

MRL truncation is not bitwise compression — you still store float values — but dimension reduction often beats aggressive PQ on recall while cutting storage 2–4×. Compare on your eval set:

  • 1024-dim float32 → 256-dim float16 Matryoshka slice
  • 1024-dim float32 → 1024-dim INT8 scalar

The winner depends on model training; MRL-native models (nomic-embed, some OpenAI and Google embedding APIs) favor truncation. Legacy models may need INT8 or PQ instead. See embedding model selection for MTEB benchmarks at multiple dimensions.
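For the truncation arm of that comparison, a sketch of the usual recipe: keep the leading dimensions and re-normalize so cosine and inner-product scores stay comparable. It assumes an MRL-trained model whose leading dimensions are meaningful on their own; the function name is illustrative.

```python
import math


def truncate_and_renormalize(vec, dims):
    """Keep the first `dims` components and rescale to unit L2 norm."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head


# Per-vector storage for the two candidates listed above, against the float32 baseline.
baseline_bytes = 1024 * 4          # 1024-dim float32
mrl_slice_bytes = 256 * 2          # 256-dim float16 Matryoshka slice
int8_bytes = 1024 * 1              # 1024-dim INT8 scalar
print(baseline_bytes / mrl_slice_bytes, baseline_bytes / int8_bytes)

print(truncate_and_renormalize([3.0, 4.0, 12.0], 2))
```

On storage alone the 256-dim float16 slice is half the size of the INT8 option, so the eval-set comparison decides whether its recall holds up.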

ANN index interactions: HNSW, IVF, and recall

Quantization changes how approximate nearest neighbor (ANN) graphs behave:

  • HNSW + scalar INT8 — supported in Qdrant, Weaviate, and Milvus; graph edges computed on quantized distances. Increase efConstruction and efSearch slightly to recover recall lost to compression.
  • IVF-PQ — the usual Faiss choice at billion scale; query-time nprobe trades latency for recall. Start with nprobe=16 on a 4096-list IVF and sweep on held-out queries.
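  • DiskANN and ScaNN — DiskANN traverses a graph stored on SSD while keeping PQ-compressed vectors in RAM; ScaNN is an in-memory library that pairs partitioning with its own learned quantization. In both designs quantization is built in, not optional.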

Always measure recall@k and MRR on a labeled query set after index changes — not just p50 latency. A 40% latency win that drops recall@10 below your reranker’s recovery threshold increases end-to-end hallucination rate in RAG evaluation even if dashboards look green.
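Both metrics are short enough to implement directly. A sketch, assuming each query has a set of relevant chunk IDs and a ranked list of retrieved IDs; the function names and the `runs` structure are illustrative:

```python
def recall_at_k(retrieved, relevant, k):
    """Fraction of the relevant ids that appear in the top-k retrieved ids."""
    if not relevant:
        return 0.0
    return len(set(retrieved[:k]) & set(relevant)) / len(relevant)


def reciprocal_rank(retrieved, relevant):
    """1/rank of the first relevant hit, or 0.0 if none was retrieved."""
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in relevant:
            return 1.0 / rank
    return 0.0


def evaluate(runs, k=10):
    """runs: list of (retrieved_ids, relevant_ids) pairs, one per query."""
    n = len(runs)
    return {
        f"recall@{k}": sum(recall_at_k(r, rel, k) for r, rel in runs) / n,
        "mrr": sum(reciprocal_rank(r, rel) for r, rel in runs) / n,
    }


runs = [(["a", "b", "c"], {"b"}), (["x", "y"], {"z"})]
print(evaluate(runs, k=2))
```

Running the same `runs` through the float32 baseline and each compressed index gives a like-for-like comparison to set beside the latency numbers.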

Harbor Support vector index refactor

Harbor Support’s RAG stack used BGE-large-en-v1.5 at 1024 dimensions, float32 storage, and HNSW with m=16, ef=128. The refactor proceeded in four steps:

  1. Calibration sample — 50k stratified chunks (by product, language, ticket severity) to fit per-dimension INT8 scales.
  2. Stored INT8 + query float32 — corpus vectors quantized; live queries stayed float32 until distance computation, where they were quantized for symmetric INT8 comparison inside HNSW.
  3. Rescore top-50 — higher-precision float16 vectors for the top-50 hits loaded from a compressed sidecar (zstd float16 blobs keyed by chunk ID).
  4. Cross-encoder rerank — unchanged ms-marco MiniLM reranker on top-10 after rescore.

End-to-end answer accuracy on 800 human-graded tickets moved from 78.4% to 77.1% — within noise — while infra RAM per replica fell from 68 GB to 26 GB. Attempting PQ on the same index without rescoring dropped recall@10 by 8.7%; the team kept PQ in reserve for a future 50M-chunk archive tier.
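A rough check of the figures in that paragraph. The sketch assumes the two accuracy numbers behave like independent samples of 800 tickets each; if the same tickets were graded under both configurations, a paired test would be the more appropriate tool. The helper names are illustrative.

```python
import math


def pct_reduction(before, after):
    """Percentage reduction from `before` to `after`."""
    return 100 * (before - after) / before


def diff_standard_error(p1, p2, n):
    """Standard error of the difference between two proportions, each measured on n items."""
    return math.sqrt(p1 * (1 - p1) / n + p2 * (1 - p2) / n)


ram_cut = pct_reduction(68, 26)
se = diff_standard_error(0.784, 0.771, 800)
print(f"RAM reduction: {ram_cut:.1f}%")
print(f"accuracy drop: {0.784 - 0.771:.3f}, standard error of the difference: {se:.3f}")
```

The RAM figure works out to about 61.8%, which rounds to the 62% quoted, and the 1.3-point accuracy drop is smaller than one standard error of roughly 2.1 points under this assumption.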

Technique decision table

| Goal | Prefer | Avoid |
|---|---|---|
| Halve RAM with minimal recall loss | Float16 storage or Matryoshka 512→256 truncation | Aggressive PQ with nbits=4 |
| 4× compression, reranker downstream | Scalar INT8 + top-50 float rescore | Binary-only index without rescore |
| Billion-vector archive tier | IVF-PQ + two-stage float rescore | Full float32 HNSW on single host |
| Ultra-low latency on CPU | Binary first stage + INT8 rescore | High-dim float32 brute force |
| Domain with non-English scripts | Calibration on multilingual sample; float16 first | English-only INT8 calibration |
| Regulatory audit of retrieval | Store float16 sidecar for disputed hits | Lossy PQ as sole evidence trail |
| Cheaper than new embedding API tier | Quantize existing index | Swap to smaller model without re-embed eval |

Common pitfalls

  • Quantizing queries without calibrating on queries — query and document distributions differ; use separate scales or keep queries at float32.
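  • Skipping rescore on PQ indexes — PQ-only top-10 is ranked by approximate distances, so true nearest neighbors often land just below the cutoff; rescore a wider candidate set at higher precision. Neighbors that sit outside the probed IVF lists need a larger nprobe instead.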
  • Calibrating on the wrong domain — INT8 scales fit on Wikipedia do not transfer to legal contracts or code snippets.
  • Ignoring normalization — inner-product scoring on unnormalized INT8 vectors favors high-magnitude vectors instead of approximating cosine similarity; L2-normalize before quantizing.
  • Comparing latency without recall — faster garbage retrieval is still garbage.
  • Re-quantizing without re-tuning HNSW ef — compressed distances need wider graph search to match recall.
  • Mixing quantization formats across replicas — blue/green deploys must rebuild indexes, not hot-swap codebooks.
  • Assuming Matryoshka dims are interchangeable — truncate only on models explicitly trained for MRL.

Production checklist

  • Build a labeled eval set (500+ query–document pairs) before changing storage format.
  • Measure recall@5, recall@10, and MRR on the eval set at each compression tier.
  • Fit INT8 calibration on a stratified sample matching production language and topic mix.
  • Keep float16 or float32 sidecar for top-k rescoring after ANN retrieval.
  • Re-tune HNSW efSearch or IVF nprobe after quantization deploy.
  • Log compression format version in index metadata for reproducible rebuilds.
  • Run A/B on end-to-end RAG answer accuracy, not just retrieval metrics.
  • Test cold-start index load time and replica RAM after compression.
  • Document recall budget: maximum acceptable recall@10 drop before rollback.
  • Plan a PQ migration path before 10× corpus growth makes float16 insufficient.
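The recall-budget item lends itself to an automated gate in the deploy pipeline. A minimal sketch, assuming recall@10 is measured as a fraction on the same eval set before and after the change; the function name and the 3-point budget in the example are hypothetical:

```python
def within_recall_budget(baseline, candidate, max_drop_points):
    """True when recall@10 fell by no more than the agreed budget, in percentage points.

    A small epsilon absorbs floating-point noise at the boundary.
    """
    drop_points = (baseline - candidate) * 100
    return drop_points <= max_drop_points + 1e-9


# Example: a 3-point budget, INT8 index versus the float32 baseline.
if not within_recall_budget(baseline=0.90, candidate=0.88, max_drop_points=3.0):
    raise SystemExit("recall@10 drop exceeds budget: roll back the index")
print("recall budget respected")
```

Writing the budget down as a number before the migration keeps the rollback decision from being renegotiated after the latency graphs improve.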

Key takeaways

  • Embedding quantization shrinks RAG indexes and speeds distance math; scalar INT8 and float16 are the first levers most teams should pull.
  • Product quantization enables billion-scale ANN but needs two-stage rescoring to protect recall.
  • Matryoshka truncation on MRL-trained models often beats post-hoc quantization on recall-per-byte.
  • Always evaluate recall@k and end-to-end answer quality together — latency gains that drop retrieval below reranker recovery thresholds hurt users.
  • Harbor Support cut replica RAM 62% with INT8 corpus storage, float32 query encoding, and top-50 rescore, at a cost of 2.1 percentage points of recall@10, within reranker tolerance.
