Guide
LLM embedding batch inference explained
Harbor Analytics reindexed 2.4 million document chunks after an embedding model swap. The first attempt called a hosted embeddings API one chunk at a time. Throughput plateaued at 62 vectors per second; the full reindex would have taken 11 hours and blown the maintenance window. The platform team deployed a self-hosted Text Embeddings Inference (TEI) server with length-bucketed dynamic batching. Peak throughput hit 847 vectors per second on a single L4 GPU. Total reindex time dropped to 47 minutes with identical recall@5 on the golden query set.
Embedding batch inference groups many texts into one forward pass through an encoder model instead of embedding them one at a time. Unlike autoregressive LLM decoding, embedding models are bidirectional: every token in a sequence can attend to every other token in parallel. GPUs are built for wide matrix math, so batching amortizes kernel launch overhead and keeps tensor cores saturated. This guide covers why batching matters for RAG ingest and query paths, static vs dynamic batching, padding and length bucketing, memory limits and optimal batch sizes, self-hosted vs API serving patterns, the Harbor Analytics refactor, a decision table comparing batching techniques with one-at-a-time API calls, pitfalls, and a production checklist. It builds on embedding fundamentals and chunking strategy.
Why serial embedding is slow
A single embedding forward pass on modern hardware often completes in 2–15 ms for a 512-token sequence. That sounds fast until you multiply by millions of chunks. The real cost is not FLOPs alone but underutilization: sending one short paragraph per request leaves most of the GPU idle while it waits on kernel launches, HTTP round trips, and Python interpreter overhead.
Embedding workloads differ from chat inference in three ways that shape batching strategy:
| Property | Embedding encoder | Autoregressive LLM decode |
|---|---|---|
| Attention pattern | Full bidirectional (BERT-style or encoder-only) | Causal (one new token per step) |
| Output | Fixed-size vector per input (pooling over tokens) | Variable-length token stream |
| Batching benefit | High — independent sequences in one matmul | Moderate — sequences diverge in length over time |
For RAG, embedding appears twice: ingest (batch millions of document chunks into a vector database) and query (embed one or a few user questions per request). Ingest is embarrassingly parallel and batch-friendly. Query paths are latency-sensitive but still benefit from micro-batching when traffic is concurrent.
Static vs dynamic batching
Static batching
You collect exactly N texts (or wait until a timeout), pad them to the longest sequence in the batch, and run one forward pass. Static batching is simple to implement in Python with Hugging Face SentenceTransformer or transformers pipelines:
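```python
from sentence_transformers import SentenceTransformer

# Model name is only an example; substitute your own encoder.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

embeddings = model.encode(
    texts,                      # list[str] of chunks to embed
    batch_size=64,              # texts per forward pass
    show_progress_bar=True,
    normalize_embeddings=True,  # L2-normalize each output vector
)
```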
Static batching works well for offline ingest jobs where you control the full corpus and can sort chunks by token length before batching.
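The length sort mentioned above can be sketched as follows. This is a minimal illustration, not a library API: `encode_length_sorted`, `fake_embed`, and `fake_token_len` are hypothetical names, and the stand-in encoder and whitespace "tokenizer" exist only so the example runs without a model.

```python
from typing import Callable, List, Sequence


def encode_length_sorted(
    texts: Sequence[str],
    embed_batch: Callable[[List[str]], List[List[float]]],
    token_len: Callable[[str], int],
    batch_size: int = 64,
) -> List[List[float]]:
    """Embed texts in length-sorted batches and return vectors in input order."""
    # Sort indices by token length so each batch holds similar-length texts.
    order = sorted(range(len(texts)), key=lambda i: token_len(texts[i]))
    out: List[List[float]] = [[] for _ in texts]
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        vectors = embed_batch([texts[i] for i in idx])
        # Scatter results back to the original positions.
        for i, vec in zip(idx, vectors):
            out[i] = vec
    return out


# Stand-ins for illustration only: a whitespace "tokenizer" and a fake encoder.
def fake_token_len(text: str) -> int:
    return len(text.split())


def fake_embed(batch: List[str]) -> List[List[float]]:
    return [[float(len(t.split()))] for t in batch]


docs = ["a b c d e", "a", "a b c", "a b"]
vectors = encode_length_sorted(docs, fake_embed, fake_token_len, batch_size=2)
print(vectors)
```

`SentenceTransformer.encode` already sorts its inputs by length within a single call, so an explicit sort like this matters mostly when the corpus is streamed to the encoder in separate pieces.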
Dynamic batching
A server accumulates incoming requests for a short window (typically 1–50 ms), groups them into a batch up to a memory limit, runs inference, and returns results to each caller. TEI, NVIDIA Triton, and vLLM's embedding mode all support dynamic batching for online serving. This is how you serve concurrent query embeddings without one-request-per-forward-pass waste.
Dynamic batching trades a small latency increase (the wait window) for higher throughput. For ingest pipelines implemented as streaming workers, the same pattern applies: workers push chunks into a shared queue; a GPU consumer drains the queue in batches.
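The queue-draining pattern can be sketched with the standard library alone. This is an illustrative sketch of the wait-window idea, not how TEI or Triton implement it; `drain_batch` and its parameters are hypothetical names, and a real consumer would cap the batch by token budget rather than by item count.

```python
import queue
import time
from typing import List


def drain_batch(q: "queue.Queue[str]", max_batch: int, max_wait_s: float) -> List[str]:
    """Block for the first item, then collect more until the batch is full
    or the wait window closes."""
    batch = [q.get()]  # wait indefinitely for the first request
    deadline = time.monotonic() + max_wait_s
    while len(batch) < max_batch:
        remaining = deadline - time.monotonic()
        if remaining <= 0:
            break
        try:
            batch.append(q.get(timeout=remaining))
        except queue.Empty:
            break  # window closed with a partial batch
    return batch


pending: "queue.Queue[str]" = queue.Queue()
for i in range(5):
    pending.put(f"query {i}")

first = drain_batch(pending, max_batch=3, max_wait_s=0.01)
second = drain_batch(pending, max_batch=3, max_wait_s=0.01)
print(len(first), len(second))
```

The first call returns as soon as the batch is full; the second returns a partial batch once the window expires, which is the latency cost the wait window imposes.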
Padding, bucketing and memory
Batched sequences must share tensor dimensions, so shorter texts are padded to match the longest sequence in the batch. The attention mask keeps padding from influencing the pooled vector, but dense kernels still compute over the padded positions, so a batch mixing a 32-token title with a 512-token paragraph wastes compute on 480 padding positions for the short text.
Length bucketing sorts or routes texts into bins (e.g. 0–64, 65–128, 129–256, 257–512 tokens) and batches only within a bin. Harbor Analytics used four buckets for their 512-token max chunk size. Padding waste fell from 41% to 8% of total tokens processed, which alone accounted for a 1.6× throughput gain before any hardware change.
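To make the padding arithmetic concrete, the sketch below assigns token counts to the four bins named above and compares padding waste with and without bucketing. The token counts are made up for illustration, and `bucket_for` and `padding_waste` are hypothetical helpers.

```python
from collections import defaultdict
from typing import Dict, List

BUCKET_UPPER_BOUNDS = [64, 128, 256, 512]  # bins from the text above


def bucket_for(n_tokens: int) -> int:
    """Return the upper bound of the bin that holds a text of n_tokens."""
    for bound in BUCKET_UPPER_BOUNDS:
        if n_tokens <= bound:
            return bound
    raise ValueError(f"{n_tokens} tokens exceeds the 512-token maximum")


def padding_waste(batches: List[List[int]]) -> float:
    """Fraction of processed token slots that are padding, where each batch
    is padded to its own longest sequence."""
    real = sum(sum(b) for b in batches)
    padded = sum(len(b) * max(b) for b in batches)
    return 1 - real / padded


lengths = [32, 512, 40, 480, 60, 500]  # made-up token counts

mixed = [lengths]  # one batch, everything padded to 512
buckets: Dict[int, List[int]] = defaultdict(list)
for n in lengths:
    buckets[bucket_for(n)].append(n)
bucketed = list(buckets.values())

print(round(padding_waste(mixed), 3), round(padding_waste(bucketed), 3))
```

If throughput scales with the share of non-padding tokens, the Harbor Analytics figures are self-consistent: (1 - 0.08) / (1 - 0.41) is about 1.56, in line with the quoted 1.6× gain.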
GPU memory scales roughly with batch_size × max_seq_len × hidden_dim for activations. Finding the optimal batch size is an empirical sweep:
- Start at batch 32 for a 768-dim / 512-max-seq model on 16 GB VRAM.
- Double until OOM, then back off 25%.
- Log vectors/sec and memory headroom; peak throughput often sits below max batch.
- Use FP16 or BF16 inference; many embedding models support half precision natively.
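The sweep above might be automated along these lines. This sketch assumes a caller-supplied `try_batch` function (hypothetical) that runs one batch and returns vectors per second, raising `MemoryError` on out-of-memory; with PyTorch you would catch `torch.cuda.OutOfMemoryError` instead. It also reads "back off 25%" as 75% of the largest size that succeeded, which is one interpretation of the rule.

```python
from typing import Callable, Dict, Tuple


def sweep_batch_size(
    try_batch: Callable[[int], float],
    start: int = 32,
    limit: int = 4096,
) -> Tuple[int, Dict[int, float]]:
    """Double the batch size until try_batch runs out of memory, then return
    75% of the largest working size plus the measured throughputs."""
    throughput: Dict[int, float] = {}
    size = start
    while size <= limit:
        try:
            throughput[size] = try_batch(size)
        except MemoryError:
            break
        size *= 2
    if not throughput:
        raise RuntimeError("even the starting batch size ran out of memory")
    largest_ok = max(throughput)
    return int(largest_ok * 0.75), throughput


# Simulated hardware for illustration: anything above 300 texts "OOMs".
def fake_try_batch(size: int) -> float:
    if size > 300:
        raise MemoryError("simulated OOM")
    return 100.0 * size ** 0.5


safe_size, measured = sweep_batch_size(fake_try_batch)
print(safe_size, sorted(measured))
```

Keeping the per-size throughput lets you pick a smaller batch than the returned ceiling when the measurements show throughput peaking earlier.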
For very large ingest jobs, INT8 quantization of the encoder can raise effective batch capacity, while INT8 or binary quantization of the stored vectors cuts index size; measure recall on your retrieval benchmarks before adopting either.
Ingest vs query paths
The two embedding paths have different SLOs and batch shapes:
| Path | Goal | Typical batching | Failure mode |
|---|---|---|---|
| Ingest / reindex | Maximize vectors per second | Large static batches (64–256), length-sorted | OOM from oversized batch; silent truncation if max_length too low |
| Online query | Minimize P99 latency | Small dynamic batches (4–32), short wait window | Queue buildup under burst; cold-start on scaled-to-zero |
| Hybrid (ingest + live queries) | Fair share | Separate GPU pools or priority queues | Reindex starves query latency |
Harbor Analytics runs ingest on a dedicated TEI replica and query on a smaller always-warm instance. During full reindex, query traffic never contends with million-chunk backfills. For smaller teams, time-slicing (reindex off-peak) or a single server with ingest deprioritized in the queue works if query volume is low.
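For the single-server case, deprioritizing ingest in the queue can be sketched with a priority queue. This is an illustrative sketch only: the names `QUERY`, `INGEST`, and `submit` are hypothetical, and it does not describe how any particular server schedules work.

```python
import itertools
import queue

QUERY, INGEST = 0, 1      # lower number is served first
_seq = itertools.count()  # tie-breaker keeps FIFO order within a priority

work: "queue.PriorityQueue[tuple]" = queue.PriorityQueue()


def submit(priority: int, text: str) -> None:
    """Enqueue a text; query items always sort ahead of ingest items."""
    work.put((priority, next(_seq), text))


submit(INGEST, "chunk 1")
submit(INGEST, "chunk 2")
submit(QUERY, "user question")
submit(INGEST, "chunk 3")

served = [work.get()[2] for _ in range(4)]
print(served)
```

Strict priority like this can starve ingest when query traffic is sustained, which is acceptable only under the low-query-volume condition stated above.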
Serving options
Hosted embedding APIs
OpenAI, Cohere, Voyage, and others expose embedding endpoints that batch internally, and some providers accept up to 2,048 inputs per request. This is the quickest path to working embeddings when you lack GPU capacity, but per-token pricing adds up on million-chunk corpora, and you cannot tune batch internals.
Text Embeddings Inference (TEI)
Hugging Face TEI is a Rust server optimized for sentence-transformer models. It supports dynamic batching, token-based truncation, Prometheus metrics, and OpenAI-compatible /v1/embeddings routes. Harbor Analytics deployed ghcr.io/huggingface/text-embeddings-inference with --max-batch-tokens 16384 and --max-client-batch-size 256.
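A client for such a deployment can be sketched with the standard library. The sketch assumes a TEI instance reachable at `http://localhost:8080` (an assumption, not Harbor's actual address) and uses TEI's native `/embed` route, which takes a JSON body with an `inputs` field; `build_embed_request` and `embed` are hypothetical helper names.

```python
import json
import urllib.request
from typing import List

TEI_URL = "http://localhost:8080"  # assumption: a locally reachable TEI instance
MAX_CLIENT_BATCH = 256             # mirrors --max-client-batch-size above


def build_embed_request(texts: List[str]) -> urllib.request.Request:
    """Build a POST to /embed, refusing batches the server would reject."""
    if len(texts) > MAX_CLIENT_BATCH:
        raise ValueError("batch exceeds --max-client-batch-size")
    body = json.dumps({"inputs": texts}).encode("utf-8")
    return urllib.request.Request(
        f"{TEI_URL}/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(texts: List[str]) -> List[List[float]]:
    """Send one client batch and return one vector per input text."""
    with urllib.request.urlopen(build_embed_request(texts), timeout=30) as resp:
        return json.loads(resp.read())


# Building the request needs no running server, so it is safe to inspect.
req = build_embed_request(["first chunk", "second chunk"])
print(req.full_url, req.get_method())
```

If the token budget is counted as the flag name suggests, 16384 tokens covers at most 32 full 512-token chunks per server batch (16384 / 512) or up to 256 chunks of 64 tokens, which is one reason length-sorted input helps.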
Sentence Transformers / custom Python
Fine for notebooks and one-off jobs. Production ingest at scale usually moves to a dedicated server once reindex exceeds a few hours or must run on a schedule.
vLLM embedding mode
If you already run vLLM for chat, its embedding task type can serve encoder models with similar batching infrastructure. Useful when you want one ops stack for both chat and retrieval encoders.
Harbor Analytics ingest refactor
The refactor had four stages:
- Chunk audit: verified all 2.4M chunks respected the 512-token limit from the chunking pipeline; 0.3% were re-split.
- Length sort: pre-sorted chunk IDs by token count into four buckets before streaming to TEI.
- Batched upsert: embedded in batches of 128, upserted to Qdrant in batches of 1,000 vectors with wait=false for async indexing.
- Validation: ran 500 golden queries; recall@5 matched the serial baseline at 94.2%.
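The two batch sizes in the upsert stage (128 for embedding, 1,000 for the vector store) can be decoupled as in the sketch below. It is not Harbor Analytics' code: `ingest`, `embed_batch`, and `upsert` are hypothetical, with the embedding and storage calls injected so the example runs without a GPU or a database. With Qdrant, `upsert` would wrap the client's upsert call with `wait=False`.

```python
from typing import Callable, Iterable, Iterator, List, Tuple

Vector = List[float]
Chunk = Tuple[int, str]      # (chunk_id, text)
Point = Tuple[int, Vector]   # (chunk_id, vector)


def batched(items: Iterable[Chunk], size: int) -> Iterator[List[Chunk]]:
    """Yield consecutive lists of at most `size` items."""
    batch: List[Chunk] = []
    for item in items:
        batch.append(item)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def ingest(
    chunks: Iterable[Chunk],
    embed_batch: Callable[[List[str]], List[Vector]],
    upsert: Callable[[List[Point]], None],
    embed_size: int = 128,
    upsert_size: int = 1000,
) -> int:
    """Embed in small batches, buffer the vectors, and upsert in larger ones."""
    pending: List[Point] = []
    total = 0
    for group in batched(chunks, embed_size):
        vectors = embed_batch([text for _, text in group])
        pending.extend(zip((cid for cid, _ in group), vectors))
        while len(pending) >= upsert_size:
            upsert(pending[:upsert_size])
            total += upsert_size
            pending = pending[upsert_size:]
    if pending:  # flush the final partial batch
        upsert(pending)
        total += len(pending)
    return total


# Fakes for illustration: record call sizes instead of doing real work.
embed_calls: List[int] = []
upsert_calls: List[int] = []


def fake_embed(texts: List[str]) -> List[Vector]:
    embed_calls.append(len(texts))
    return [[0.0] for _ in texts]


written = ingest(
    ((i, f"chunk {i}") for i in range(2500)),
    fake_embed,
    lambda points: upsert_calls.append(len(points)),
)
print(written, len(embed_calls), upsert_calls)
```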
End-to-end ingest throughput rose from 62 to 847 vectors/sec (13.7×). Cost per reindex fell from ~$180 in API fees to ~$4 in GPU time on a reserved L4.
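The headline numbers can be checked with a few lines of arithmetic, assuming each quoted rate is sustained for the whole 2.4 million chunk reindex:

```python
TOTAL_CHUNKS = 2_400_000


def reindex_seconds(vectors_per_sec: float) -> float:
    """Wall-clock seconds to embed the corpus at a constant rate."""
    return TOTAL_CHUNKS / vectors_per_sec


serial_hours = reindex_seconds(62) / 3600
batched_minutes = reindex_seconds(847) / 60
speedup = 847 / 62

print(f"{serial_hours:.1f} h, {batched_minutes:.1f} min, {speedup:.1f}x")
# prints "10.8 h, 47.2 min, 13.7x"
```

The results line up with the roughly 11 hours and 47 minutes reported for the two runs.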
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Serial API calls (one text per request) | Prototypes, <10K chunks, no GPU | Million-chunk corpora; cost-sensitive reindex |
| Bulk API calls (provider batches inputs) | Medium corpora; no infra team | Need custom model or tight query-latency SLO tuning |
| Static Python batching (offline job) | Scheduled reindex; single-machine GPU | Concurrent online query load on same GPU |
| TEI / Triton dynamic batching | Production query + ingest; concurrent traffic | Corpus fits in memory on CPU-only box |
| Length bucketing + sorted ingest | Variable chunk sizes; padding waste >20% | All chunks are near uniform length |
| Separate ingest vs query pools | Large reindex while serving live traffic | Dev/staging with no SLO |
Pitfalls
- Truncation without logging: batches silently cut text at max_length; retrieval quality drops on long chunks.
- Missing query prefixes: asymmetric models expect role-specific prefixes (for example, E5 uses query: and passage:, while BGE prepends an instruction to queries only); batching does not fix wrong prefixes.
- Normalization inconsistency: L2-normalize at embed time or at index time, not both differently across batches.
- Padding-dominated batches: mixing lengths without bucketing can make batch 128 slower than batch 32.
- Reindex during peak query: a shared GPU causes P99 query latency spikes.
- Dimension mismatch on upsert: the batched pipeline swaps the model but the vector DB schema still expects the old dimension.
- Ignoring deduplication: re-embedding unchanged chunks wastes batch capacity; content-hash skip lists save hours.
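A content-hash skip list, as mentioned in the last pitfall, could look like the sketch below. The function names and the in-memory `seen` dictionary are hypothetical; a real pipeline would persist the hashes and record them only after a successful upsert. Folding the model name into the hash is a design choice made here so that a model swap invalidates every entry.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def content_hash(text: str, model_id: str) -> str:
    """Hash the chunk text together with the model name, so changing the
    encoder forces a full re-embed."""
    return hashlib.sha256(f"{model_id}\n{text}".encode("utf-8")).hexdigest()


def chunks_to_embed(
    chunks: Iterable[Tuple[str, str]],
    seen: Dict[str, str],
    model_id: str,
) -> List[Tuple[str, str]]:
    """Return (chunk_id, text) pairs whose hash is new or changed,
    updating `seen` in place."""
    changed = []
    for chunk_id, text in chunks:
        digest = content_hash(text, model_id)
        if seen.get(chunk_id) != digest:
            changed.append((chunk_id, text))
            seen[chunk_id] = digest
    return changed


seen: Dict[str, str] = {}
corpus = [("c1", "alpha"), ("c2", "beta")]

first_pass = chunks_to_embed(corpus, seen, "model-a")
second_pass = chunks_to_embed([("c1", "alpha"), ("c2", "beta v2")], seen, "model-a")
after_swap = chunks_to_embed([("c1", "alpha")], seen, "model-b")
print(len(first_pass), len(second_pass), len(after_swap))
```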
Production checklist
- Measure baseline ingest throughput (vectors/sec) before optimizing.
- Sort or bucket chunks by token length before large static batches.
- Sweep batch size until OOM; back off 25% for headroom.
- Enable FP16/BF16 inference on supported models.
- Log truncation count and max token length per batch.
- Apply correct asymmetric prefixes for query vs document paths.
- L2-normalize embeddings consistently with index metric (cosine vs dot).
- Separate ingest and query GPU pools if reindex overlaps live traffic.
- Export vectors/sec, batch size, and GPU memory dashboards.
- Validate recall@k on golden queries after any batching or model change.
- Content-hash skip unchanged chunks on incremental reindex.
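The recall@k check in the list above can be sketched as follows. Definitions of recall@k vary; this version assumes the per-query fraction of relevant IDs found in the top k, averaged over queries, and the golden-set data shown is invented for illustration.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Mean over queries of |top-k results ∩ relevant| / |relevant|."""
    scores = []
    for query_id, gold in relevant.items():
        if not gold:
            continue  # skip queries with no labeled relevant documents
        top_k = set(retrieved.get(query_id, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    if not scores:
        raise ValueError("no queries with relevant documents")
    return sum(scores) / len(scores)


# Invented golden set: query ID -> relevant chunk IDs.
gold = {"q1": {"a", "b"}, "q2": {"c"}}
baseline = {"q1": ["a", "x", "b"], "q2": ["y", "c"]}
candidate = {"q1": ["a", "x", "y", "z", "w", "b"], "q2": ["c"]}

print(recall_at_k(baseline, gold), recall_at_k(candidate, gold))
```

Running the same function on the old and new pipelines makes a regression visible as a drop in the second number relative to the first.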
Key takeaways
- Embedding encoders are bidirectional and highly batchable — serial one-text requests waste most GPU capacity on ingest workloads.
- Length bucketing cuts padding waste; Harbor Analytics recovered 1.6× throughput before any hardware change.
- Ingest favors large static batches; query paths need small dynamic batches with short wait windows.
- Self-hosted TEI or Triton pays off on million-chunk corpora where API per-token costs compound.
- Harbor Analytics reindex dropped from 11 hours to 47 minutes (62 → 847 vectors/sec) with bucketed TEI batching and unchanged recall@5.
Related reading
- LLM embeddings explained — vectors, similarity, and model families
- LLM embedding model selection explained — choosing the encoder before you batch it
- RAG chunking strategies explained — chunk sizes that determine batch shapes
- Vector databases explained — where batched vectors land at upsert time