Guide

LLM embedding batch inference explained

Harbor Analytics reindexed 2.4 million document chunks after an embedding model swap. The first attempt called a hosted embeddings API one chunk at a time. Throughput plateaued at 62 vectors per second; the full reindex would have taken about 11 hours and blown the maintenance window. The platform team deployed a self-hosted Text Embeddings Inference (TEI) server with length-bucketed dynamic batching. Throughput reached 847 vectors per second on a single L4 GPU. Total reindex time dropped to 47 minutes with identical recall@5 on the golden query set.

Embedding batch inference is the practice of grouping many texts into one forward pass through an encoder model instead of embedding strings serially. Unlike autoregressive LLM decoding, which produces one token per step, an embedding encoder is bidirectional: every token in a sequence attends to every other token in a single parallel pass. GPUs are built for wide matrix math; batching amortizes kernel launch overhead and keeps tensor cores busy. This guide covers why batching matters for RAG ingest and query paths, static vs dynamic batching, padding and length bucketing, memory limits and optimal batch sizes, self-hosted vs API serving patterns, the Harbor Analytics refactor, a technique decision table versus one-at-a-time API calls, pitfalls, and a production checklist. It builds on embedding fundamentals and chunking strategy.

Why serial embedding is slow

A single embedding forward pass on modern hardware often completes in 2–15 ms for a 512-token sequence. That sounds fast until you multiply by millions of chunks. The real cost is not FLOPs alone; it is underutilization. Sending one short paragraph per request leaves most of the GPU idle while each call pays for its own kernel launches, HTTP round trip, and Python interpreter overhead.
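To make the multiplication concrete, here is a back-of-the-envelope sketch using the Harbor Analytics figures from the introduction. The function name is illustrative, and the calculation assumes the quoted throughput is sustained for the whole job.

```python
def reindex_hours(num_chunks: int, vectors_per_sec: float) -> float:
    """Wall-clock hours to embed num_chunks at a sustained rate."""
    return num_chunks / vectors_per_sec / 3600


serial_hours = reindex_hours(2_400_000, 62)    # one chunk per API request
batched_hours = reindex_hours(2_400_000, 847)  # bucketed dynamic batching

print(f"serial:  {serial_hours:.1f} h")         # serial:  10.8 h
print(f"batched: {batched_hours * 60:.0f} min")  # batched: 47 min
```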

Embedding workloads differ from chat inference in three ways that shape batching strategy:

Property	Embedding encoder	Autoregressive LLM decode
Attention pattern	Full bidirectional (BERT-style or encoder-only)	Causal (one new token per step)
Output	Fixed-size vector per input (pooling over tokens)	Variable-length token stream
Batching benefit	High — independent sequences in one matmul	Moderate — sequences diverge in length over time

For RAG, embedding appears twice: ingest (batch millions of document chunks into a vector database) and query (embed one or few user questions per request). Ingest is embarrassingly parallel and batch-friendly. Query paths are latency-sensitive but still benefit from micro-batching when traffic is concurrent.

Static vs dynamic batching

Static batching

You collect exactly N texts (or wait until a timeout), pad them to the longest sequence in the batch, and run one forward pass. Static batching is simple to implement in Python with the Sentence Transformers library (SentenceTransformer.encode) or a Hugging Face transformers pipeline:

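from sentence_transformers import SentenceTransformer

# Assumption: the checkpoint name is only an example; use your own model.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
texts = ["first chunk of text", "second chunk of text"]

embeddings = model.encode(
    texts,
    batch_size=64,              # texts per forward pass
    show_progress_bar=True,
    normalize_embeddings=True,  # L2-normalize each output vector
)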

Static batching works well for offline ingest jobs where you control the full corpus and can sort chunks by token length before batching.
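For a custom ingest loop, length sorting can be sketched as below. This is a minimal illustration, not a library API: encode_length_sorted and its encode_batch parameter are hypothetical names, and character length stands in for token count unless you pass a tokenizer-based length_of. Recent Sentence Transformers releases appear to do a similar sort inside encode, so this mainly matters when you call a model or server directly.

```python
from typing import Callable, List, Sequence


def encode_length_sorted(
    texts: Sequence[str],
    encode_batch: Callable[[List[str]], List[List[float]]],
    batch_size: int = 64,
    length_of: Callable[[str], int] = len,
) -> List[List[float]]:
    """Embed texts in length-sorted batches and return vectors in input order.

    encode_batch is a placeholder for whatever embeds one batch
    (a model call, an HTTP request to an embedding server, ...).
    """
    # Indices of texts ordered from shortest to longest.
    order = sorted(range(len(texts)), key=lambda i: length_of(texts[i]))
    out: List = [None] * len(texts)
    for start in range(0, len(order), batch_size):
        idx = order[start:start + batch_size]
        vectors = encode_batch([texts[i] for i in idx])
        # Write each vector back to the position of its original text.
        for i, vec in zip(idx, vectors):
            out[i] = vec
    return out
```

Neighboring texts in each batch now have similar lengths, so each batch pads to a length close to its own contents rather than to the longest chunk in the corpus.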

Dynamic batching

A server accumulates incoming requests for a short window (typically 1–50 ms), groups them into a batch up to a memory limit, runs inference, and returns results to each caller. TEI, NVIDIA Triton, and vLLM's embedding mode all support dynamic batching for online serving. This is how you serve concurrent query embeddings without one-request-per-forward-pass waste.

Dynamic batching trades a small latency increase (the wait window) for higher throughput. For ingest pipelines implemented as streaming workers, the same pattern applies: workers push chunks into a shared queue; a GPU consumer drains the queue in batches.
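The accumulate-then-flush loop can be sketched with asyncio. This is a toy illustration of the pattern under stated assumptions, not how TEI, Triton, or vLLM implement it: MicroBatcher, embed_batch, and the default limits are hypothetical, and the batch is capped by request count rather than by tokens or memory.

```python
import asyncio
from typing import Callable, List


class MicroBatcher:
    """Collect concurrent requests for a short window, then embed them together."""

    def __init__(
        self,
        embed_batch: Callable[[List[str]], List[List[float]]],
        max_batch: int = 32,
        max_wait_ms: float = 5.0,
    ) -> None:
        self.embed_batch = embed_batch  # placeholder for the real model call
        self.max_batch = max_batch
        self.max_wait = max_wait_ms / 1000
        self.queue: asyncio.Queue = asyncio.Queue()

    async def embed(self, text: str) -> List[float]:
        """Called once per request; resolves when the batch containing it finishes."""
        fut = asyncio.get_running_loop().create_future()
        await self.queue.put((text, fut))
        return await fut

    async def run(self) -> None:
        """Consumer loop: wait for a first item, then fill the batch until the deadline."""
        loop = asyncio.get_running_loop()
        while True:
            batch = [await self.queue.get()]
            deadline = loop.time() + self.max_wait
            while len(batch) < self.max_batch:
                remaining = deadline - loop.time()
                if remaining <= 0:
                    break
                try:
                    batch.append(await asyncio.wait_for(self.queue.get(), remaining))
                except asyncio.TimeoutError:
                    break
            # Note: this call blocks the event loop; a real server would
            # hand it to a worker thread or a separate process.
            vectors = self.embed_batch([text for text, _ in batch])
            for (_, fut), vec in zip(batch, vectors):
                if not fut.cancelled():
                    fut.set_result(vec)


async def demo():
    batch_sizes = []

    def fake_embed_batch(texts):
        batch_sizes.append(len(texts))
        return [[float(len(t))] for t in texts]

    batcher = MicroBatcher(fake_embed_batch, max_batch=8, max_wait_ms=20)
    worker = asyncio.create_task(batcher.run())
    vectors = await asyncio.gather(*(batcher.embed(t) for t in ["a", "bb", "ccc"]))
    worker.cancel()
    return vectors, batch_sizes
```

Running asyncio.run(demo()) returns one vector per request in request order. How many forward passes that takes depends on timing, which is the trade-off the wait window controls.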

Padding, bucketing and memory

Batched sequences must share tensor dimensions, so shorter texts are padded to match the longest sequence in the batch. The attention mask keeps padding from affecting the output, but in a standard padded batch the GPU still computes over those positions. A batch mixing a 32-token title with a 512-token paragraph therefore spends 480 of the title's 512 positions on padding.
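Padding waste is easy to measure before touching the serving stack. The helper below is a small sketch (padding_waste is an illustrative name) that assumes each batch is padded to its own longest sequence and that you already have per-chunk token counts.

```python
from typing import Sequence


def padding_waste(token_lengths: Sequence[int], batch_size: int) -> float:
    """Fraction of processed positions that are padding.

    Batches are formed in the given order and each batch is padded
    to its longest member.
    """
    real = 0
    padded = 0
    for start in range(0, len(token_lengths), batch_size):
        batch = token_lengths[start:start + batch_size]
        real += sum(batch)
        padded += max(batch) * len(batch)
    return 0.0 if padded == 0 else 1 - real / padded


# The title-plus-paragraph batch from the text: 480 of 1,024 positions are padding.
print(padding_waste([32, 512], batch_size=2))  # 0.46875
```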

Length bucketing sorts or routes texts into bins (e.g. 0–64, 65–128, 129–256, 257–512 tokens) and batches only within a bin. Harbor Analytics used four buckets for their 512-token max chunk size. Padding waste fell from 41% to 8% of total tokens processed, which alone accounted for a 1.6× throughput gain before any hardware change.
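A minimal sketch of four-bucket routing follows. The bucket edges are the example bins from the paragraph above; whether Harbor Analytics used exactly these edges is not stated, and the function names are illustrative.

```python
from bisect import bisect_left
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Inclusive upper bounds for the bins 0-64, 65-128, 129-256, 257-512.
BUCKET_UPPER_BOUNDS = [64, 128, 256, 512]


def bucket_index(n_tokens: int) -> int:
    """Index of the first bucket whose upper bound is >= n_tokens."""
    i = bisect_left(BUCKET_UPPER_BOUNDS, n_tokens)
    if i == len(BUCKET_UPPER_BOUNDS):
        raise ValueError(f"{n_tokens} tokens exceeds the largest bucket")
    return i


def bucketed_batches(
    items: Iterable[Tuple[str, int]], batch_size: int
) -> Iterator[Tuple[int, List[str]]]:
    """Yield (bucket, chunk_ids) batches; items are (chunk_id, n_tokens) pairs."""
    pending: Dict[int, List[str]] = defaultdict(list)
    for chunk_id, n_tokens in items:
        b = bucket_index(n_tokens)
        pending[b].append(chunk_id)
        if len(pending[b]) == batch_size:
            yield b, pending.pop(b)  # full batch: emit and reset this bucket
    for b in sorted(pending):        # flush partially filled buckets at the end
        yield b, pending[b]
```

The reported gain is consistent with simple arithmetic: if useful tokens rise from 59% to 92% of all positions processed, the same GPU work yields roughly 0.92 / 0.59 ≈ 1.56 times as many real tokens.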

GPU memory scales roughly with batch_size × max_seq_len × hidden_dim for activations. Finding the optimal batch size is an empirical sweep:

Start at batch 32 for a 768-dim / 512-max-seq model on 16 GB VRAM.
Double until OOM, then back off 25%.
Log vectors/sec and memory headroom; peak throughput often sits below max batch.
Use FP16 or BF16 inference; many embedding models support half precision natively.
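The sweep above can be sketched as a small harness. Everything here is illustrative: run_batch is a hypothetical callable that embeds one batch of the given size, and "back off 25%" is read as 25% below the largest batch that fit. On a GPU, run_batch should wait for the device to finish (for example with torch.cuda.synchronize()) before returning, and oom_error would be the framework's out-of-memory exception rather than MemoryError.

```python
import time
from typing import Callable, Dict, Optional, Type


def sweep_batch_size(
    run_batch: Callable[[int], None],
    start: int = 32,
    max_batch: int = 4096,
    oom_error: Type[BaseException] = MemoryError,
) -> Dict[str, object]:
    """Double the batch size until it fails, recording vectors/sec per size."""
    throughput: Dict[int, float] = {}
    largest_ok: Optional[int] = None
    size = start
    while size <= max_batch:
        t0 = time.perf_counter()
        try:
            run_batch(size)
        except oom_error:
            break  # this size does not fit; stop doubling
        elapsed = time.perf_counter() - t0
        throughput[size] = size / elapsed if elapsed > 0 else float("inf")
        largest_ok = size
        size *= 2
    if largest_ok is None:
        raise RuntimeError("even the starting batch size did not fit")
    return {
        "throughput": throughput,                          # vectors/sec by batch size
        "largest_ok": largest_ok,                          # biggest size that ran
        "with_headroom": int(largest_ok * 0.75),           # 25% back-off
        "fastest": max(throughput, key=throughput.get),    # may be below largest_ok
    }
```

Comparing "fastest" with "largest_ok" shows whether peak throughput sits below the maximum batch, as the third step suggests it often does.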

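For very large ingest jobs, INT8 quantization of the model's weights can free GPU memory for larger batches. Quantizing the output vectors (INT8 or binary) is a separate step that shrinks index storage rather than batch memory. Either change should be checked against a retrieval benchmark, since the recall impact is usually small but not zero.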

Ingest vs query paths

The two embedding paths have different SLOs and batch shapes:

Path	Goal	Typical batching	Failure mode
Ingest / reindex	Maximize vectors per second	Large static batches (64–256), length-sorted	OOM from oversized batch; silent truncation if max_length too low
Online query	Minimize P99 latency	Small dynamic batches (4–32), short wait window	Queue buildup under burst; cold-start on scaled-to-zero
Hybrid (ingest + live queries)	Fair share	Separate GPU pools or priority queues	Reindex starves query latency

Harbor Analytics runs ingest on a dedicated TEI replica and query on a smaller always-warm instance. During full reindex, query traffic never contends with million-chunk backfills. For smaller teams, time-slicing (reindex off-peak) or a single server with ingest deprioritized in the queue works if query volume is low.

Serving options

Hosted embedding APIs

OpenAI, Cohere, Voyage, and others expose embedding endpoints that batch internally. You send up to 2,048 inputs per request on some providers. This is the fastest path when you lack GPU capacity — but per-token pricing adds up on million-chunk corpora, and you cannot tune batch internals.
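Moving from one text per request to bulk requests is mostly a matter of slicing the input list. The sketch below assumes a 2,048-input cap as quoted above; the real limit, and any per-request token cap, varies by provider and should be checked in its documentation. request_batches is an illustrative helper, not part of any SDK.

```python
import math
from typing import Iterator, List, Sequence


def request_batches(texts: Sequence[str], max_inputs: int = 2048) -> Iterator[List[str]]:
    """Split texts into consecutive lists of at most max_inputs items."""
    for start in range(0, len(texts), max_inputs):
        yield list(texts[start:start + max_inputs])


# Request count for a 2.4M-chunk corpus: one per chunk vs. full bulk requests.
serial_requests = 2_400_000
bulk_requests = math.ceil(2_400_000 / 2048)
print(serial_requests, bulk_requests)  # 2400000 1172
```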

Text Embeddings Inference (TEI)

Hugging Face TEI is a Rust server optimized for sentence-transformer models. It supports dynamic batching, token-based truncation, Prometheus metrics, and OpenAI-compatible /v1/embeddings routes. Harbor Analytics deployed ghcr.io/huggingface/text-embeddings-inference with --max-batch-tokens 16384 and --max-client-batch-size 256.
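A client for such a server can stay very small. The sketch below targets TEI's native /embed route using only the standard library; the base URL is a placeholder, and the request fields (inputs, truncate) reflect my understanding of the TEI API, so verify them against the version you deploy. Each request should carry no more inputs than the server's --max-client-batch-size.

```python
import json
import urllib.request
from typing import List


def build_embed_request(base_url: str, texts: List[str]) -> urllib.request.Request:
    """Build a POST to <base_url>/embed carrying one batch of texts."""
    body = json.dumps({
        "inputs": texts,
        # Assumption: with truncate=False the server rejects over-length
        # inputs instead of cutting them silently.
        "truncate": False,
    }).encode("utf-8")
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/embed",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def embed(base_url: str, texts: List[str], timeout: float = 30.0) -> List[List[float]]:
    """Send one batch and return one vector per input text."""
    request = build_embed_request(base_url, texts)
    with urllib.request.urlopen(request, timeout=timeout) as resp:
        return json.loads(resp.read())
```

A call such as embed("http://localhost:8080", batch) would then be made once per batch, with batch holding at most 256 texts under the settings quoted above.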

Sentence Transformers / custom Python

Fine for notebooks and one-off jobs. Production ingest at scale usually moves to a dedicated server once reindex exceeds a few hours or must run on a schedule.

vLLM embedding mode

If you already run vLLM for chat, its embedding task type can serve encoder models with similar batching infrastructure. Useful when you want one ops stack for both chat and retrieval encoders.

Harbor Analytics ingest refactor

The refactor had four stages:

Chunk audit — verified all 2.4M chunks respected the 512-token limit from the chunking pipeline; 0.3% were re-split.
Length sort — pre-sorted chunk IDs by token count into four buckets before streaming to TEI.
Batched upsert — embedded in batches of 128, upserted to Qdrant in batches of 1,000 vectors with wait=false for async indexing.
Validation — ran 500 golden queries; recall@5 matched the serial baseline at 94.2%.
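The batched-upsert stage decouples two batch sizes: a smaller one for the GPU and a larger one for the vector database. A minimal sketch of that buffering follows. The callables embed_batch and upsert are hypothetical stand-ins; in a pipeline like the one described, upsert would wrap the Qdrant client's upsert call with wait=False.

```python
from typing import Callable, Iterable, List, Tuple

Vector = List[float]


def embed_and_upsert(
    chunks: Iterable[Tuple[str, str]],
    embed_batch: Callable[[List[str]], List[Vector]],
    upsert: Callable[[List[Tuple[str, Vector]]], None],
    embed_size: int = 128,
    upsert_size: int = 1000,
) -> int:
    """Embed (chunk_id, text) pairs in small batches, upsert in large ones.

    Returns the number of vectors handed to upsert.
    """
    pending: List[Tuple[str, str]] = []    # waiting to be embedded
    buffer: List[Tuple[str, Vector]] = []  # embedded, waiting to be upserted
    total = 0

    def embed_pending() -> None:
        vectors = embed_batch([text for _, text in pending])
        buffer.extend((cid, vec) for (cid, _), vec in zip(pending, vectors))
        pending.clear()

    def drain(final: bool = False) -> None:
        nonlocal total
        # Send full upsert batches; on the final call also send the remainder.
        while len(buffer) >= upsert_size or (final and buffer):
            out = buffer[:upsert_size]
            upsert(out)
            total += len(out)
            del buffer[:upsert_size]

    for item in chunks:
        pending.append(item)
        if len(pending) == embed_size:
            embed_pending()
            drain()
    if pending:
        embed_pending()
    drain(final=True)
    return total
```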

End-to-end ingest rose from 62 to 847 vectors/sec (13.7×). Cost per reindex fell from ~$180 in API fees to ~$4 in GPU time on a reserved L4.
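The validation stage compares recall@5 before and after the change. The source does not say which definition of recall@k was used; the sketch below assumes the common one, the fraction of each query's relevant documents that appear in its top k results, averaged over queries. The function and argument names are illustrative.

```python
from typing import Dict, List, Set


def recall_at_k(
    retrieved: Dict[str, List[str]],
    relevant: Dict[str, Set[str]],
    k: int = 5,
) -> float:
    """Mean per-query recall@k.

    retrieved maps query id -> ranked document ids (best first).
    relevant maps query id -> set of document ids judged relevant.
    Queries with no relevant documents are skipped.
    """
    scores = []
    for query_id, gold in relevant.items():
        if not gold:
            continue
        top_k = set(retrieved.get(query_id, [])[:k])
        scores.append(len(top_k & gold) / len(gold))
    return sum(scores) / len(scores) if scores else 0.0
```

Running it once on results from the old pipeline and once on results from the new one, with the same golden queries, gives the two numbers to compare.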

Technique decision table

Approach	Best when	Skip when
Serial API calls (one text per request)	Prototypes, <10K chunks, no GPU	Million-chunk corpora; cost-sensitive reindex
Bulk API calls (provider batches inputs)	Medium corpora; no infra team	Need custom model or tight query-latency SLO tuning
Static Python batching (offline job)	Scheduled reindex; single-machine GPU	Concurrent online query load on same GPU
TEI / Triton dynamic batching	Production query + ingest; concurrent traffic	Corpus fits in memory on CPU-only box
Length bucketing + sorted ingest	Variable chunk sizes; padding waste >20%	All chunks are near uniform length
Separate ingest vs query pools	Large reindex while serving live traffic	Dev/staging with no SLO

Pitfalls

Truncation without logging: batches silently cut text at max_length; retrieval quality drops on long chunks.
Missing query prefixes: asymmetric models need the prefixes they were trained with (E5 uses query: / passage:; BGE uses a query-side instruction); batching does not fix wrong prefixes.
Normalization inconsistency: L2-normalize at embed time or at index time, and apply the same choice to every batch and to queries.
Padding-dominated batches: mixing lengths without bucketing can make batch 128 slower than batch 32.
Reindex during peak query: a shared GPU causes P99 query latency spikes.
Dimension mismatch on upsert: the batched pipeline swaps the model but the vector DB schema still expects the old dimension.
Ignoring deduplication: re-embedding unchanged chunks wastes batch capacity; content-hash skip lists save hours.
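The content-hash skip list from the last pitfall can be sketched as follows. The storage format is an assumption: previous_hashes is a plain dict from chunk id to the hash recorded on the last run, and the helper names are illustrative.

```python
import hashlib
from typing import Dict, Iterable, List, Tuple


def content_hash(text: str) -> str:
    """Stable fingerprint of a chunk's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def chunks_to_embed(
    chunks: Iterable[Tuple[str, str]],
    previous_hashes: Dict[str, str],
) -> List[Tuple[str, str, str]]:
    """Return (chunk_id, text, hash) for chunks that are new or whose text changed."""
    changed = []
    for chunk_id, text in chunks:
        digest = content_hash(text)
        if previous_hashes.get(chunk_id) != digest:
            changed.append((chunk_id, text, digest))
    return changed
```

A skip list like this only applies to incremental runs with the same model. After a model swap every chunk needs a new vector, so it is worth storing the model identifier alongside the hashes and clearing them when it changes.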

Production checklist

Measure baseline ingest throughput (vectors/sec) before optimizing.
Sort or bucket chunks by token length before large static batches.
Sweep batch size until OOM; back off 25% for headroom.
Enable FP16/BF16 inference on supported models.
Log truncation count and max token length per batch.
Apply correct asymmetric prefixes for query vs document paths.
L2-normalize embeddings consistently with index metric (cosine vs dot).
Separate ingest and query GPU pools if reindex overlaps live traffic.
Export vectors/sec, batch size, and GPU memory dashboards.
Validate recall@k on golden queries after any batching or model change.
Content-hash skip unchanged chunks on incremental reindex.

Key takeaways

Embedding encoders are bidirectional and highly batchable — serial one-text requests waste most GPU capacity on ingest workloads.
Length bucketing cuts padding waste; Harbor Analytics recovered 1.6× throughput before any hardware change.
Ingest favors large static batches; query paths need small dynamic batches with short wait windows.
Self-hosted TEI or Triton pays off on million-chunk corpora where API per-token costs compound.
Harbor Analytics reindex dropped from 11 hours to 47 minutes (62 → 847 vectors/sec) with bucketed TEI batching and unchanged recall@5.