Guide

LLM agent inference batching and request coalescing systems explained

Harbor Support operates a ticket-triage agent fleet: classify intent, draft a reply, run a policy check, and enqueue escalation when needed. At Monday-morning peak the fleet issued roughly 1,800 model calls per minute, each as an isolated HTTP request to a self-hosted inference cluster. GPU utilization averaged 19% while p95 step latency reached 3.8 seconds and inference cost per run climbed to $0.41 against a $0.12 budget. FinOps flagged 34% of weekly runs as budget breaches — not because prompts were wasteful, but because the runtime treated every agent step as a solo trip to an under-filled GPU.

Inference batching and request coalescing group compatible model calls into shared forward passes so tensor cores stay busy. This is distinct from prompt caching (reuse prefill across turns with identical prefixes) and from parallel tool execution (overlap non-model I/O). Coalescing answers: “can we schedule these in-flight completions together without breaking SLAs or tenant isolation?” This guide covers the scheduler stack, static vs continuous batching, coalescing windows, priority lanes, streaming disaggregation, the Harbor Support refactor, a decision table versus caching and routing alone, pitfalls, and a production checklist tied to cost attribution and streaming delivery.

Why agents amplify the batching problem

A single chat completion is easy to reason about: one user, one prompt, one response. Agent runtimes multiply calls:

  • Multi-step loops — classify, plan, tool, summarize: 3–8 model hops per user request.
  • Burst concurrency — webhooks and cron ticks align; hundreds of runs start within the same second.
  • Heterogeneous shapes — system prompts differ by tenant; context lengths span 2K–48K tokens.
  • Mixed SLAs — live chat needs sub-second first token; batch digest jobs tolerate seconds of wait.

Calling a hosted API with per-token pricing masks some of this inefficiency: you pay per token, not per idle GPU cycle. Self-hosted fleets and reserved capacity expose it immediately: low utilization with high queue depth means you are paying for silicon that sits empty while agents wait in serial lanes. Even on hosted APIs, coalescing and pacing requests on the client side can reduce rate-limit pressure and improve tail latency when the provider batches internally and rewards steady ingress.

Three-layer scheduler architecture

Production coalescing stacks usually separate concerns:

1. Ingress admission queue

Accepts InferenceJob records from agent workers: run_id, step_id, model_id, priority, max_wait_ms, prompt_hash, token_budget, streaming flag, tenant_id. Admission applies tenant quotas and rejects jobs that would exceed per-run token ceilings before they touch the GPU.
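The admission step can be sketched as follows. This is a minimal illustration, not a reference implementation: the field names come from the record described above, while the AdmissionQueue class, the integer priority encoding, and the rejection reason strings are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InferenceJob:
    # Fields mirror the InferenceJob record described in the text.
    run_id: str
    step_id: str
    model_id: str
    priority: int        # assumed encoding: 0 = P0, 1 = P1, 2 = P2
    max_wait_ms: int
    prompt_hash: str
    token_budget: int    # tokens this step may consume
    streaming: bool
    tenant_id: str


class AdmissionQueue:
    """Hypothetical gate: per-tenant queue quota plus a per-run token ceiling."""

    def __init__(self, tenant_quota: dict[str, int], run_token_ceiling: int):
        self.tenant_quota = tenant_quota            # max queued jobs per tenant
        self.run_token_ceiling = run_token_ceiling  # max tokens summed over one run
        self.queued_per_tenant: dict[str, int] = {}
        self.tokens_per_run: dict[str, int] = {}
        self.queue: list[InferenceJob] = []

    def admit(self, job: InferenceJob) -> tuple[bool, str]:
        queued = self.queued_per_tenant.get(job.tenant_id, 0)
        # Unknown tenants have a quota of 0 and are always rejected.
        if queued >= self.tenant_quota.get(job.tenant_id, 0):
            return False, "tenant_quota_exceeded"
        used = self.tokens_per_run.get(job.run_id, 0)
        if used + job.token_budget > self.run_token_ceiling:
            return False, "run_token_ceiling_exceeded"
        self.queued_per_tenant[job.tenant_id] = queued + 1
        self.tokens_per_run[job.run_id] = used + job.token_budget
        self.queue.append(job)
        return True, "admitted"
```

Rejecting at this layer keeps over-budget runs from ever occupying a slot in the coalescing buffer. A production gate would also decrement the tenant counter when a job leaves the queue, which this sketch omits.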

2. Coalescing scheduler

Holds jobs for a short coalescing window (typically 5–50 ms for interactive, up to 500 ms for batch tiers). Within the window it builds batches that share:

  • Same base model and quantization profile
  • Compatible attention backend (e.g. both support flash-attn v2)
  • Streaming policy — streaming and non-streaming jobs usually split
  • LoRA / adapter ID when using parameter-efficient fine-tunes

The scheduler emits BatchPlan objects, each carrying an ordered job list, padded sequence lengths, and cancellation hooks for jobs that exceed max_wait_ms.
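A simplified view of the grouping step is shown below. The compatibility key follows the four criteria listed above; the PendingJob type, the compat_key and plan_batches helpers, and the max_batch parameter are illustrative assumptions, and the cancellation hooks are left out for brevity.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class PendingJob:
    run_id: str
    model_id: str
    quant: str                  # e.g. "fp8" or "bf16"
    attn_backend: str
    streaming: bool
    adapter_id: Optional[str]   # LoRA / adapter ID, None for the base model
    prompt_tokens: int


@dataclass
class BatchPlan:
    jobs: list[PendingJob]
    padded_len: int             # every sequence is padded to the longest prompt


def compat_key(job: PendingJob) -> tuple:
    # Jobs may share a batch only when every element of this key matches.
    return (job.model_id, job.quant, job.attn_backend, job.streaming, job.adapter_id)


def plan_batches(window: list[PendingJob], max_batch: int) -> list[BatchPlan]:
    groups = defaultdict(list)
    for job in window:
        groups[compat_key(job)].append(job)
    plans = []
    for jobs in groups.values():
        # Split oversized groups into chunks of at most max_batch jobs.
        for i in range(0, len(jobs), max_batch):
            chunk = jobs[i:i + max_batch]
            plans.append(BatchPlan(chunk, max(j.prompt_tokens for j in chunk)))
    return plans
```

Arrival order is preserved inside each group, so the oldest job in a window is never pushed behind newer compatible jobs.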

3. Inference worker pool

Workers execute forward passes — often vLLM, TensorRT-LLM, or TGI with continuous batching (requests join and leave mid-batch as sequences complete). Workers stream token deltas back to the scheduler, which routes chunks to the correct run_id/step_id subscriber.

Agent worker                Coalescing scheduler           GPU worker
    |  InferenceJob(run_A)  ->  [window 20ms]  ->  BatchPlan[ A,B,C ]
    |  InferenceJob(run_B)  ->       |         ->  continuous batch
    |  InferenceJob(run_C)  ->       |         ->  stream demux -> SSE

Static, dynamic, and continuous batching

| Mode | How it works | Agent fit | Tradeoff |
|---|---|---|---|
| Static batch | Collect N jobs, pad to max length, one forward pass | Offline eval, nightly digest agents | Simple; wastes compute on short jobs in long batches |
| Dynamic batch | Regroup when a job finishes; refill slots | Mixed-length tool-summary steps | Better utilization; more scheduler complexity |
| Continuous batching | Iteration-level scheduling; sequences enter/exit each decode step | High-QPS interactive agent fleets | Best GPU fill; requires inference engine support |

Agent fleets almost always need continuous or dynamic batching for interactive tiers. Static batching remains useful for scheduled agents that process thousands of similar records (invoice extraction, log classification) where a 200–500 ms coalescing window is invisible to users.
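To make the utilization gap concrete, the toy model below counts decode steps for a static schedule and for a continuous one. It is a deliberately simplified sketch: it ignores prefill, assumes each occupied slot emits one token per step, and uses made-up decode lengths. The function names are hypothetical.

```python
import heapq


def static_steps(decode_lens: list[int], slots: int) -> int:
    """Each group of `slots` jobs runs until its longest sequence finishes."""
    return sum(max(decode_lens[i:i + slots]) for i in range(0, len(decode_lens), slots))


def continuous_steps(decode_lens: list[int], slots: int) -> int:
    """A freed slot is refilled immediately by the next waiting job (FIFO)."""
    free_at = [0] * slots            # step at which each slot becomes free
    heapq.heapify(free_at)
    for n in decode_lens:
        start = heapq.heappop(free_at)
        heapq.heappush(free_at, start + n)
    return max(free_at)


def utilization(decode_lens: list[int], slots: int, steps: int) -> float:
    """Useful tokens divided by slot-steps available."""
    return sum(decode_lens) / (slots * steps)


# Assumed workload: two long summaries mixed with short classifier outputs.
lens = [400, 20, 30, 25, 380, 15, 40, 35]
s = static_steps(lens, slots=4)
c = continuous_steps(lens, slots=4)
print(s, round(utilization(lens, 4, s), 3))
print(c, round(utilization(lens, 4, c), 3))
```

With these assumed lengths the static schedule needs 780 decode steps at about 30% slot utilization, while the continuous schedule finishes in 400 steps at about 59%, because short jobs stop holding slots hostage to the longest sequence in their group.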

Coalescing windows and SLA classes

The coalescing window is the central latency–throughput knob:

  • Too short (0–2 ms) — behaves like unbatched; GPUs starve under bursty agent load.
  • Balanced (10–30 ms) — typical for chat agents; often recovers 2–4× throughput with <30 ms added wait.
  • Aggressive (100–500 ms) — acceptable for async webhook agents; pairs with async job queues.
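A quick way to reason about these ranges is to estimate how many jobs arrive during one window. The sketch below assumes a steady arrival rate within the window; the 500 jobs/s burst figure is an illustrative assumption, while 30 jobs/s corresponds to the 1,800 calls per minute quoted for Harbor Support.

```python
def expected_jobs_per_window(arrival_rate_per_s: float, window_ms: float) -> float:
    """Mean number of arrivals during one coalescing window."""
    return arrival_rate_per_s * window_ms / 1000.0


BURST_RATE = 500.0    # assumed burst: jobs per second
AVERAGE_RATE = 30.0   # 1,800 calls per minute

for window_ms in (2, 20, 400):
    print(window_ms, expected_jobs_per_window(BURST_RATE, window_ms))

print(expected_jobs_per_window(AVERAGE_RATE, 20))
```

Under the assumed burst, a 2 ms window collects about one job, a 20 ms window about ten, and a 400 ms window about two hundred. At the average rate a 20 ms window sees fewer than one job, which is why window tuning should be judged against burst traffic rather than mean QPS.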

Define priority lanes so a digest batch cannot block escalation-classifier steps:

P0 interactive  max_wait=25ms   coalesce=8ms   preemption=allowed
P1 standard     max_wait=120ms  coalesce=25ms  preemption=none
P2 batch        max_wait=30s    coalesce=400ms preemption=none

Harbor Support mapped Zendesk SLA tags to lanes: enterprise P0, default P1, nightly re-summarization P2. P0 jobs that waited longer than max_wait_ms flushed as singleton batches rather than waiting for fill — preserving tail latency guarantees.
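The flush rule can be expressed as a small predicate. The lane values below are copied from the table above; the should_flush function and its min_fill parameter (the fill level the scheduler would like to reach once the window has elapsed) are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Lane:
    name: str
    max_wait_ms: int
    coalesce_ms: int


LANES = {
    "P0": Lane("P0", max_wait_ms=25, coalesce_ms=8),
    "P1": Lane("P1", max_wait_ms=120, coalesce_ms=25),
    "P2": Lane("P2", max_wait_ms=30_000, coalesce_ms=400),
}


def should_flush(lane: Lane, waits_ms: list[float], min_fill: int, max_batch: int) -> bool:
    """Decide whether the jobs currently buffered in a lane should be dispatched.

    waits_ms holds how long each buffered job has been waiting.
    """
    if not waits_ms:
        return False
    oldest = max(waits_ms)
    if len(waits_ms) >= max_batch:
        return True                   # batch is full
    if oldest >= lane.max_wait_ms:
        return True                   # deadline hit: flush even a singleton
    # Window elapsed: flush only if the batch reached the desired fill.
    return oldest >= lane.coalesce_ms and len(waits_ms) >= min_fill
```

The deadline check comes before the fill check, so an under-filled P0 batch is dispatched once its oldest job reaches max_wait_ms instead of waiting for more arrivals.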

Agent-specific batching concerns

Heterogeneous prompts

Agents rarely share identical prompts. Batch anyway when the model weights and adapter match; padding handles length differences. Group by precision or quantization variant (FP8 vs BF16): each variant is a separate set of weights and kernels, so jobs for different variants cannot share a forward pass. For tenants with unique system prompts, consider bucketing by prompt template ID so that batches line up with the groups that prefix caching already benefits.

Multi-step runs and cancellation

When a user cancels a run, the scheduler must evict pending jobs from the coalescing buffer and abort in-flight sequences on the worker. Tie eviction to cancellation tokens propagated from the agent runtime so cancelled steps do not consume GPU slots.
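One possible shape for the eviction path is sketched below. The CoalescingBuffer class and its fields are hypothetical, and the aborted_seq_ids list stands in for whatever abort call the inference engine actually exposes.

```python
from dataclasses import dataclass, field


@dataclass
class CoalescingBuffer:
    # (run_id, step_id) pairs still waiting inside the coalescing window.
    pending: list[tuple[str, str]] = field(default_factory=list)
    # Jobs already on a worker, mapped to the worker's sequence id.
    in_flight: dict[tuple[str, str], int] = field(default_factory=dict)
    # Stand-in for abort requests sent to the worker.
    aborted_seq_ids: list[int] = field(default_factory=list)

    def cancel_run(self, run_id: str) -> int:
        """Drop pending jobs for run_id and abort its in-flight sequences.

        Returns the number of pending jobs evicted.
        """
        before = len(self.pending)
        self.pending = [job for job in self.pending if job[0] != run_id]
        evicted = before - len(self.pending)
        for job in [j for j in self.in_flight if j[0] == run_id]:
            self.aborted_seq_ids.append(self.in_flight.pop(job))
        return evicted
```

Keying both structures by run_id lets a single cancellation token from the agent runtime clear every step of the run, whether it is still buffered or already decoding.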

Streaming demultiplexing

Continuous batching interleaves decode steps. The scheduler tags each token delta with (run_id, step_id, seq_index) before forwarding to SSE subscribers. Clients never see cross-talk even though the GPU processed one batched matmul.
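The tagging scheme can be illustrated with a small demultiplexer. The TokenDelta type and demux function are assumptions for this sketch, and it collects complete strings for clarity; a live gateway would instead push each delta to the subscriber's SSE channel as it arrives.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple


class TokenDelta(NamedTuple):
    run_id: str
    step_id: str
    seq_index: int   # position of this delta within its own sequence
    text: str


def demux(deltas: Iterable[TokenDelta]) -> dict[tuple[str, str], str]:
    """Route interleaved deltas to per-(run_id, step_id) streams."""
    streams = defaultdict(list)
    for delta in deltas:
        streams[(delta.run_id, delta.step_id)].append(delta)
    # Sorting by seq_index makes the result independent of arrival order.
    return {
        key: "".join(d.text for d in sorted(chunks, key=lambda d: d.seq_index))
        for key, chunks in streams.items()
    }


# Deltas from two runs, interleaved as a batched decode loop might emit them.
mixed = [
    TokenDelta("A", "s1", 0, "Hel"),
    TokenDelta("B", "s1", 0, "Re"),
    TokenDelta("A", "s1", 1, "lo"),
    TokenDelta("B", "s1", 1, "fund"),
]
print(demux(mixed))
```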

Tool-loop backpressure

When tool calls dominate, model queue depth drops and batching helps less. Instrument the model-active ratio (the fraction of run time spent waiting on inference). Harbor Support discovered that 41% of p95 latency was tool I/O rather than inference wait, so fixing CRM pagination mattered as much as GPU scheduling.
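The ratio is simple to compute from per-step timing spans. The span format below, (kind, duration_ms) tuples with kind set to "model" or "tool", is an assumed shape for illustration; real traces would come from the runtime's own instrumentation.

```python
def model_active_ratio(spans: list[tuple[str, float]]) -> float:
    """Share of total run time spent waiting on inference."""
    total = sum(ms for _, ms in spans)
    if total == 0:
        return 0.0
    model = sum(ms for kind, ms in spans if kind == "model")
    return model / total


# Assumed run: two model steps around one slow tool call.
run = [("model", 300.0), ("tool", 900.0), ("model", 300.0)]
print(model_active_ratio(run))
```

A run like this one spends 40% of its time on inference, so even a perfect batching layer could not remove the remaining 60% spent in tools.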

Harbor Support refactor

Before the refactor, agent workers called POST /v1/completions directly, there was no shared queue, and the Horizontal Pod Autoscaler (HPA) scaled pod count on CPU while GPU nodes stayed cold. The refactor made five changes:

  1. Introduced a dedicated inference gateway with admission control and tenant quotas.
  2. Deployed vLLM with continuous batching on a fixed GPU pool sized to 70% target utilization.
  3. Implemented 8 ms / 25 ms coalesce windows for P0/P1 lanes with max_wait enforcement.
  4. Attributed batch efficiency metrics per tenant in the cost ledger (tokens per GPU-second, queue wait ms).
  5. Kept P2 digest jobs on a separate low-priority worker group to avoid head-of-line blocking.

Results after four weeks: GPU utilization 19% → 68%, p95 model step latency 3.8 s → 0.9 s, inference cost per run $0.41 → $0.11, weekly budget-breach rate 34% → 4.8%. No change to agent prompts or tool schemas — pure scheduling and capacity alignment.

Decision table: batching vs adjacent techniques

| Technique | Primary win | When batching is better | When the alternative wins |
|---|---|---|---|
| Prompt caching | Cheaper prefill on repeated prefixes | Bursty diverse prompts, low prefix overlap | Stable system prompts across thousands of runs |
| Parallel tool execution | Shorter wall-clock when tools are independent | Model-bound steps with synchronized GPU queue | Tool-bound runs with rare model calls |
| Smaller / routed models | Lower per-token cost on easy steps | Already on smallest viable model but GPUs idle | Quality-sensitive steps needing frontier models |
| Request coalescing | Higher GPU utilization, lower tail latency under burst | Self-hosted fleets, reserved capacity, rate-limit storms | Single-digit QPS with strict per-request isolation needs |

Mature fleets combine all four: cache static prefixes, route easy steps to small models, parallelize read tools, and coalesce what remains.

Common pitfalls

  • One global coalescing window — P2 batch jobs inflate P0 latency; split lanes.
  • Ignoring max_wait_ms — chasing fill rate turns coalescing into unbounded delay.
  • Mixing model variants in one batch — silent kernel fallback or hard failures.
  • Scaling HTTP pods instead of GPU schedulers — multiplies idle workers; fix the gateway.
  • No cancellation propagation — cancelled runs burn GPU on abandoned sequences.
  • Attributing savings only at invoice level — without per-run wait and batch-fill metrics, regressions hide inside averages.
  • Batching across tenant secret boundaries — shared KV cache leaks are catastrophic; enforce tenant-scoped worker pools when required.
  • Assuming hosted APIs batch for you — provider-side batching is opaque; client-side pacing and async batch APIs still help on bursty agent ingress.

Production checklist

  • Instrument queue depth, coalescing wait ms, batch size, and GPU utilization per model.
  • Define P0/P1/P2 lanes with explicit max_wait_ms and coalescing windows.
  • Enforce model+adapter+precision grouping in the scheduler.
  • Wire cancellation tokens from agent runtime through to inference eviction.
  • Demux streaming tokens by run_id and step_id before SSE delivery.
  • Record tokens per GPU-second and queue wait in the cost ledger.
  • Load-test Monday-morning burst patterns, not average QPS.
  • Separate batch-tier workers from interactive pools.
  • Alert when GPU utilization < 40% AND p95 queue wait > SLA.
  • Reconcile batching wins against tool-loop duration before scaling GPUs.
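The alert item in the checklist reduces to a two-condition predicate. The sketch below assumes utilization is reported as a fraction between 0 and 1 and that queue wait and SLA share the same unit; the should_alert name and its parameters are hypothetical.

```python
def should_alert(
    gpu_util: float,
    p95_queue_wait_ms: float,
    sla_ms: float,
    util_floor: float = 0.40,
) -> bool:
    """True when GPUs are under-filled while jobs still queue past the SLA."""
    return gpu_util < util_floor and p95_queue_wait_ms > sla_ms


print(should_alert(gpu_util=0.19, p95_queue_wait_ms=3800, sla_ms=120))
print(should_alert(gpu_util=0.68, p95_queue_wait_ms=3800, sla_ms=120))
```

Requiring both conditions avoids paging on an idle fleet with no backlog, and on a busy fleet where long waits reflect a genuine capacity shortfall and not a scheduling problem.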

Key takeaways

  • Agent fleets multiply small model calls into GPU-starving bursts.
  • Coalescing schedulers batch compatible jobs within SLA-bounded windows.
  • Continuous batching maximizes utilization for interactive agent steps.
  • Priority lanes prevent batch-tier jobs from blocking live chat.
  • Harbor Support cut budget breaches from 34% to 4.8% with scheduling alone.
