LLM async batch API explained
Harbor Analytics' fraud-scoring backfill needed 1.8 million transaction summaries classified in one weekend. The team initially fanned out synchronous chat-completions calls through a worker pool — 340 workers, aggressive retry logic, and $14,200 in token spend before half the queue stalled on HTTP 429 rate limits. The refactor uploaded a single JSONL file to the provider's batch endpoint, received results within nine hours at a 50% input/output discount, and cut the bill to $6,100 with zero manual rate-limit babysitting. Latency was irrelevant: analysts needed scores by Monday morning, not millisecond TTFT.
Async batch APIs are the underrated lane of production LLM architecture. Major providers (OpenAI, Anthropic, Google) offer offline job queues that accept thousands of requests in one upload, process them within a completion window (often up to 24 hours), and return a results file — typically at sharply lower per-token rates than realtime endpoints. This guide covers when batch beats sync, JSONL job format, lifecycle and polling, partial-failure semantics, cost math vs inference serving, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.
Sync realtime vs async batch: a simple taxonomy
Most tutorials assume every LLM call is interactive chat. Production workloads split cleanly into latency-sensitive and throughput-sensitive buckets:
| Mode | Latency target | Typical use | Billing profile |
|---|---|---|---|
| Realtime sync / streaming | Sub-second to few seconds | Chat UIs, copilots, live agents | Full list price; rate limits bind |
| Provider async batch | Minutes to 24 hours | Backfills, evals, nightly reports, embeddings at scale | Discounted (often ~50%); separate quota pool |
| Self-hosted batch queue | Configurable SLA | On-prem vLLM/TGI overnight jobs, air-gapped | GPU amortization; you operate the queue |
Batch APIs are not a replacement for user-facing chat. They are how you run the 80% of token volume that has no human staring at a spinner — classification at ingest, synthetic eval sets, document summarization for search indexes, and embedding regeneration after a model upgrade.
How provider batch jobs work
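Surface APIs differ, but the pattern is broadly consistent across OpenAI Batch, Anthropic Message Batches, and Google's batch prediction endpoints. The steps below follow OpenAI's file-based flow; Anthropic's Message Batches API, for example, takes the list of requests directly in the create call rather than through a separate file upload: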
- Build a JSONL input file: one JSON object per line. Each line carries a stable `custom_id` (your primary key), the HTTP method and path (e.g. `/v1/chat/completions`), and the request body identical to what you would POST synchronously.
- Upload the file via the Files API (purpose: `batch`).
- Create a batch job referencing the input file ID and a completion window (OpenAI: `24h`).
- Poll or webhook until status is `completed`, `failed`, or `expired`.
- Download the output JSONL: each line maps `custom_id` to either a `response` object or an `error` object.
Example input line (OpenAI-style):
{"custom_id":"txn-88421","method":"POST","url":"/v1/chat/completions","body":{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Classify risk: ..."}],"max_tokens":64}}
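As a rough illustration of how such lines might be produced, the sketch below serializes requests into the OpenAI-style shape shown above. The helper names `build_batch_line` and `write_batch_file` are hypothetical, and the default model and `max_tokens` simply mirror the example line.

```python
import json


def build_batch_line(custom_id, prompt, model="gpt-4o-mini", max_tokens=64):
    """Serialize one request as a single JSONL line (OpenAI-style shape)."""
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": max_tokens,
        },
    }
    # json.dumps escapes embedded newlines, so each request stays on one line.
    return json.dumps(request, separators=(",", ":"))


def write_batch_file(path, rows):
    """Write (custom_id, prompt) pairs to a JSONL file; return the line count."""
    count = 0
    with open(path, "w", encoding="utf-8") as handle:
        for custom_id, prompt in rows:
            handle.write(build_batch_line(custom_id, prompt) + "\n")
            count += 1
    return count
```

Building each line with `json.dumps` rather than string formatting means prompts containing quotes or newlines cannot break the one-object-per-line contract.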
The output line preserves `custom_id` so you can join results back to your warehouse without relying on row order. Store the batch job ID and input file hash in your job metadata table for audit replay.
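A minimal sketch of that order-independent join follows. It assumes each output record carries a `custom_id` field and that source rows are dicts holding the same key; the function names are illustrative, not part of any provider SDK.

```python
import json


def index_results(output_lines):
    """Map custom_id -> parsed output record, ignoring line order."""
    results = {}
    for raw in output_lines:
        text = raw.strip()
        if not text:
            continue
        record = json.loads(text)
        results[record["custom_id"]] = record
    return results


def join_results(source_rows, output_lines):
    """Attach each row's output record, or None when the row has no result."""
    results = index_results(output_lines)
    return [
        {**row, "result": results.get(row["custom_id"])}
        for row in source_rows
    ]
```

Rows that come back with `result` set to `None` were never answered at all, which is a different condition from a per-line error and is worth counting separately.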
Cost, quota, and throughput economics
Batch pricing is the main reason teams adopt the pattern. OpenAI and Anthropic publish batch rates at roughly half of synchronous inference for the same model tier. At backfill scale that is not a rounding error: halving the token bill on a large recurring job can fund realtime traffic elsewhere.
- Separate rate-limit pools — batch jobs do not compete with your production chat quota the same way parallel sync workers do.
- No worker-pool overhead — you stop paying engineering time to tune concurrency, backoff, and 429 storms on backfills.
- Opportunity cost of delay — if results must land within five minutes, batch is wrong regardless of discount.
- Storage and orchestration — JSONL files, S3 staging, and job schedulers add minor infra cost; usually dwarfed by token savings above ~100k tokens per job.
Pair batch with cost optimization tactics: use smaller models in batch for classification, and reserve frontier models for the 5% of rows that fail confidence thresholds on the cheap tier.
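The cost comparison in this section reduces to simple arithmetic. The sketch below projects sync and batch spend from token counts, assuming per-million-token prices that you supply yourself and a flat discount (0.5 by default, matching the roughly 50% figure above). Function and parameter names are hypothetical.

```python
def project_cost(input_tokens, output_tokens,
                 input_price_per_mtok, output_price_per_mtok,
                 batch_discount=0.5):
    """Return (sync_cost, batch_cost) in the currency of the given prices."""
    if not 0.0 <= batch_discount < 1.0:
        raise ValueError("batch_discount must be in [0, 1)")
    sync_cost = (
        (input_tokens / 1_000_000) * input_price_per_mtok
        + (output_tokens / 1_000_000) * output_price_per_mtok
    )
    return sync_cost, sync_cost * (1.0 - batch_discount)


def within_budget(projected_cost, budget):
    """True when the projected spend fits the budget; abort the upload otherwise."""
    return projected_cost <= budget
```

With placeholder prices of 2.00 and 8.00 per million tokens (not real list prices), a job of 1 billion input tokens and 100 million output tokens projects to 2,800 on the sync path and 1,400 in batch.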
Partial failures, idempotency, and recovery
Batch jobs routinely complete with partial success: some lines succeed while others return provider errors (context length exceeded, content policy, transient 5xx wrapped per line). Your pipeline must handle this explicitly:
- Per-line error objects: never assume all-or-nothing; parse both `response` and `error` branches.
- Idempotent custom_id: use deterministic IDs (`doc_id + prompt_version`) so re-submitted recovery batches do not duplicate downstream writes.
- Failed-line requeue: extract errored lines into a smaller JSONL and submit a child batch; cap recursion depth.
- Expired batches: if the completion window passes before processing finishes, the job may expire; monitor and resubmit unprocessed IDs.
- Validation before upload: schema-check JSONL locally; one malformed line can fail file ingestion.
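The requeue step might look like the sketch below. It assumes the simplified output shape described earlier, where each line carries `custom_id` plus either a `response` or an `error`; real provider output may differ, for example by nesting an HTTP status inside `response`. The depth cap of 2 is an arbitrary illustrative default.

```python
import json


def split_outcomes(output_lines):
    """Return (succeeded_ids, failed_ids) from output JSONL lines."""
    succeeded, failed = set(), set()
    for raw in output_lines:
        if not raw.strip():
            continue
        record = json.loads(raw)
        if record.get("error") is None and record.get("response") is not None:
            succeeded.add(record["custom_id"])
        else:
            failed.add(record["custom_id"])
    return succeeded, failed


def build_child_batch(input_lines, output_lines, depth, max_depth=2):
    """Input lines to resubmit: errored IDs plus IDs with no output at all.

    Returns an empty list once the recursion cap is reached.
    """
    if depth >= max_depth:
        return []
    succeeded, _ = split_outcomes(output_lines)
    child = []
    for raw in input_lines:
        if not raw.strip():
            continue
        if json.loads(raw)["custom_id"] not in succeeded:
            child.append(raw.rstrip("\n"))
    return child
```

Selecting by "not succeeded" rather than "errored" also picks up lines left unprocessed by an expired job. A production version would likely skip non-retryable errors such as context-length or content-policy failures, which this sketch does not distinguish.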
Log batch job ID, line counts (submitted / succeeded / failed), and wall-clock duration in your observability stack. Alert when failure rate exceeds baseline — often a prompt or model deprecation issue, not random noise.
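One way to carry those counts around is a small summary record with a failure-rate check, sketched below. The `BatchSummary` type, the `should_alert` helper, and the rule of alerting at twice the baseline rate are assumptions for illustration, not a prescribed threshold.

```python
from dataclasses import asdict, dataclass


@dataclass
class BatchSummary:
    batch_id: str
    submitted: int
    succeeded: int
    failed: int
    duration_s: float

    @property
    def failure_rate(self):
        """Failed lines as a fraction of submitted lines (0.0 for empty jobs)."""
        return self.failed / self.submitted if self.submitted else 0.0

    def to_log_record(self):
        """Flat dict suitable for a structured log line or metrics event."""
        return dict(asdict(self), failure_rate=self.failure_rate)


def should_alert(summary, baseline_rate, factor=2.0):
    """True when the failure rate exceeds the baseline by the given factor."""
    return summary.failure_rate > baseline_rate * factor
```

Comparing against a multiple of the baseline, rather than a fixed percentage, keeps the alert quiet for workloads that always carry a small steady error rate.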
Harbor Analytics refactor (worked example)
Before batch adoption, Harbor's weekend backfill used Celery workers calling sync completions with token-bucket rate limiting shared with the live fraud dashboard. Dashboard latency spiked whenever a backfill ran.
After refactor:
- Friday 18:00 UTC — Spark job exports pending transactions to JSONL on object storage (4.2 GB, 1.8M lines).
- Validation step checks schema, token estimates, and dedupes by `txn_id:v3` `custom_id`.
- Batch job created; webhook on completion posts to an internal queue.
- Saturday morning — output JSONL streamed into BigQuery via load job; failed lines (0.3%) requeued in a 12k-line child batch.
- Monday 07:00 — analysts see refreshed scores; live API quota untouched.
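A validation step like the one in this pipeline could be approximated as follows. This is a sketch, not Harbor's actual code: it covers the schema check and `custom_id` dedupe only, leaves token estimation out, and the required-key set is an assumption based on the OpenAI-style input line shown earlier.

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_jsonl(lines):
    """Return (valid_lines, problems); problems are (line_number, reason)."""
    seen = set()
    valid, problems = [], []
    for number, raw in enumerate(lines, start=1):
        text = raw.strip()
        if not text:
            problems.append((number, "empty line"))
            continue
        try:
            record = json.loads(text)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict):
            problems.append((number, "not a JSON object"))
            continue
        missing = REQUIRED_KEYS - record.keys()
        if missing:
            problems.append((number, "missing keys: " + ", ".join(sorted(missing))))
            continue
        if not isinstance(record["custom_id"], str):
            problems.append((number, "custom_id is not a string"))
            continue
        if record["custom_id"] in seen:
            problems.append((number, "duplicate custom_id"))
            continue
        seen.add(record["custom_id"])
        valid.append(text)
    return valid, problems
```

Returning the problem list instead of raising on the first bad line lets the export job report every defect in one pass before anything is uploaded.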
Key design choice: the classification prompt is pinned to `prompt_hash=9f2a` in the `custom_id`, so a mid-week prompt change does not silently mix scoring versions in one batch.
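A sketch of how a prompt hash could be folded into the ID is shown below. The `entity:hash` layout, the SHA-256 choice, and the four-character truncation (chosen only to resemble `9f2a`) are assumptions; the source does not say how Harbor derives its hash.

```python
import hashlib


def prompt_hash(prompt_template, length=4):
    """Short, deterministic fingerprint of the prompt template text."""
    digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
    return digest[:length]


def make_custom_id(txn_id, prompt_template):
    """Deterministic ID: same transaction and prompt always give the same ID."""
    return f"{txn_id}:{prompt_hash(prompt_template)}"


def prompt_versions(custom_ids):
    """Set of prompt hashes found in a batch; more than one means mixed versions."""
    versions = set()
    for custom_id in custom_ids:
        _, separator, version = custom_id.rpartition(":")
        if not separator or not version:
            raise ValueError(f"custom_id has no version suffix: {custom_id!r}")
        versions.add(version)
    return versions
```

A pre-upload guard can then refuse any file where `len(prompt_versions(ids)) != 1`. Four hex characters leave only 65,536 possible values, so a longer prefix is safer if many prompt revisions accumulate.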
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Provider async batch API | Large offline jobs, 50% cost savings, SLA hours not seconds | Interactive UX, tool loops needing immediate follow-ups |
| Sync + worker pool | Medium jobs needing results within minutes, provider has no batch tier | Millions of rows; competes with production rate limits |
| Self-hosted continuous batching (vLLM) | Data residency, custom models, predictable GPU ownership | Small sporadic jobs where GPU idle time dominates cost |
| Streaming realtime API | Chat, copilots, agent turns with human in the loop | Overnight ETL where nobody reads tokens live |
| Cached / rules fallback | High-repeat queries with stable answers | Novel per-row content like transaction narratives |
| Map-reduce chunking in batch | Summarizing corpora longer than context window | Simple single-shot classification per row |
Common pitfalls
- Treating batch like sync with sleep: polling every second wastes API calls; use webhooks or exponential poll intervals.
- Unstable custom_id: random UUIDs per submit make recovery joins impossible.
- Mixing prompt versions in one job: downstream metrics become meaningless; embed the version in the ID or metadata.
- Ignoring per-line errors: a `completed` batch status does not mean every row succeeded.
- Oversized single files: split into chunks (e.g. 50k lines) for easier retry and parallel batch jobs, and check the provider's current per-batch request and size limits.
- Missing deadline buffer: a 24h completion window is an upper bound the provider may use in full, so submitting with less than that left before your own deadline risks late or expired results when the provider is backlogged.
- PII in plaintext JSONL: encrypt at rest, restrict file ACLs, delete inputs after successful load.
- No cost cap: estimate tokens before upload; abort if projection exceeds budget.
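For the polling pitfall, an exponential interval might be implemented as in the sketch below. `get_status` is a caller-supplied function (for instance a thin wrapper around the provider SDK's batch retrieve call), and the delay defaults and the inclusion of `cancelled` as a terminal status are assumptions.

```python
import time

TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(get_status, initial_delay=30.0, max_delay=900.0,
                   factor=2.0, max_wait=None, sleep=time.sleep):
    """Poll get_status() with growing intervals until a terminal status.

    Raises TimeoutError if the next sleep would exceed max_wait seconds.
    """
    delay = initial_delay
    waited = 0.0
    while True:
        status = get_status()
        if status in TERMINAL_STATUSES:
            return status
        if max_wait is not None and waited + delay > max_wait:
            raise TimeoutError(
                f"batch still {status!r} after {waited:.0f}s of polling"
            )
        sleep(delay)
        waited += delay
        # Back off geometrically, but never wait longer than max_delay.
        delay = min(delay * factor, max_delay)
```

Injecting `sleep` keeps the loop testable without real waiting. Where the provider supports webhooks, this loop is only a fallback.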
Production checklist
- Classify workloads: interactive (sync) vs deferrable (batch) before writing code.
- Use deterministic `custom_id` including entity ID and prompt/model version.
- Validate JSONL schema and token estimates locally before upload.
- Split very large jobs into shard files with independent batch IDs.
- Configure webhook or polled status handler with idempotent processing.
- Parse per-line success and error; requeue failures into child batches.
- Log batch ID, counts, duration, and token spend to observability backend.
- Alert on failure-rate spikes and expired batch jobs.
- Isolate batch quota from realtime production traffic.
- Delete input/output files from provider storage after durable ingest.
- Document SLA: “results within N hours” for stakeholder expectations.
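The sharding item in the checklist can be sketched as a small generator. The 50,000-line default echoes the chunk size suggested under pitfalls; it is not a claim about any provider's limit, and a real splitter would likely cap bytes per shard as well.

```python
from itertools import islice


def shard_lines(lines, max_lines=50_000):
    """Yield lists of at most max_lines items, preserving input order."""
    if max_lines < 1:
        raise ValueError("max_lines must be at least 1")
    iterator = iter(lines)
    while True:
        shard = list(islice(iterator, max_lines))
        if not shard:
            return
        yield shard
```

Each shard can then be written to its own file and submitted as an independent batch, so one expired or failed job only forces a retry of that shard.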
Key takeaways
- Most token volume in mature LLM products is offline — batch APIs exist for that shape.
- ~50% cost discounts and separate quotas often beat hand-tuned sync worker pools on backfills.
- The custom_id on each JSONL line is your join key; design it for idempotent recovery, not convenience.
- Partial per-line failure is normal — plan child batches and metrics, not all-or-nothing assumptions.
- Keep realtime endpoints for humans; route everything else through batch or self-hosted overnight queues.
Related reading
- LLM cost optimization explained — token budgets, model cascades, and when batch discounts compound
- LLM inference serving explained — continuous batching on your own GPUs vs provider async jobs
- LLM retry, fallback and resilience explained — when sync paths still need backoff and circuit breakers
- LLM observability explained — trace batch jobs, per-line outcomes, and cost attribution