LLM async batch API explained

Harbor Analytics' fraud-scoring backfill needed 1.8 million transaction summaries classified in one weekend. The team initially fanned out synchronous chat-completions calls through a worker pool — 340 workers, aggressive retry logic, and $14,200 in token spend before half the queue stalled on HTTP 429 rate limits. The refactor uploaded a single JSONL file to the provider's batch endpoint, received results within nine hours at a 50% input/output discount, and cut the bill to $6,100 with zero manual rate-limit babysitting. Latency was irrelevant: analysts needed scores by Monday morning, not millisecond TTFT.

Async batch APIs are the underrated lane of production LLM architecture. Major providers (OpenAI, Anthropic, Google) offer offline job queues that accept thousands of requests in one upload, process them within a completion window (often up to 24 hours), and return a results file — typically at sharply lower per-token rates than realtime endpoints. This guide covers when batch beats sync, JSONL job format, lifecycle and polling, partial-failure semantics, cost math vs inference serving, the Harbor Analytics refactor, a technique decision table, pitfalls, and a production checklist.

Sync realtime vs async batch: a simple taxonomy

Most tutorials assume every LLM call is interactive chat. Production workloads split cleanly into latency-sensitive and throughput-sensitive buckets:

Mode	Latency target	Typical use	Billing profile
Realtime sync / streaming	Sub-second to few seconds	Chat UIs, copilots, live agents	Full list price; rate limits bind
Provider async batch	Minutes to 24 hours	Backfills, evals, nightly reports, embeddings at scale	Discounted (often ~50%); separate quota pool
Self-hosted batch queue	Configurable SLA	On-prem vLLM/TGI overnight jobs, air-gapped	GPU amortization; you operate the queue

Batch APIs are not a replacement for user-facing chat. They are how you run the 80% of token volume that has no human staring at a spinner — classification at ingest, synthetic eval sets, document summarization for search indexes, and embedding regeneration after a model upgrade.

How provider batch jobs work

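Although surface APIs differ, the pattern is broadly consistent across OpenAI Batch, Anthropic Message Batches, and Google's batch prediction endpoints. The steps below follow the OpenAI flavor; Anthropic, for example, accepts the list of requests directly in the batch-creation call rather than through a separate file upload: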

Build a JSONL input file — one JSON object per line. Each line carries a stable custom_id (your primary key), the HTTP method and path (e.g. /v1/chat/completions), and the request body identical to what you would POST synchronously.
Upload the file via the Files API (purpose: batch).
Create a batch job referencing the input file ID and a completion window (OpenAI: 24h).
Poll, or subscribe to a webhook, until the status reaches a terminal state such as completed, failed, expired, or cancelled.
Download the output JSONL — each line maps custom_id to either a response object or an error object.
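
The polling step above can be sketched without tying it to any one SDK. The helper below is a minimal illustration, not a provider API: `wait_for_batch`, its parameters, and the set of terminal statuses are assumptions, and `fetch_status` is whatever callable returns the current status string for your job.

```python
import time
from typing import Callable

# Terminal states assumed for this sketch; check your provider's status enum.
TERMINAL_STATUSES = {"completed", "failed", "expired", "cancelled"}


def wait_for_batch(
    fetch_status: Callable[[], str],
    initial_delay: float = 30.0,
    max_delay: float = 900.0,
    timeout: float = 24 * 3600,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll until the batch reaches a terminal status, doubling the delay each round."""
    delay = initial_delay
    waited = 0.0
    while True:
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        if waited >= timeout:
            raise TimeoutError(f"batch still {status!r} after {waited:.0f}s")
        sleep(delay)
        waited += delay
        # Back off exponentially, but never wait longer than max_delay.
        delay = min(delay * 2, max_delay)
```

With the OpenAI Python SDK, `fetch_status` could be something like `lambda: client.batches.retrieve(batch_id).status`. The `sleep` parameter is injectable so the loop can be unit-tested without real waiting.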

Example input line (OpenAI-style):

{"custom_id":"txn-88421","method":"POST","url":"/v1/chat/completions","body":{"model":"gpt-4o-mini","messages":[{"role":"user","content":"Classify risk: ..."}],"max_tokens":64}}

The output line preserves custom_id so you can join results back to your warehouse without relying on row order. Store the batch job ID and input file hash in your job metadata table for audit replay.
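
A small sketch of both ends of that round trip: serializing requests in the OpenAI-style shape shown above, and indexing the output by custom_id instead of by row order. The function names are illustrative, and the output record shape (a custom_id plus either a response or an error) is assumed from the description above.

```python
import json
from typing import Iterable


def build_request_line(custom_id: str, prompt: str, model: str = "gpt-4o-mini") -> str:
    """Serialize one request as a single JSONL line."""
    request = {
        "custom_id": custom_id,
        "method": "POST",
        "url": "/v1/chat/completions",
        "body": {
            "model": model,
            "messages": [{"role": "user", "content": prompt}],
            "max_tokens": 64,
        },
    }
    # json.dumps escapes embedded newlines, so each record stays on one line.
    return json.dumps(request, separators=(",", ":"))


def index_by_custom_id(output_lines: Iterable[str]) -> dict[str, dict]:
    """Map custom_id to its parsed output record, skipping blank lines."""
    results: dict[str, dict] = {}
    for line in output_lines:
        if line.strip():
            record = json.loads(line)
            results[record["custom_id"]] = record
    return results
```

Joining through the dictionary rather than zipping input and output files keeps the pipeline correct even when the provider returns lines in a different order.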

Cost, quota, and throughput economics

Batch pricing is the main reason teams adopt the pattern. OpenAI and Anthropic publish batch rates at roughly half of synchronous inference for the same model tier. On a nightly job that runs to many millions of tokens, halving the bill is not a rounding error; the savings can help fund realtime traffic elsewhere.

Separate rate-limit pools — batch jobs do not compete with your production chat quota the same way parallel sync workers do.
No worker-pool overhead — you stop paying engineering time to tune concurrency, backoff, and 429 storms on backfills.
Opportunity cost of delay — if results must land within five minutes, batch is wrong regardless of discount.
Storage and orchestration — JSONL files, S3 staging, and job schedulers add minor infra cost; usually dwarfed by token savings above ~100k tokens per job.

Pair batch with cost optimization tactics: use smaller models in batch for classification, reserve frontier models for the 5% of rows that fail confidence thresholds on the cheap tier.
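
The arithmetic behind that tiering can be sketched as follows. This is a rough model under stated assumptions: `Pricing`, `job_cost`, and `tiered_cost` are hypothetical names, prices are supplied by the caller rather than taken from any provider's price list, and the 5% escalation rate simply mirrors the figure in the paragraph above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Per-million-token list prices in USD, supplied by the caller."""
    input_per_million: float
    output_per_million: float


def job_cost(input_tokens: int, output_tokens: int, pricing: Pricing, discount: float = 0.0) -> float:
    """Cost of a job in USD; discount=0.5 models a 50% batch discount."""
    list_price = (
        input_tokens / 1_000_000 * pricing.input_per_million
        + output_tokens / 1_000_000 * pricing.output_per_million
    )
    return list_price * (1.0 - discount)


def tiered_cost(
    rows: int,
    input_tokens_per_row: int,
    output_tokens_per_row: int,
    cheap: Pricing,
    frontier: Pricing,
    escalation_rate: float = 0.05,
    discount: float = 0.5,
) -> float:
    """Every row runs on the cheap tier; the escalated fraction is re-run on the frontier tier."""
    escalated_rows = round(rows * escalation_rate)
    first_pass = job_cost(rows * input_tokens_per_row, rows * output_tokens_per_row, cheap, discount)
    second_pass = job_cost(
        escalated_rows * input_tokens_per_row,
        escalated_rows * output_tokens_per_row,
        frontier,
        discount,
    )
    return first_pass + second_pass
```

Comparing `job_cost(..., discount=0.0)` against `job_cost(..., discount=0.5)` for the same token counts gives the sync-versus-batch delta; `tiered_cost` adds the second pass so the escalation overhead is visible before committing to the design.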

Partial failures, idempotency, and recovery

Batch jobs complete with partial success — some lines succeed, others return provider errors (context length exceeded, content policy, transient 5xx wrapped per line). Your pipeline must handle this explicitly:

Per-line error objects — never assume all-or-nothing; parse both response and error branches.
Idempotent custom_id — use deterministic IDs (doc_id + prompt_version) so re-submitted recovery batches do not duplicate downstream writes.
Failed-line requeue — extract errored lines into a smaller JSONL and submit a child batch; cap recursion depth.
Expired batches — if the completion window passes before processing finishes, the job may expire; monitor and resubmit unprocessed IDs.
Validation before upload — schema-check JSONL locally; one malformed line can fail file ingestion.
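
A minimal sketch of the per-line handling and requeue steps listed above. It assumes an OpenAI-style output record in which each line has a custom_id plus a `response` object (with a `status_code`) or an `error` object; the helper names and the depth cap of 2 are illustrative choices, not provider behavior.

```python
import json

MAX_RETRY_DEPTH = 2  # hypothetical cap on child-batch recursion


def partition_results(output_lines):
    """Split output records into (succeeded, failed) dicts keyed by custom_id."""
    succeeded, failed = {}, {}
    for line in output_lines:
        if not line.strip():
            continue
        record = json.loads(line)
        response = record.get("response") or {}
        # A line counts as failed if it carries an error object
        # or its per-line status code is anything other than 200.
        if record.get("error") or response.get("status_code") != 200:
            failed[record["custom_id"]] = record
        else:
            succeeded[record["custom_id"]] = record
    return succeeded, failed


def build_child_batch(input_lines, failed_ids, depth):
    """Return the original input lines for failed IDs, or [] once the depth cap is reached."""
    if depth >= MAX_RETRY_DEPTH:
        return []
    return [
        line
        for line in input_lines
        if line.strip() and json.loads(line)["custom_id"] in failed_ids
    ]
```

Because the child batch reuses the original input lines unchanged, the custom_id values stay identical across attempts, which is what keeps downstream writes idempotent. Lines that fail for deterministic reasons (for example, context length exceeded) will fail again, so it is worth routing those to a dead-letter table rather than requeueing them.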

Log batch job ID, line counts (submitted / succeeded / failed), and wall-clock duration in your observability stack. Alert when failure rate exceeds baseline — often a prompt or model deprecation issue, not random noise.
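
One possible shape for that record and alert rule, as a sketch. `BatchStats`, `should_alert`, and the 2x tolerance are assumptions made for illustration; the baseline rate would come from your own historical runs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BatchStats:
    batch_id: str
    submitted: int
    succeeded: int
    failed: int
    duration_seconds: float

    @property
    def failure_rate(self) -> float:
        """Fraction of submitted lines that returned an error."""
        return self.failed / self.submitted if self.submitted else 0.0

    @property
    def unaccounted(self) -> int:
        """Lines with neither a success nor an error record, e.g. after an expired job."""
        return self.submitted - self.succeeded - self.failed


def should_alert(stats: BatchStats, baseline_rate: float, tolerance: float = 2.0) -> bool:
    """Alert when failures exceed tolerance x baseline, or when any lines are missing."""
    return stats.failure_rate > baseline_rate * tolerance or stats.unaccounted > 0
```

Tracking unaccounted lines separately from failed lines helps distinguish a prompt or model problem (many explicit errors) from an expired or truncated job (lines that never came back).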

Harbor Analytics refactor (worked example)

Before batch adoption, Harbor's weekend backfill used Celery workers calling sync completions with token-bucket rate limiting shared with the live fraud dashboard. Dashboard latency spiked whenever a backfill ran.

After refactor:

Friday 18:00 UTC — Spark job exports pending transactions to JSONL on object storage (4.2 GB, 1.8M lines).
Validation step checks schema and token estimates, and dedupes on the custom_id (txn_id:v3).
Batch job created; webhook on completion posts to an internal queue.
Saturday morning — output JSONL streamed into BigQuery via load job; failed lines (0.3%) requeued in a 12k-line child batch.
Monday 07:00 — analysts see refreshed scores; live API quota untouched.
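
The validation step in this pipeline is not shown in detail, so the following is only a sketch of what such a check might look like. The required keys follow the OpenAI-style input line; the function name, the 8,000-token ceiling, and the four-characters-per-token estimate are assumptions (a real pipeline would use the model's tokenizer).

```python
import json

REQUIRED_KEYS = {"custom_id", "method", "url", "body"}


def validate_jsonl(lines, max_estimated_tokens=8000):
    """Return (valid_lines, problems); problems is a list of (line_number, reason)."""
    valid, problems, seen = [], [], set()
    for number, line in enumerate(lines, start=1):
        try:
            record = json.loads(line)
        except json.JSONDecodeError as exc:
            problems.append((number, f"invalid JSON: {exc.msg}"))
            continue
        if not isinstance(record, dict) or not REQUIRED_KEYS.issubset(record):
            problems.append((number, "missing required keys"))
            continue
        if not isinstance(record["custom_id"], str):
            problems.append((number, "custom_id must be a string"))
            continue
        if record["custom_id"] in seen:
            problems.append((number, "duplicate custom_id"))
            continue
        # Crude heuristic: about 4 characters per token for English text.
        estimated_tokens = len(json.dumps(record["body"])) // 4
        if estimated_tokens > max_estimated_tokens:
            problems.append((number, f"estimated {estimated_tokens} tokens exceeds limit"))
            continue
        seen.add(record["custom_id"])
        valid.append(line)
    return valid, problems
```

Returning the problems instead of raising on the first one lets the export job report every bad line in a single pass, which matters when the file has millions of rows.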

Key design choice: classification prompt pinned to prompt_hash=9f2a in the custom_id so a mid-week prompt change does not silently mix scoring versions in one batch.
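
How the hash is derived is not stated here, so the snippet below is one plausible way to build such an ID, assuming a truncated SHA-256 of the prompt template. The function names and the `txn_id:hash` layout are hypothetical.

```python
import hashlib


def prompt_hash(prompt_template: str, length: int = 4) -> str:
    """Short, deterministic fingerprint of the prompt template text."""
    digest = hashlib.sha256(prompt_template.encode("utf-8")).hexdigest()
    return digest[:length]


def make_custom_id(txn_id: str, prompt_template: str) -> str:
    """Combine the entity key with the prompt fingerprint, e.g. 'txn-88421:<hash>'."""
    return f"{txn_id}:{prompt_hash(prompt_template)}"
```

The same transaction and template always produce the same ID, so a resubmitted line overwrites rather than duplicates its earlier result, while any edit to the template changes the suffix and keeps scoring versions separable downstream. Four hex characters leave only 65,536 possible values, so a longer prefix is safer if many prompt versions coexist.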

Technique decision table

Approach	Best when	Weak when
Provider async batch API	Large offline jobs, roughly 50% cost savings, SLA in hours not seconds	Interactive UX, tool loops needing immediate follow-ups
Sync + worker pool	Medium jobs needing results within minutes, provider has no batch tier	Millions of rows; competes with production rate limits
Self-hosted continuous batching (vLLM)	Data residency, custom models, predictable GPU ownership	Small sporadic jobs where GPU idle time dominates cost
Streaming realtime API	Chat, copilots, agent turns with human in the loop	Overnight ETL where nobody reads tokens live
Cached / rules fallback	High-repeat queries with stable answers	Novel per-row content like transaction narratives
Map-reduce chunking in batch	Summarizing corpora longer than context window	Simple single-shot classification per row

Common pitfalls

Treating batch like sync with sleep — polling every second wastes API calls; use webhooks or exponential poll intervals.
Unstable custom_id — random UUIDs per submit make recovery joins impossible.
Mixing prompt versions in one job — embed version in ID or metadata; downstream metrics become meaningless.
Ignoring per-line errors — assuming a completed batch status means every row succeeded.
Oversized single files — split into chunks (e.g. 50k lines) for easier retry and parallel batch jobs.
Missing deadline buffer — submitting at hour 23 of a 24h window risks expiration on provider backlog.
PII in plaintext JSONL — encrypt at rest, restrict file ACLs, delete inputs after successful load.
No cost cap — estimate tokens before upload; abort if projection exceeds budget.
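
Two of the pitfalls above, oversized files and missing cost caps, lend themselves to small guards. The sketch below is illustrative only: `shard_lines`, `enforce_budget`, and `BudgetExceeded` are hypothetical names, the 50,000-line default echoes the example figure in the list, and the blended per-million-token price is something the caller must supply.

```python
from itertools import islice


def shard_lines(lines, shard_size=50_000):
    """Yield lists of at most shard_size lines, consuming the input lazily."""
    iterator = iter(lines)
    # islice returns an empty list once the iterator is exhausted, ending the loop.
    while shard := list(islice(iterator, shard_size)):
        yield shard


class BudgetExceeded(RuntimeError):
    """Raised when the projected spend is above the configured cap."""


def enforce_budget(estimated_tokens, usd_per_million_tokens, budget_usd):
    """Return the projected cost in USD, or raise before anything is uploaded."""
    projected = estimated_tokens / 1_000_000 * usd_per_million_tokens
    if projected > budget_usd:
        raise BudgetExceeded(f"projected ${projected:,.2f} exceeds budget ${budget_usd:,.2f}")
    return projected
```

Passing an open file handle to `shard_lines` streams the export shard by shard, so a multi-gigabyte JSONL never has to fit in memory, and each shard can be submitted and retried as an independent batch job.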

Production checklist

Classify workloads: interactive (sync) vs deferrable (batch) before writing code.
Use deterministic custom_id including entity ID and prompt/model version.
Validate JSONL schema and token estimates locally before upload.
Split very large jobs into shard files with independent batch IDs.
Configure webhook or polled status handler with idempotent processing.
Parse per-line success and error; requeue failures into child batches.
Log batch ID, counts, duration, and token spend to observability backend.
Alert on failure-rate spikes and expired batch jobs.
Isolate batch quota from realtime production traffic.
Delete input/output files from provider storage after durable ingest.
Document SLA: “results within N hours” for stakeholder expectations.

Key takeaways

Most token volume in mature LLM products is offline — batch APIs exist for that shape.
~50% cost discounts and separate quotas often beat hand-tuned sync worker pools on backfills.
JSONL + custom_id is your join key — design it for idempotent recovery, not convenience.
Partial per-line failure is normal — plan child batches and metrics, not all-or-nothing assumptions.
Keep realtime endpoints for humans; route everything else through batch or self-hosted overnight queues.