Guide
LLM parallel tool calling explained
Harbor Support's refund triage agent answered “Where is my order and can I
get a refund?” by calling get_order, waiting 2.8 seconds, then
get_payment, then get_shipment_status, then
check_refund_eligibility — four round trips through the model and
four sequential HTTP calls. Median ticket latency hit 9.2 seconds. Customers abandoned
chat; CSAT dropped 14 points. The model was fine. The orchestrator treated every tool
as if the next one depended on the last, when three of four lookups only needed the
order ID already parsed from the user message.
Parallel tool calling lets an agent issue multiple function calls in one model turn and execute them concurrently when inputs do not depend on each other's outputs. Modern APIs expose this natively; even without it, your runtime can batch independent calls between ReAct steps. This guide covers independence taxonomy, provider capabilities, DAG-based scheduling, bounded concurrency and rate limits, merging partial results back into observations, integration with tool error handling, the Harbor Support refactor, a technique decision table versus serial ReAct loops, pitfalls, and a production checklist.
Independence taxonomy
Not every multi-tool turn should run in parallel. Classify calls before you fan out.
| Class | Definition | Execution |
|---|---|---|
| Fully independent | All arguments known before any tool runs | Parallel — e.g. fetch weather in three cities, load three order IDs |
| Read-only fan-out | Same parent ID, different endpoints | Parallel reads — order + payment + shipment by order_id |
| Soft dependency | Later call benefits from earlier data but has a default | Parallel with fallback args, or speculatively parallel then discard |
| Hard dependency | Later call requires output of earlier call | Serial — create_refund needs payment_id from get_payment |
| Write chain | Mutating calls where order matters | Serial with idempotency keys — never parallelize unguarded writes |
A common mistake is serializing read-only fan-out because the ReAct template shows one action per thought. The loop is a control pattern, not a latency requirement.
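The taxonomy above could be applied mechanically before fan-out. The following is a minimal sketch, assuming each planned call declares which other calls' outputs it requires and whether it mutates state; `PlannedCall`, `needs`, `mutates`, and `plan_waves` are hypothetical names introduced for illustration, not part of any provider SDK.

```python
from dataclasses import dataclass, field


@dataclass
class PlannedCall:
    name: str
    # Names of calls whose output this call requires (hard dependencies).
    needs: set[str] = field(default_factory=set)
    # True for writes; writes are never placed in a wave with other calls.
    mutates: bool = False


def plan_waves(calls: list[PlannedCall]) -> list[list[str]]:
    """Group calls into waves; calls inside one wave may run concurrently.

    Reads whose dependencies are satisfied share a wave. Each write gets
    a wave of its own so mutating calls never run side by side.
    """
    done: set[str] = set()
    pending = list(calls)
    waves: list[list[str]] = []
    while pending:
        ready = [c for c in pending if c.needs <= done]
        if not ready:
            raise ValueError("cyclic or unsatisfiable dependencies")
        reads = [c for c in ready if not c.mutates]
        # Prefer the read fan-out; otherwise release exactly one write.
        wave = reads if reads else ready[:1]
        waves.append([c.name for c in wave])
        done |= {c.name for c in wave}
        pending = [c for c in pending if c.name not in done]
    return waves
```

With three reads keyed by the order ID plus a `create_refund` that needs `get_payment`, this yields one parallel wave of reads followed by a single serial write. Soft dependencies are not modeled here; they would need a per-call default or a speculative path.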
Provider and runtime capabilities
Native parallel function calling
OpenAI-compatible APIs allow the model to return multiple tool_calls in
a single assistant message. Your runtime executes them and submits all
tool role messages before the next completion. Anthropic's Messages
API similarly supports multiple tool_use blocks per turn. Enable this in
your SDK — do not truncate to the first call unless policy requires it.
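As a rough illustration of the runtime side, the sketch below executes every call from one assistant turn concurrently and builds one tool role message per call. It assumes the assistant message is a plain dict in the OpenAI-style shape (a `tool_calls` list whose entries carry `id`, `function.name`, and a JSON string in `function.arguments`); the `tools` registry and `run_tool_calls` are hypothetical, and no SDK call is made.

```python
import asyncio
import json
from typing import Any, Awaitable, Callable

ToolFn = Callable[..., Awaitable[Any]]


async def run_tool_calls(assistant_message: dict, tools: dict[str, ToolFn]) -> list[dict]:
    """Run every tool call in one assistant turn concurrently.

    Returns one tool role message per call, in the order the model
    emitted them, each tagged with the originating tool_call_id.
    """
    calls = assistant_message.get("tool_calls") or []

    async def run_one(call: dict) -> dict:
        fn = tools[call["function"]["name"]]
        args = json.loads(call["function"]["arguments"])
        result = await fn(**args)
        return {
            "role": "tool",
            "tool_call_id": call["id"],
            "content": json.dumps(result),
        }

    # gather preserves input order, so results line up with `calls`.
    return await asyncio.gather(*(run_one(c) for c in calls))
```

The returned messages would be appended to the conversation before the next completion request. Error handling, timeouts, and concurrency caps are deliberately left out of this sketch.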
Orchestrator-side batching
Even when the model emits one call per turn, your planner can detect patterns:
if the model requests get_order and the system prompt says refunds need
payment and shipment data, prefetch the sibling reads while the model “thinks.”
Plan-and-execute
architectures often emit a structured step list where independent steps are
explicitly marked parallel_group.
Static DAG from intent
For high-volume flows (refund triage, KYC checks), hard-code a DAG: known
independent nodes run under asyncio.gather or Promise.all
with a concurrency cap. The LLM fills slot values; the graph defines topology.
This approach reduces token spend and eliminates an entire model turn.
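One way such a hard-coded graph might look is sketched below, using the standard library's `graphlib.TopologicalSorter` with `asyncio.gather` and a semaphore as the concurrency cap. The edges in `REFUND_DAG` are an assumption made for illustration (the real dependencies of `check_refund_eligibility` are not specified here), and `run_dag` and the node function signature are hypothetical.

```python
import asyncio
from graphlib import TopologicalSorter
from typing import Any, Awaitable, Callable

# node -> set of nodes that must finish first (assumed topology).
REFUND_DAG: dict[str, set[str]] = {
    "get_order": set(),
    "get_payment": set(),
    "get_shipment_status": set(),
    "check_refund_eligibility": {"get_order", "get_payment", "get_shipment_status"},
}

# Each node receives the results gathered so far, including slot values.
NodeFn = Callable[[dict[str, Any]], Awaitable[Any]]


async def run_dag(dag: dict[str, set[str]], nodes: dict[str, NodeFn],
                  slots: dict[str, Any], limit: int = 3) -> dict[str, Any]:
    """Run the graph wave by wave, at most `limit` nodes at a time."""
    sorter = TopologicalSorter(dag)
    sorter.prepare()  # raises graphlib.CycleError on a cyclic graph
    sem = asyncio.Semaphore(limit)
    results: dict[str, Any] = dict(slots)  # slot values filled by the LLM

    async def run_node(name: str) -> None:
        async with sem:
            results[name] = await nodes[name](results)

    while sorter.is_active():
        ready = sorter.get_ready()  # every node whose predecessors are done
        await asyncio.gather(*(run_node(n) for n in ready))
        sorter.done(*ready)
    return results
```

This version waits for a whole wave to finish before releasing the next one, which is simple to reason about but leaves some latency on the table when one sibling is slow and a downstream node only needs the fast ones.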
Scheduling and concurrency limits
Unbounded parallelism creates new failures:
- Downstream rate limits — five parallel calls can trigger 429s that serial execution would not. Use per-tenant token buckets and respect quota policies on both LLM and tool APIs.
- Connection pool exhaustion — HTTP clients default to small pools; raise max connections or use a dedicated pool per tool service.
- Thundering herd on cold caches — parallel identical reads are fine; parallel writes to the same aggregate root are not.
- Timeout composition — parallel latency is max(t_i), not sum(t_i), but tail latency follows the slowest sibling. Set per-call deadlines and cancel siblings only when the user-facing answer cannot be partial.
Practical default: concurrency limit of 3–8 for read fan-out, 1 for writes unless tools are provably idempotent and target different resources. Log p50/p95 per tool and per parallel batch size.
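The cap and per-call deadlines described above can be sketched with `asyncio.Semaphore` and `asyncio.wait_for`. This is an illustrative sketch, not a complete scheduler: `bounded_batch` and the result shape are hypothetical, the defaults simply echo numbers used in this guide, and rate limiting (token buckets) is not shown.

```python
import asyncio
from typing import Any, Awaitable, Callable


async def bounded_batch(calls: dict[str, Callable[[], Awaitable[Any]]],
                        limit: int = 3,
                        deadline_s: float = 2.5) -> dict[str, dict]:
    """Run named calls with a concurrency cap and a deadline per call.

    A slow or failing call yields its own tagged result instead of
    cancelling or failing its siblings.
    """
    sem = asyncio.Semaphore(limit)

    async def run_one(name: str, make_call: Callable[[], Awaitable[Any]]):
        async with sem:
            try:
                # The deadline starts once the call holds a slot,
                # so time spent queued behind the cap is not counted.
                data = await asyncio.wait_for(make_call(), timeout=deadline_s)
                return name, {"status": "ok", "data": data}
            except asyncio.TimeoutError:
                return name, {"status": "timeout"}
            except Exception as exc:
                return name, {"status": "error", "error": str(exc)}

    pairs = await asyncio.gather(*(run_one(n, c) for n, c in calls.items()))
    return dict(pairs)
```

Because each call catches its own failure, the batch latency is bounded by roughly the deadline times the number of waves the cap forces, rather than by the slowest upstream service.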
Merging observations
After parallel execution, the model needs a coherent observation block:
- Preserve call identity — each result maps to tool_call_id (OpenAI) or tool_use_id (Anthropic). Never reorder or merge JSON blobs ambiguously.
- Structured envelopes — wrap successes and failures in a consistent schema: { "status": "ok"|"error", "data": ..., "error_code": ... }. Same pattern as tool error handling.
- Partial success policy — if two of three reads succeed, pass all three observations; let the model ask the user or retry failed legs. Do not fail the whole batch silently.
- Token budget — parallel calls can return large payloads. Truncate or summarize per tool before inserting into context; link to full records by ID when needed.
For human-readable logs, emit a single “batch summary” line in observability traces: batch_id, tools, durations, outcome counts.
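The merging rules above might be implemented along these lines. The sketch assumes outcomes arrive as a dict keyed by call ID, with exceptions standing in for failed legs; `to_observation`, `merge_batch`, the `TOOL_ERROR` code, and the 2000-character limit are hypothetical choices, and the size guard is deliberately crude (a real system would summarize per tool).

```python
import json
from typing import Any


def to_observation(tool_call_id: str, outcome: Any, max_chars: int = 2000) -> dict:
    """Wrap one tool outcome in the envelope and key it by its call ID."""
    if isinstance(outcome, Exception):
        # TOOL_ERROR is a placeholder code for any non-timeout failure.
        code = "TOOL_TIMEOUT" if isinstance(outcome, TimeoutError) else "TOOL_ERROR"
        envelope = {"status": "error", "data": None, "error_code": code}
    else:
        envelope = {"status": "ok", "data": outcome, "error_code": None}
    content = json.dumps(envelope, default=str)
    if len(content) > max_chars:
        # Crude size guard: swap the oversized payload for a marker.
        envelope["data"] = {"truncated": True, "original_chars": len(content)}
        content = json.dumps(envelope)
    return {"role": "tool", "tool_call_id": tool_call_id, "content": content}


def merge_batch(outcomes: dict[str, Any]) -> list[dict]:
    """One observation per call ID; failed legs are kept, not dropped."""
    return [to_observation(call_id, out) for call_id, out in outcomes.items()]
```

Keeping the failed leg in the list is what lets the model decide whether to retry it or ask the user, instead of silently reasoning over two-thirds of the context.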
When to stay serial
Parallelism is not free: it adds scheduling and debugging complexity. Prefer serial ReAct when:
- The next tool name or arguments genuinely depend on parsing the previous observation.
- Tools are mutating and lack idempotency keys (payments, inventory decrements).
- Debugging cost outweighs latency savings — early prototypes, low-volume internal tools.
- The model confuses parallel results — smaller models sometimes hallucinate cross-tool joins; A/B test before enabling native multi-call.
- Compliance requires step-by-step audit trails with human-readable reasoning between writes.
Harbor Support triage refactor
Harbor Support rebuilt the refund intake path:
- Enabled native multi-tool_call completions on GPT-4o-class models; system prompt explicitly allows multiple reads per turn when order ID is known.
- Added a static DAG for “order status + refund eligibility” intents: get_order, get_payment, get_shipment_status run in parallel_group: refund_context with concurrency 3.
- Introduced per-tool 2.5s deadlines; slow shipment API no longer blocks payment data from reaching the model (partial observation with status: timeout).
- Structured observation envelopes unified with the existing error taxonomy — partial batch failures surface as retriable TOOL_TIMEOUT, not raw HTML.
- Dashboards split “model latency” vs “tool batch latency”; caught a regression when a deploy accidentally serialized reads again.
Median chat latency fell from 9.2s to 2.1s; CSAT recovered within two weeks. Token use dropped ~18% because the model skipped two intermediate turns that only existed to request the next read.
Technique decision table
| Scenario | Serial ReAct | Parallel tool calling |
|---|---|---|
| Multi-read context gathering (order + payment + ship) | Simple but slow | Preferred — large latency win |
| Chained writes (create ticket then attach file) | Required | Do not parallelize |
| Exploratory agent with unknown next step | Natural fit | Risky — model may over-call |
| High-QPS support bot with fixed intents | Wastes turns | Preferred — DAG + parallel reads |
| Rate-limited partner API (strict 1 RPS) | Safer default | Only with explicit throttle queue |
| Small local model, weak multi-call parsing | More reliable | Test carefully; orchestrator batching may beat native |
Common pitfalls
- Parallelizing dependent calls. Passing empty payment_id because get_payment has not finished creates subtle bugs.
- Ignoring partial failures. One 500 should not poison the whole batch without explicit error objects per call.
- Duplicate writes. Two parallel create_refund calls double-refund; use idempotency keys and serialize mutators.
- Observation ordering bugs. Mismatched tool_call_id mapping makes the model cite wrong data.
- Unbounded fan-out. “Check all 200 SKUs” parallelized melts inventory API; batch or paginate.
- Latency illusion in traces. Parallel tool time hides inside one model turn — instrument batch duration separately.
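To make the duplicate-write pitfall concrete, here is a minimal in-process sketch of an idempotency guard: concurrent requests for the same tool and arguments share a single execution. `IdempotentExecutor` and its key derivation are hypothetical; the sketch assumes one event loop and keeps results forever, whereas a production system would more likely rely on a shared store and idempotency keys honored by the downstream API.

```python
import asyncio
import hashlib
import json
from typing import Any, Awaitable, Callable


class IdempotentExecutor:
    """Run each distinct write at most once, even under concurrent callers."""

    def __init__(self) -> None:
        self._tasks: dict[str, asyncio.Future] = {}

    @staticmethod
    def key_for(tool: str, args: dict) -> str:
        # Stable key: same tool and same arguments give the same digest.
        payload = json.dumps({"tool": tool, "args": args}, sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    async def run(self, tool: str, args: dict,
                  fn: Callable[..., Awaitable[Any]]) -> Any:
        key = self.key_for(tool, args)
        task = self._tasks.get(key)
        if task is None:
            # No await between the lookup and the insert, so two coroutines
            # on one event loop cannot both start the same write.
            task = asyncio.ensure_future(fn(**args))
            self._tasks[key] = task
        return await task
```

If the model emits two identical `create_refund` calls in one turn, both callers receive the result of the single underlying execution. A failed write stays cached in this sketch, so retry policy would need separate handling.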
Production checklist
- Independence classifier or DAG spec before each multi-tool batch.
- Native multi-tool_call enabled where the model supports it.
- Concurrency cap and per-tool rate limits configured.
- Per-call timeouts with structured timeout observations.
- Partial-success policy documented and tested.
- tool_call_id preserved through execution and logging.
- Idempotency keys on all parallel-safe writes; writes never unguarded in parallel.
- Observation payload size limits or summarization per tool.
- Traces include batch_id, tool list, individual durations, outcome counts.
- A/B or canary on latency and error rate when enabling parallelism.
- Integration tests for 0/1/N success combinations in a batch.
- Runbook for disabling parallel mode via feature flag under incident.
Key takeaways
- Most agent slowness is serial I/O, not model inference. Parallel reads are the cheapest latency win.
- Classify independence before you fan out. Hard dependencies and write chains stay serial.
- Providers already support multi-call turns. Your orchestrator must execute and observe correctly.
- Partial failure is normal. Per-tool envelopes beat all-or-nothing batches.
- Measure batch latency separately. Otherwise regressions hide inside “one turn.”
Related reading
- LLM function calling explained — schemas, APIs and tool registration
- LLM ReAct agent loop explained — thought-action-observation control flow
- LLM tool error handling explained — structured failures and retries
- LLM plan-and-execute explained — planners that emit parallel step groups