
LLM agent parallel tool execution explained

Harbor DevOps shipped an on-call incident triage agent that, on every page, called eight read-only integrations in strict sequence: PagerDuty incident details, Datadog metric snapshots, recent deploy events, feature-flag state, error-budget burn, linked Jira tickets, Slack thread excerpts, and a runbook lookup. Each call averaged 5–7 seconds. The model often emitted all eight tool requests in a single turn — but the runtime executed them one after another. Median step latency was 47 seconds; on-call engineers clicked away before the summary arrived. SLA target was 30 seconds.

Parallel tool execution is the orchestration layer that decides which tool calls from one model turn can run concurrently, which must wait on predecessors, and how to merge partial failures back into the conversation. It sits between function calling (what the model requests) and tool error handling (what happens when a call fails). Harbor replaced serial execution with a dependency-aware batch executor, per-integration concurrency caps, and structured partial-result envelopes. p95 step latency fell to 12 seconds with the same model and prompts; timeout-driven abandon rate dropped from 23% to 4%. This guide covers when parallelism is safe, DAG scheduling, provider APIs, rate-limit interaction, cancellation propagation, the Harbor DevOps refactor, a technique decision table, pitfalls, and a production checklist.

Serial vs parallel in one model turn

Modern chat APIs let the model return multiple tool calls in a single assistant message. That is not parallelism by itself — your runtime must choose execution strategy:

  • Serial — execute call A, append result, execute B. Simple, predictable, and mandatory when B's arguments depend on A's output.
  • Parallel batch — execute independent calls concurrently, wait for the slowest, return all observations in one tool-result message batch.
  • Staged waves — run wave 1 in parallel, merge results, run wave 2 that depends on wave 1, repeat until the DAG is drained.

Harbor's bug was treating “multiple tool calls in one turn” as implicitly parallel at the API level while implementing a serial for-loop in the worker. The model correctly assumed independent reads could overlap; the platform wasted 35+ seconds per turn on wall-clock I/O that could have been hidden behind a single wait.
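The difference between the two worker loops can be sketched in a few lines. This is a minimal illustration, not Harbor's code: fake_tool and its latencies are hypothetical stand-ins for read-only integrations, and asyncio is assumed as the concurrency runtime.

```python
import asyncio
import time


async def fake_tool(name: str, seconds: float) -> str:
    """Hypothetical stand-in for one read-only integration call."""
    await asyncio.sleep(seconds)
    return f"{name}: ok"


async def run_serial(calls: list[tuple[str, float]]) -> list[str]:
    # One await per call: wall-clock time is the sum of all latencies.
    return [await fake_tool(name, secs) for name, secs in calls]


async def run_parallel(calls: list[tuple[str, float]]) -> list[str]:
    # gather() schedules every coroutine before waiting, so wall-clock
    # time is roughly that of the slowest call. Results keep input order.
    return await asyncio.gather(*(fake_tool(n, s) for n, s in calls))


async def timed(coro) -> tuple[list[str], float]:
    """Await a coroutine and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = await coro
    return result, time.perf_counter() - start
```

With eight calls of 5 to 7 seconds each, the serial loop costs roughly the sum (40 to 56 seconds), while the gathered version costs roughly the slowest single call.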

Read vs write safety gates

Not every independent-looking call is safe to parallelize. Classify tools before enabling batch mode:

Usually safe in parallel

  • Idempotent GET-style reads (metrics, search, document fetch).
  • Read-only SQL SELECT against stable snapshots.
  • Embedding or classification calls with no shared mutable state.

Require ordering or exclusivity

  • Creates followed by updates on the same resource ID.
  • Balance checks before debits; inventory reserve before ship.
  • File writes to the same path; migration DDL in sequence.
  • Any tool marked mutating: true without idempotency keys.

Harbor tags every registered tool with side_effect: none | read | write and resource_key (e.g. order:{id}). The scheduler builds conflict sets: two writes to the same key never overlap; a write never runs parallel to anything touching the same key. Reads with distinct keys fan out freely up to the concurrency cap.
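One way to turn those tags into a schedule is a greedy grouping pass, sketched below. The ToolCall shape, the conflicts rule for calls without a declared key, and the plan_groups name are assumptions made for illustration; the source only specifies the side_effect and resource_key tags and the two conflict rules.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class ToolCall:
    call_id: str
    tool: str
    side_effect: str                     # "none" | "read" | "write"
    resource_key: Optional[str] = None   # e.g. "order:42"


def conflicts(a: ToolCall, b: ToolCall) -> bool:
    """True when the two calls must not overlap."""
    involves_write = "write" in (a.side_effect, b.side_effect)
    if a.resource_key is None or b.resource_key is None:
        # Assumption: with no declared key, be conservative and treat
        # any pair that involves a write as conflicting.
        return involves_write
    return involves_write and a.resource_key == b.resource_key


def plan_groups(calls: list[ToolCall]) -> list[list[ToolCall]]:
    """Place each call in the first group after its last conflict.

    Calls inside one group may run concurrently; groups run in order,
    so conflicting calls keep the order in which the model emitted them.
    """
    groups: list[list[ToolCall]] = []
    for call in calls:
        last_conflict = -1
        for index, group in enumerate(groups):
            if any(conflicts(call, other) for other in group):
                last_conflict = index
        target = last_conflict + 1
        if target == len(groups):
            groups.append([])
        groups[target].append(call)
    return groups
```

Reads on distinct keys all land in the first group; a write to order:1 is pushed behind any earlier read of order:1, and a later read of the same key waits behind that write.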

Dependency DAG scheduling

When the model emits calls with explicit or inferable dependencies, treat the turn as a directed acyclic graph:

  1. Parse tool calls and extract argument references to prior call outputs (some frameworks embed $call_1.order_id placeholders).
  2. Topological sort into waves; each wave runs in parallel.
  3. Inject merged observations before the next wave or before the next model turn.

If the model emits a cycle (call B references A, call A references B), reject the batch and return a structured error asking for a serial plan — do not guess execution order on mutating tools.
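The three steps and the cycle check can be sketched as follows. The placeholder syntax follows the $call_1.order_id form mentioned above; the regular expression, the function names, and the choice to reject references to unknown call IDs are assumptions for illustration.

```python
import json
import re

# Matches placeholders such as "$call_1.order_id" and captures "call_1".
PLACEHOLDER = re.compile(r"\$(call_\d+)\.[A-Za-z_]\w*")


def find_dependencies(calls: dict[str, dict]) -> dict[str, set[str]]:
    """Map each call ID to the call IDs its arguments reference."""
    deps: dict[str, set[str]] = {}
    for call_id, args in calls.items():
        refs = set(PLACEHOLDER.findall(json.dumps(args)))
        unknown = refs - set(calls)
        if unknown:
            raise ValueError(f"{call_id} references unknown calls: {sorted(unknown)}")
        deps[call_id] = refs
    return deps


def plan_waves(calls: dict[str, dict]) -> list[list[str]]:
    """Group call IDs into waves; every wave depends only on earlier waves."""
    remaining = find_dependencies(calls)
    done: set[str] = set()
    waves: list[list[str]] = []
    while remaining:
        ready = sorted(cid for cid, deps in remaining.items() if deps <= done)
        if not ready:
            # Nothing is runnable, so the leftover calls form a cycle.
            raise ValueError(f"dependency cycle among: {sorted(remaining)}")
        waves.append(ready)
        done.update(ready)
        for cid in ready:
            del remaining[cid]
    return waves
```

A caller would catch the ValueError and return it to the model as a structured error rather than picking an order on its behalf.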

For Harbor triage, wave 1 parallelized all eight reads. A follow-up turn sometimes called post_slack_summary (write) only after wave 1 completed — enforced by the write gate, not by hope.


Provider APIs and runtime hooks

OpenAI-compatible APIs return multiple tool_calls in one assistant message; you respond with a list of tool role messages, one per call ID. Anthropic follows the same batch pattern: the assistant turn carries several tool_use blocks, and you answer with one tool_result block per tool_use ID. The provider does not execute tools for you; parallelism is entirely in your executor.
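A minimal sketch of the reply side for the OpenAI-style message shape, assuming the assistant message has already been parsed into a dict. The two tools and the TOOLS registry are hypothetical; only the tool_calls layout and the role, tool_call_id, and content fields of the reply follow the chat-completions message format.

```python
import asyncio
import json


async def lookup_runbook(service: str) -> dict:
    """Hypothetical read-only tool."""
    return {"service": service, "runbook": "restart-workers"}


async def get_flag_state(flag: str) -> dict:
    """Hypothetical read-only tool."""
    return {"flag": flag, "enabled": True}


# Hypothetical registry mapping tool names to async callables.
TOOLS = {"lookup_runbook": lookup_runbook, "get_flag_state": get_flag_state}


async def answer_tool_calls(assistant_message: dict) -> list[dict]:
    """Run every requested call concurrently; return one tool message per call ID."""
    calls = assistant_message["tool_calls"]

    async def run(call: dict) -> dict:
        fn = TOOLS[call["function"]["name"]]
        # Arguments arrive as a JSON-encoded string, not a dict.
        args = json.loads(call["function"]["arguments"])
        return await fn(**args)

    results = await asyncio.gather(*(run(call) for call in calls))
    return [
        {"role": "tool", "tool_call_id": call["id"], "content": json.dumps(result)}
        for call, result in zip(calls, results)
    ]
```

The replies are appended to the conversation in the same order as the requested calls, each keyed by its call ID, before the next model request.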

Implementation sketch:

  • Executor pool — asyncio gather, thread pool for sync SDKs, or a worker queue per integration.
  • Per-tool semaphores — cap concurrent calls to Stripe, GitHub, or internal DB (Harbor: Datadog max 3, Jira max 2).
  • Global turn budget — wall-clock cap for the whole batch; cancel pending futures when exceeded (ties into cancellation lifecycle).
  • Trace spans — one parent span per model turn, child spans per tool; overlap visible in traces.
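The semaphore and turn-budget items can be combined in one small executor. This is a sketch under assumptions: the run_turn signature, the default cap of 4 for integrations without an explicit limit, and the status dictionaries are all illustrative, and trace spans are left out.

```python
import asyncio
from typing import Awaitable, Callable

ToolFactory = Callable[[], Awaitable[object]]


def describe(task: asyncio.Task) -> dict:
    """Summarize one finished or cancelled task."""
    if task.cancelled():
        return {"status": "cancelled"}
    if task.exception() is not None:
        return {"status": "failed", "error": repr(task.exception())}
    return {"status": "done", "result": task.result()}


async def run_turn(
    calls: list[tuple[str, ToolFactory]],
    caps: dict[str, int],
    budget_s: float,
    default_cap: int = 4,  # assumption: cap for integrations not listed in caps
) -> list[dict]:
    """Run (integration, factory) pairs under per-integration caps and a turn budget."""
    if not calls:
        return []
    # Created inside the running loop so they bind to it on older Pythons.
    semaphores = {name: asyncio.Semaphore(limit) for name, limit in caps.items()}

    async def guarded(integration: str, factory: ToolFactory) -> object:
        sem = semaphores.setdefault(integration, asyncio.Semaphore(default_cap))
        async with sem:
            return await factory()

    tasks = [asyncio.create_task(guarded(name, factory)) for name, factory in calls]
    _, pending = await asyncio.wait(tasks, timeout=budget_s)
    for task in pending:
        task.cancel()  # budget exceeded: propagate cancellation to stragglers
    if pending:
        await asyncio.gather(*pending, return_exceptions=True)
    return [describe(task) for task in tasks]
```

Called with caps={"datadog": 3, "jira": 2}, a burst of twelve metric queries runs at most three at a time, and anything still in flight when the budget expires is cancelled instead of running on after the turn has ended.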

Partial failure and result merging

Parallel execution changes failure modes. In serial mode, the agent often stops at the first error. In parallel mode, three of five calls may succeed while two time out. Policy choices:

  • Fail-fast — cancel siblings on first hard failure. Use for tightly coupled financial steps.
  • Best-effort — return per-call status envelopes; let the model summarize with gaps. Harbor triage uses this for reads.
  • Retry subset — re-queue only failed calls with backoff; cap retries per turn to avoid doubling latency.
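For comparison with best-effort, a fail-fast policy might look like the sketch below, which uses asyncio.wait with FIRST_EXCEPTION. The fail_fast name and the decision to re-raise the first failure after cancelling siblings are assumptions, not a description of Harbor's executor.

```python
import asyncio
from typing import Awaitable, Callable


async def fail_fast(factories: list[Callable[[], Awaitable[object]]]) -> list[object]:
    """Run all calls concurrently; on the first failure, cancel the rest and re-raise."""
    if not factories:
        return []
    tasks = [asyncio.create_task(factory()) for factory in factories]
    # Returns as soon as any task raises, or when all tasks finish cleanly.
    done, pending = await asyncio.wait(tasks, return_when=asyncio.FIRST_EXCEPTION)
    failed = [t for t in done if not t.cancelled() and t.exception() is not None]
    if failed:
        for task in pending:
            task.cancel()
        if pending:
            # Let cancelled siblings unwind before surfacing the error.
            await asyncio.gather(*pending, return_exceptions=True)
        raise failed[0].exception()
    return [task.result() for task in tasks]
```

Cancellation only stops work that has not yet reached the upstream service, so mutating tools still need idempotency keys or compensation for calls that were already in flight.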

Return JSON observations that include ok, error_code, and latency_ms per call so the model does not hallucinate success. Pair with rate-limit queues when parallel bursts trigger 429s from upstream APIs.
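A best-effort envelope along those lines could be built as follows. The ok, error_code, and latency_ms fields come from the text; the call_id and data fields, the function names, and the per-call timeout parameter are assumptions added for the sketch.

```python
import asyncio
import time
from typing import Awaitable, Callable


async def call_with_envelope(
    call_id: str,
    factory: Callable[[], Awaitable[object]],
    timeout_s: float,
) -> dict:
    """Run one tool call and always return a status envelope, never raise."""
    start = time.perf_counter()
    envelope = {"call_id": call_id, "ok": False, "error_code": None, "data": None}
    try:
        envelope["data"] = await asyncio.wait_for(factory(), timeout=timeout_s)
        envelope["ok"] = True
    except asyncio.TimeoutError:
        envelope["error_code"] = "timeout"
    except Exception as exc:
        # Assumption: the exception class name doubles as the error code.
        envelope["error_code"] = type(exc).__name__
    envelope["latency_ms"] = round((time.perf_counter() - start) * 1000)
    return envelope


async def best_effort(
    calls: dict[str, Callable[[], Awaitable[object]]],
    timeout_s: float,
) -> list[dict]:
    """Run every call concurrently; failures become envelopes, not exceptions."""
    return await asyncio.gather(
        *(call_with_envelope(cid, factory, timeout_s) for cid, factory in calls.items())
    )
```

Because each envelope carries an explicit ok flag, the model can report "Jira lookup timed out" instead of silently summarizing as if all eight reads had returned.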

Harbor DevOps refactor walkthrough

The team changed four layers without touching prompts:

  1. Tool registry metadata — side-effect class and resource keys on all 42 tools.
  2. Batch executor — default parallel for side_effect: read with distinct keys; serial otherwise.
  3. Per-integration caps — stopped Datadog 429 storms when the model requested twelve metric queries at once.
  4. Observation envelope — unified multi-call response schema; partial failures surfaced explicitly.

p50 step latency: 47 s → 11 s. p95: 61 s → 12 s. Token usage unchanged. Cost attribution showed tool-time dominated runs — parallel I/O was the lever, not a larger model.

Technique decision table

Strategy | Best for | Weak when | Harbor-style signal
Serial execution | Mutating chains, debugging, low call count | Many slow read integrations per turn | Step latency > sum of tool p95s
Full parallel batch | Independent reads, enrichment gathers | Shared resource keys or rate limits | 8+ read calls, <3 s CPU work
Staged DAG waves | Mixed read-then-write flows | Model emits undeclared dependencies | Placeholder args reference prior calls
Speculative parallel | Predictable next reads (cache warming) | Write tools or costly billed APIs | Rare; Harbor disabled after cost spike
Provider-native batch APIs | Embedding batches, bulk export jobs | Interactive chat latency targets | Async job + poll pattern

Common pitfalls

  • Parallelizing writes with hidden dependencies — double charges, duplicate tickets, race-corrupted files.
  • No per-integration caps: a parallel burst trips vendor 429s, and once retries are counted the turn ends up slower than serial execution.
  • Ignoring cancellation — user stops chat but six HTTP requests keep running.
  • Unbounded observation size — five parallel 50 KB JSON blobs blow the context budget in one turn.
  • Assuming the model knows execution order — document parallelism in tool descriptions when reads are safe to batch.
  • Fail-fast on optional reads — one flaky metrics API blocks the whole triage summary.
  • Missing idempotency on retried parallel writes — retry storms after partial batch failure create duplicates.
  • No trace overlap visibility — teams optimize LLM latency while 80% of turn time is serial tool I/O.
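For the observation-size pitfall, one simple guard is to split a fixed character budget evenly across the calls in a turn, as sketched here. The even split, the character-based budget (rather than tokens), and the fit_observations name are assumptions; a production system might summarize or select fields instead.

```python
import json

TRUNCATION_MARKER = "...[truncated]"


def fit_observations(observations: dict[str, object], total_budget_chars: int) -> dict[str, str]:
    """Serialize each observation and cap it at an equal share of the budget."""
    per_call = total_budget_chars // max(len(observations), 1)
    fitted: dict[str, str] = {}
    for call_id, payload in observations.items():
        text = json.dumps(payload, separators=(",", ":"))
        if len(text) > per_call:
            keep = max(per_call - len(TRUNCATION_MARKER), 0)
            # The truncated text is no longer valid JSON; the marker
            # tells the model that data is missing.
            text = (text[:keep] + TRUNCATION_MARKER)[:per_call]
        fitted[call_id] = text
    return fitted
```

With five parallel 50 KB blobs and a 20,000-character budget, each observation is cut to 4,000 characters and ends with the marker, so the model can see that content was dropped.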

Engineer checklist

  • Tag every tool with side-effect class and optional resource key.
  • Default to parallel execution for independent reads; default to serial execution for writes.
  • Implement DAG waves when arguments reference prior call outputs.
  • Set per-tool and per-integration concurrency semaphores.
  • Attach a wall-clock budget per turn; propagate cancel to in-flight calls.
  • Return structured per-call status in batched tool results.
  • Choose fail-fast vs best-effort per tool category.
  • Log parallel fan-out and max overlap in traces.
  • Load-test parallel bursts against vendor rate limits.
  • Compress or truncate large parallel observations before the next model call.
  • Document in tool schemas when parallel calls are encouraged.
  • Re-benchmark step latency after adding new integrations.

Key takeaways

  • Multiple tool calls per turn are not parallel until your runtime makes them parallel.
  • Classify reads vs writes and resource keys before fan-out.
  • DAG waves handle read-then-write flows without giving up parallelism on independent steps.
  • Partial failure policies matter as much as happy-path latency.
  • Harbor cut p95 step latency 61 s → 12 s (median 47 s → 11 s) with executor changes alone; measure tool overlap in traces before buying a bigger model.
