Guide

LLM agent tool error handling and partial failure recovery explained

Harbor Integrations ships a CRM sync agent that updates contacts, creates tasks, and posts activity notes across Salesforce and HubSpot in a single run. The first production build passed raw HTTP status codes and Java stack traces back to the model as tool “results.” When three of five parallel writes failed with 403 Forbidden, the model summarized “sync complete” because two calls succeeded and the error blobs looked like noise. 34% of audited runs left customer records half-updated with no alert — support tickets blamed “the AI lied.” After engineering introduced structured tool error envelopes, explicit partial-batch semantics, and a recovery policy matrix coordinated with retry budgets and compensating rollbacks, silent-failure rate fell to 2.4% while mean steps-to-recovery dropped from 4.8 to 1.6.

Tool error handling in agent systems is not the same problem as HTTP retry or circuit breaking. Those layers decide whether to re-execute a call. This layer decides what the model sees after a tool returns — success, partial success, retriable failure, permanent failure, or ambiguous timeout — and which recovery actions the runtime may take without the model hallucinating a clean outcome. This guide covers error envelopes, batch semantics, injection into the observation stream, recovery policy selection, Harbor Integrations’ refactor, a technique decision table, pitfalls, and a production checklist.

Why tool errors differ from model and transport errors

Three failure planes stack in every agent run, and conflating them causes the silent-failure pattern Harbor Integrations hit:

  • Transport errors — TCP reset, 503, rate limit. Handled by retry and backoff before the model ever sees a result.
  • Tool execution errors — valid HTTP 200 with { "error": "contact_locked" }, or 404 because the record was deleted. The call “succeeded” at the wire layer but failed at the business layer.
  • Model reasoning errors — the model misreads a well-formed error envelope and claims success anyway. Caught by output validation and structured final-answer contracts.

Tool error handling sits squarely in the middle plane. Its job is to normalize every tool outcome into a schema the model can reason about, attach enough context for replanning, and prevent optimistic synthesis when any required step failed.

The structured error envelope

Never pass raw stack traces or empty strings to the model. Production agents wrap every tool result — success or failure — in a consistent envelope:

  • status — one of ok, partial, error, timeout, cancelled.
  • error_code — stable machine identifier (CONTACT_LOCKED, QUOTA_EXCEEDED) mapped from vendor codes; never rely on the model parsing HTTP status.
  • retriable — boolean hint so the model and runtime agree on whether another attempt is worthwhile.
  • message — one-line human summary for the model; scrub secrets and PII before injection.
  • data — payload on success or partial success; omitted or null on hard failure.
  • metadata — tool name, call ID, latency, idempotency key, affected entity IDs for saga tracking.

On partial, include both succeeded and failed sub-operations in data with per-item status. The model must see “3 of 5 contacts updated; 2 failed with CONTACT_LOCKED” not a blended JSON array where failures are indistinguishable from empty results.

Partial failure in parallel and sequential batches

Agents increasingly fire parallel tool calls. Without explicit batch semantics, partial success is ambiguous:

Parallel batch policies

  • All-or-nothing — if any required tool fails, mark the whole batch error and trigger rollback of committed siblings. Use when operations must be atomic from the customer’s perspective.
  • Best-effort — return partial with per-item status; model decides whether to retry failed items or notify the user. Use for bulk enrichment where incomplete is acceptable if disclosed.
  • Fail-fast — cancel in-flight siblings on first hard failure; return immediately. Saves quota when later steps depend on the first result.

Sequential dependency chains

When tool B depends on tool A’s output, a failure at A should short-circuit B with a synthetic skipped_dependency_failed envelope rather than letting the model call B with null inputs and produce nonsense. Log the skip in tracing as a distinct span event.

Recovery policy matrix: retry, skip, escalate, rollback

After normalizing the error, the runtime — not always the model — selects a recovery action:

Condition Runtime action Model role
Transient + retriable + budget remaining Auto-retry with backoff; inject final outcome only None until retry exhausted
Permanent business error (404, validation) Inject error envelope; no retry Replan alternate path or ask user
Partial batch, best-effort policy Inject partial envelope with item manifest Retry failed items or summarize gaps
Side effect committed, sibling failed Trigger compensating transaction Explain rollback to user if visible
High-risk tool (payments, deletes) Pause run; route to HITL queue Wait for approval packet
Repeated identical failure (poison) Stop loop; send to dead letter queue None — run terminated

Encode this matrix in middleware so every tool adapter shares the same behavior. Letting each tool author return ad-hoc strings guarantees inconsistent recovery and untestable agent trajectories.

Injecting errors into the observation stream

How you present failures to the model shapes whether it recovers or hallucinates:

  • Always include failure count in the system reminder when any tool in the last turn returned error or partial — e.g. “2 tools failed; you must not claim full success.”
  • Cap error detail length — truncate vendor payloads through summarization but preserve error_code and entity IDs.
  • Distinguish tool-not-called from tool-failed — models conflate “no data returned” with “empty success.”
  • Surface timeout ambiguity — when outcome is unknown, status is timeout with retriable: true and an idempotency probe hint, not error.

Harbor Integrations added a post-tool middleware hook that appends a one-line run_health block after every tool round: { "tools_ok": 2, "tools_failed": 1, "blocking_failure": true }. Final-answer guardrails reject any user-facing message containing “complete” or “success” when blocking_failure is true.

Harbor Integrations refactor walkthrough

The team shipped four changes over two sprints:

  1. Adapter envelope — every CRM connector returns the standard schema; vendor SDK exceptions are caught and mapped to error_code at the adapter boundary.
  2. Batch manifest — parallel update_contacts calls return partial with a per-ID array; all-or-nothing mode wraps the batch in a saga with undo hooks.
  3. Recovery middleware — implements the policy matrix; auto-retries only retriable: true within per-step budget before the model sees the failure.
  4. Output guardrail — JSON schema on final response requires sync_status enum matching actual tool outcomes; mismatch triggers one self-correction turn, then HITL.

Silent failures dropped from 34% to 2.4% on a 500-run audit set. Remaining failures were genuine ambiguous timeouts where idempotency probes were not yet implemented — tracked as phase two.

Technique decision table

Approach Strengths Weaknesses Best for
Raw error strings to model Fast to prototype Silent failures, token waste, no policy Local demos only
Structured error envelope Consistent replanning, testable Adapter work per integration All production agents with tools
Runtime-only recovery (no model) Deterministic, fast Cannot handle novel failures Transient retries, rollbacks
Model-driven recovery Flexible alternate paths Hallucination risk without guardrails Business-logic failures after envelope injection
Fail entire run on any error Simple correctness story Poor UX, wasted successful work Financial transactions, safety-critical

Common pitfalls

  • Success on empty — tool returns {} on 404; model treats as valid data. Enforce non-empty success contracts.
  • Swallowed parallel failures — Promise.all without per-item catch loses which call failed. Use allSettled semantics.
  • Retry at two layers — HTTP client and agent runtime both retry, doubling side effects. Centralize retry ownership.
  • Error text as training leakage — vendor responses contain PII; scrub before model injection and trace export.
  • No blocking-failure flag — model writes cheerful summary over failed required steps. Add run_health and guardrails.
  • Timeout treated as permanent — duplicate commits on retry. Use idempotency keys and timeout status.
  • Infinite replan loops — model retries same invalid input forever. Cap recovery turns; escalate to DLQ.

Production checklist

  • Define standard tool result envelope (status, error_code, retriable, data, metadata).
  • Map every vendor exception to stable error_code at adapter boundary.
  • Choose batch policy per tool group: all-or-nothing, best-effort, or fail-fast.
  • Implement short-circuit for sequential dependency failures.
  • Centralize recovery policy matrix in middleware, not per-tool.
  • Auto-retry only retriable errors within per-step budget before model observation.
  • Inject run_health summary after each tool round with blocking_failure flag.
  • Validate final user message against actual tool outcome manifest.
  • Wire compensating rollbacks for atomic multi-write batches.
  • Route poison patterns (same error_code N times) to dead letter queue.
  • Test trajectories: all fail, partial fail, timeout, success-after-retry.
  • Measure silent-failure rate and mean steps-to-recovery per tool.

Key takeaways

  • Tool errors are a separate plane from transport retries and model hallucination.
  • Structured envelopes turn opaque failures into replannable facts.
  • Partial batch semantics must be explicit — all-or-nothing vs best-effort.
  • Runtime recovery + model replanning work together; neither alone is sufficient.
  • Harbor Integrations cut silent failures 34% → 2.4% with envelopes, batch manifests, and output guardrails.

Related reading