LLM agent tool error handling and partial failure recovery explained
Harbor Integrations ships a CRM sync agent that updates contacts, creates
tasks, and posts activity notes across Salesforce and HubSpot in a single
run. The first production build passed raw HTTP status codes and Java stack
traces back to the model as tool “results.” When three of five
parallel writes failed with 403 Forbidden, the model summarized
“sync complete” because two calls succeeded and the error blobs
looked like noise.
34% of audited runs left customer records half-updated
with no alert — support tickets blamed “the AI lied.”
After engineering introduced structured tool error envelopes,
explicit partial-batch semantics, and a recovery policy matrix coordinated
with retry budgets and compensating rollbacks, the silent-failure rate fell
to 2.4% while mean steps-to-recovery dropped from 4.8 to 1.6.
Tool error handling in agent systems is not the same problem as HTTP retry or circuit breaking. Those layers decide whether to re-execute a call. This layer decides what the model sees after a tool returns — success, partial success, retriable failure, permanent failure, or ambiguous timeout — and which recovery actions the runtime may take without the model hallucinating a clean outcome. This guide covers error envelopes, batch semantics, injection into the observation stream, recovery policy selection, Harbor Integrations’ refactor, a technique decision table, pitfalls, and a production checklist.
Why tool errors differ from model and transport errors
Three failure planes stack in every agent run, and conflating them causes the silent-failure pattern Harbor Integrations hit:
- Transport errors: TCP reset, 503, rate limit. Handled by retry and backoff before the model ever sees a result.
- Tool execution errors: valid HTTP 200 with { "error": "contact_locked" }, or 404 because the record was deleted. The call “succeeded” at the wire layer but failed at the business layer.
- Model reasoning errors: the model misreads a well-formed error envelope and claims success anyway. Caught by output validation and structured final-answer contracts.
Tool error handling sits squarely in the middle plane. Its job is to normalize every tool outcome into a schema the model can reason about, attach enough context for replanning, and prevent optimistic synthesis when any required step failed.
The structured error envelope
Never pass raw stack traces or empty strings to the model. Production agents wrap every tool result — success or failure — in a consistent envelope:
- status: one of ok, partial, error, timeout, cancelled.
- error_code: stable machine identifier (CONTACT_LOCKED, QUOTA_EXCEEDED) mapped from vendor codes; never rely on the model parsing HTTP status.
- retriable: boolean hint so the model and runtime agree on whether another attempt is worthwhile.
- message: one-line human summary for the model; scrub secrets and PII before injection.
- data: payload on success or partial success; omitted or null on hard failure.
- metadata: tool name, call ID, latency, idempotency key, affected entity IDs for saga tracking.
On partial, include both succeeded and failed sub-operations
in data with per-item status. The model must see
“3 of 5 contacts updated; 2 failed with CONTACT_LOCKED”,
not a blended JSON array where failures are indistinguishable from
empty results.
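As an illustration of the envelope fields and the partial case described above, here is a minimal sketch in Python. The class name ToolResult, the helper partial_result, and the fallback code MULTIPLE_ERRORS are assumptions made for this example, not part of any Harbor Integrations code.

```python
from dataclasses import dataclass, field
from typing import Any, Optional

STATUSES = {"ok", "partial", "error", "timeout", "cancelled"}


@dataclass
class ToolResult:
    """Envelope returned for every tool call, success or failure."""
    status: str
    message: str
    error_code: Optional[str] = None
    retriable: bool = False
    data: Optional[dict[str, Any]] = None
    metadata: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        if self.status not in STATUSES:
            raise ValueError(f"unknown status: {self.status}")


def partial_result(items: list[dict[str, Any]], tool: str, call_id: str) -> ToolResult:
    """Build a partial envelope that keeps succeeded and failed items apart.

    Assumes at least one item failed; each failed item carries an error_code.
    """
    succeeded = [i for i in items if i["status"] == "ok"]
    failed = [i for i in items if i["status"] != "ok"]
    codes = sorted({i["error_code"] for i in failed})
    message = (
        f"{len(succeeded)} of {len(items)} contacts updated; "
        f"{len(failed)} failed with {', '.join(codes)}"
    )
    return ToolResult(
        status="partial",
        message=message,
        # MULTIPLE_ERRORS is a hypothetical code for mixed failure causes.
        error_code=codes[0] if len(codes) == 1 else "MULTIPLE_ERRORS",
        data={"succeeded": succeeded, "failed": failed},
        metadata={"tool": tool, "call_id": call_id},
    )


items = [
    {"id": "c1", "status": "ok"},
    {"id": "c2", "status": "ok"},
    {"id": "c3", "status": "ok"},
    {"id": "c4", "status": "error", "error_code": "CONTACT_LOCKED"},
    {"id": "c5", "status": "error", "error_code": "CONTACT_LOCKED"},
]
result = partial_result(items, tool="update_contacts", call_id="call-1")
print(result.message)  # 3 of 5 contacts updated; 2 failed with CONTACT_LOCKED
```

Keeping succeeded and failed items in separate lists means the model never has to infer a failure from a missing or empty entry.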
Partial failure in parallel and sequential batches
Agents increasingly fire parallel tool calls. Without explicit batch semantics, partial success is ambiguous:
Parallel batch policies
- All-or-nothing: if any required tool fails, mark the whole batch error and trigger rollback of committed siblings. Use when operations must be atomic from the customer’s perspective.
- Best-effort: return partial with per-item status; the model decides whether to retry failed items or notify the user. Use for bulk enrichment where incomplete is acceptable if disclosed.
- Fail-fast: cancel in-flight siblings on first hard failure; return immediately. Saves quota when later steps depend on the first result.
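The best-effort and all-or-nothing policies can be sketched with asyncio as follows. This is one possible shape under stated assumptions: run_batch, ToolError, the policy strings, the rolled_back item status, and the undo callback are all hypothetical names, and fail-fast cancellation is left out to keep the example short.

```python
import asyncio
from typing import Awaitable, Callable


class ToolError(Exception):
    """Hypothetical exception raised by a tool with a stable error_code."""
    def __init__(self, error_code: str) -> None:
        super().__init__(error_code)
        self.error_code = error_code


async def run_batch(
    calls: dict[str, Callable[[], Awaitable[dict]]],
    policy: str,
    undo: Callable[[str], Awaitable[None]],
) -> dict:
    """Run calls in parallel and fold the outcomes into one batch envelope."""
    ids = list(calls)
    # return_exceptions=True keeps every outcome, so no failure is lost.
    outcomes = await asyncio.gather(*(calls[i]() for i in ids), return_exceptions=True)
    items = []
    for item_id, outcome in zip(ids, outcomes):
        if isinstance(outcome, ToolError):
            items.append({"id": item_id, "status": "error", "error_code": outcome.error_code})
        elif isinstance(outcome, BaseException):
            items.append({"id": item_id, "status": "error", "error_code": "UNMAPPED_ERROR"})
        else:
            items.append({"id": item_id, "status": "ok", "data": outcome})
    failed = [i for i in items if i["status"] == "error"]
    if not failed:
        return {"status": "ok", "data": {"items": items}}
    if policy == "all_or_nothing":
        # Undo siblings that already committed, then fail the whole batch.
        for item in items:
            if item["status"] == "ok":
                await undo(item["id"])
                item["status"] = "rolled_back"
        return {"status": "error", "data": {"items": items}}
    return {"status": "partial", "data": {"items": items}}


async def demo() -> tuple[dict, dict, list[str]]:
    undone: list[str] = []

    async def ok() -> dict:
        return {"updated": True}

    async def locked() -> dict:
        raise ToolError("CONTACT_LOCKED")

    async def undo(item_id: str) -> None:
        undone.append(item_id)

    calls = {"c1": ok, "c2": locked}
    best_effort = await run_batch(calls, "best_effort", undo)
    atomic = await run_batch(calls, "all_or_nothing", undo)
    return best_effort, atomic, undone


best_effort, atomic, undone = asyncio.run(demo())
print(best_effort["status"], atomic["status"], undone)  # partial error ['c1']
```

The same two calls produce different envelopes depending on the policy, which is why the policy has to be chosen per tool group rather than left implicit.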
Sequential dependency chains
When tool B depends on tool A’s output, a failure at A should
short-circuit B with a synthetic skipped_dependency_failed
envelope rather than letting the model call B with null inputs and
produce nonsense. Log the skip in tracing as a distinct span event.
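A short sketch of that short-circuit, assuming each tool is a plain function that takes the previous step's data and returns an envelope dict. The function run_chain, the two example tools, and the choice to report skipped steps with status cancelled are illustrative assumptions.

```python
from typing import Callable, Optional


def run_chain(steps: list[tuple[str, Callable[[Optional[dict]], dict]]]) -> list[dict]:
    """Run tools in order; once one fails, skip the rest with a synthetic envelope."""
    results: list[dict] = []
    previous: Optional[dict] = None
    failed_step: Optional[str] = None
    for name, tool in steps:
        if failed_step is not None:
            # The dependent tool is never called, so it cannot see null inputs.
            results.append({
                "tool": name,
                "status": "cancelled",
                "error_code": "skipped_dependency_failed",
                "retriable": False,
                "message": f"Skipped because {failed_step} failed",
                "data": None,
            })
            continue
        envelope = tool(previous)
        results.append({"tool": name, **envelope})
        if envelope["status"] == "ok":
            previous = envelope["data"]
        else:
            failed_step = name
    return results


def find_contact(_: Optional[dict]) -> dict:
    return {"status": "error", "error_code": "CONTACT_NOT_FOUND", "retriable": False,
            "message": "No contact matched", "data": None}


def create_task(contact: Optional[dict]) -> dict:
    return {"status": "ok", "data": {"task_for": contact["id"]}}


results = run_chain([("find_contact", find_contact), ("create_task", create_task)])
print([r["status"] for r in results])  # ['error', 'cancelled']
```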
Recovery policy matrix: retry, skip, escalate, rollback
After normalizing the error, the runtime — not always the model — selects a recovery action:
| Condition | Runtime action | Model role |
|---|---|---|
| Transient + retriable + budget remaining | Auto-retry with backoff; inject final outcome only | None until retry exhausted |
| Permanent business error (404, validation) | Inject error envelope; no retry | Replan alternate path or ask user |
| Partial batch, best-effort policy | Inject partial envelope with item manifest | Retry failed items or summarize gaps |
| Side effect committed, sibling failed | Trigger compensating transaction | Explain rollback to user if visible |
| High-risk tool (payments, deletes) | Pause run; route to HITL queue | Wait for approval packet |
| Repeated identical failure (poison) | Stop loop; send to dead letter queue | None — run terminated |
Encode this matrix in middleware so every tool adapter shares the same behavior. Letting each tool author return ad-hoc strings guarantees inconsistent recovery and untestable agent trajectories.
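One way to encode the matrix as a single function is sketched below. The action names, the StepContext fields, the poison threshold of 3, and the order in which rows are checked are assumptions of this example; it also reads the high-risk row as applying when a high-risk tool fails, which is only one interpretation of the table.

```python
from dataclasses import dataclass


@dataclass
class StepContext:
    retries_left: int                 # remaining per-step retry budget
    high_risk: bool = False           # payments, deletes
    committed_siblings: bool = False  # another call in the batch already wrote
    identical_failures: int = 0       # same error_code seen in a row
    poison_threshold: int = 3         # hypothetical default


def choose_action(envelope: dict, ctx: StepContext) -> str:
    """Map a normalized envelope plus run context to one recovery action."""
    status = envelope["status"]
    if status == "ok":
        return "continue"
    if ctx.identical_failures >= ctx.poison_threshold:
        return "dead_letter"        # stop the loop, run terminated
    if ctx.high_risk:
        return "pause_for_human"    # route to the HITL queue
    if envelope.get("retriable") and ctx.retries_left > 0:
        return "auto_retry"         # model sees only the final outcome
    if status == "error" and ctx.committed_siblings:
        return "compensate"         # trigger compensating transaction
    if status == "partial":
        return "inject_partial"     # model retries items or summarizes gaps
    return "inject_error"           # model replans or asks the user


print(choose_action({"status": "error", "retriable": True}, StepContext(retries_left=2)))
```

Because the function is pure, each row of the matrix can be covered by a unit test without running a model.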
Injecting errors into the observation stream
How you present failures to the model shapes whether it recovers or hallucinates:
- Always include failure count in the system reminder when any tool in the last turn returned error or partial, e.g. “2 tools failed; you must not claim full success.”
- Cap error detail length: truncate vendor payloads through summarization but preserve error_code and entity IDs.
- Distinguish tool-not-called from tool-failed: models conflate “no data returned” with “empty success.”
- Surface timeout ambiguity: when outcome is unknown, status is timeout with retriable: true and an idempotency probe hint, not error.
Harbor Integrations added a post-tool middleware hook that appends a
one-line run_health block after every tool round:
{ "tools_ok": 2, "tools_failed": 1, "blocking_failure": true }.
Final-answer guardrails reject any user-facing message containing
“complete” or “success” when
blocking_failure is true.
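The run_health block and the keyword guardrail might look roughly like this. The set of statuses treated as blocking, the optional required flag on each envelope, and the regular expression are assumptions for illustration; the source only specifies the three run_health fields and the two rejected words.

```python
import json
import re

# Assumption: these statuses block a success claim when the tool was required.
BLOCKING = {"error", "partial", "timeout"}

SUCCESS_CLAIM = re.compile(r"\b(complete|success)", re.IGNORECASE)


def run_health(envelopes: list[dict]) -> dict:
    """Summarize one tool round for injection into the observation stream."""
    ok = sum(1 for e in envelopes if e["status"] == "ok")
    blocking = any(
        e["status"] in BLOCKING and e.get("required", True) for e in envelopes
    )
    return {
        "tools_ok": ok,
        "tools_failed": len(envelopes) - ok,
        "blocking_failure": blocking,
    }


def guard_final_answer(message: str, health: dict) -> bool:
    """Return True when the message may be shown to the user."""
    claims_success = SUCCESS_CLAIM.search(message) is not None
    return not (health["blocking_failure"] and claims_success)


health = run_health([
    {"status": "ok"},
    {"status": "ok"},
    {"status": "error", "error_code": "CONTACT_LOCKED"},
])
print(json.dumps(health))
print(guard_final_answer("Sync complete.", health))
print(guard_final_answer("2 of 3 updates applied; 1 contact is locked.", health))
```

A keyword check like this is deliberately blunt: it will also reject a message such as "the sync did not complete", so a rejected draft is a prompt for one corrected turn rather than proof that the model misreported.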
Harbor Integrations refactor walkthrough
The team shipped four changes over two sprints:
- Adapter envelope: every CRM connector returns the standard schema; vendor SDK exceptions are caught and mapped to error_code at the adapter boundary.
- Batch manifest: parallel update_contacts calls return partial with a per-ID array; all-or-nothing mode wraps the batch in a saga with undo hooks.
- Recovery middleware: implements the policy matrix; auto-retries only retriable: true within per-step budget before the model sees the failure.
- Output guardrail: JSON schema on final response requires sync_status enum matching actual tool outcomes; mismatch triggers one self-correction turn, then HITL.
Silent failures dropped from 34% to 2.4% on a 500-run audit set. Remaining failures were genuine ambiguous timeouts where idempotency probes were not yet implemented — tracked as phase two.
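To make the adapter-boundary mapping concrete, here is a hedged sketch. VendorApiError is a stand-in exception, not a class from the Salesforce or HubSpot SDKs, and the vendor codes in ERROR_MAP are invented for the example; a real adapter would map whatever its SDK actually raises.

```python
from typing import Callable


class VendorApiError(Exception):
    """Stand-in for a vendor SDK exception; not a real SDK class."""
    def __init__(self, http_status: int, vendor_code: str, detail: str) -> None:
        super().__init__(detail)
        self.http_status = http_status
        self.vendor_code = vendor_code


# Invented vendor codes mapped to a stable error_code and a retriable hint.
ERROR_MAP = {
    "RECORD_LOCKED": ("CONTACT_LOCKED", True),
    "RATE_LIMITED": ("QUOTA_EXCEEDED", True),
    "ACCESS_DENIED": ("PERMISSION_DENIED", False),
}


def call_adapter(tool: str, fn: Callable[[], dict]) -> dict:
    """Run one connector call and always return the standard envelope."""
    metadata = {"tool": tool}
    try:
        data = fn()
    except VendorApiError as exc:
        code, retriable = ERROR_MAP.get(exc.vendor_code, ("UNMAPPED_VENDOR_ERROR", False))
        metadata["vendor_http_status"] = exc.http_status
        return {
            "status": "error",
            "error_code": code,
            "retriable": retriable,
            # Generic text only: the raw vendor detail may carry PII or secrets.
            "message": f"{tool} failed with {code}",
            "data": None,
            "metadata": metadata,
        }
    return {"status": "ok", "error_code": None, "retriable": False,
            "message": f"{tool} succeeded", "data": data, "metadata": metadata}


def locked_update() -> dict:
    raise VendorApiError(409, "RECORD_LOCKED", "locked by jane.doe@example.com")


envelope = call_adapter("update_contact", locked_update)
print(envelope["error_code"], envelope["retriable"])  # CONTACT_LOCKED True
```

Unknown vendor codes fall through to a non-retriable default so that an unmapped failure is surfaced rather than retried blindly.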
Technique decision table
| Approach | Strengths | Weaknesses | Best for |
|---|---|---|---|
| Raw error strings to model | Fast to prototype | Silent failures, token waste, no policy | Local demos only |
| Structured error envelope | Consistent replanning, testable | Adapter work per integration | All production agents with tools |
| Runtime-only recovery (no model) | Deterministic, fast | Cannot handle novel failures | Transient retries, rollbacks |
| Model-driven recovery | Flexible alternate paths | Hallucination risk without guardrails | Business-logic failures after envelope injection |
| Fail entire run on any error | Simple correctness story | Poor UX, wasted successful work | Financial transactions, safety-critical |
Common pitfalls
- Success on empty: tool returns {} on 404; model treats as valid data. Enforce non-empty success contracts.
- Swallowed parallel failures: Promise.all without per-item catch loses which call failed. Use allSettled semantics.
- Retry at two layers: HTTP client and agent runtime both retry, doubling side effects. Centralize retry ownership.
- Error text as training leakage: vendor responses contain PII; scrub before model injection and trace export.
- No blocking-failure flag: model writes cheerful summary over failed required steps. Add run_health and guardrails.
- Timeout treated as permanent: duplicate commits on retry. Use idempotency keys and timeout status.
- Infinite replan loops: model retries same invalid input forever. Cap recovery turns; escalate to DLQ.
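As an example of the first pitfall, a non-empty success contract could be enforced with a small check like the one below. The function name, the required_keys parameter, and the EMPTY_SUCCESS code are hypothetical and chosen only for this sketch.

```python
def enforce_success_contract(envelope: dict, required_keys: set[str]) -> dict:
    """Downgrade an 'ok' envelope whose data is empty or missing required keys."""
    if envelope["status"] != "ok":
        return envelope
    data = envelope.get("data") or {}
    missing = sorted(required_keys - set(data))
    if not missing:
        return envelope
    return {
        **envelope,
        "status": "error",
        "error_code": "EMPTY_SUCCESS",  # hypothetical stable code
        "retriable": False,
        "message": f"Tool reported success but returned no {', '.join(missing)}",
        "data": None,
    }


empty = enforce_success_contract({"status": "ok", "data": {}}, {"id"})
full = enforce_success_contract({"status": "ok", "data": {"id": "c1"}}, {"id"})
print(empty["status"], full["status"])  # error ok
```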
Production checklist
- Define standard tool result envelope (status, error_code, retriable, message, data, metadata).
- Map every vendor exception to stable error_code at adapter boundary.
- Choose batch policy per tool group: all-or-nothing, best-effort, or fail-fast.
- Implement short-circuit for sequential dependency failures.
- Centralize recovery policy matrix in middleware, not per-tool.
- Auto-retry only retriable errors within per-step budget before model observation.
- Inject run_health summary after each tool round with blocking_failure flag.
- Validate final user message against actual tool outcome manifest.
- Wire compensating rollbacks for atomic multi-write batches.
- Route poison patterns (same error_code N times) to dead letter queue.
- Test trajectories: all fail, partial fail, timeout, success-after-retry.
- Measure silent-failure rate and mean steps-to-recovery per tool.
Key takeaways
- Tool errors are a separate plane from transport retries and model hallucination.
- Structured envelopes turn opaque failures into replannable facts.
- Partial batch semantics must be explicit — all-or-nothing vs best-effort.
- Runtime recovery + model replanning work together; neither alone is sufficient.
- Harbor Integrations cut silent failures 34% → 2.4% with envelopes, batch manifests, and output guardrails.
Related reading
- LLM agent retry, backoff and transient failure recovery explained — when to re-execute at the transport layer
- LLM agent parallel tool execution explained — concurrency caps and dependency graphs
- LLM agent compensating transactions and saga rollbacks explained — undoing partial commits
- LLM agent guardrails and output validation explained — catching success claims that contradict tool state