LLM tool error handling explained
Harbor Logistics' shipment assistant looked healthy in demos. In production,
when lookup_inventory returned HTTP 404 for a mistyped SKU, the
runtime passed the raw HTML error page into the observation field. The model
hallucinated stock on hand, called create_shipment_label, and when
that API returned 409 DUPLICATE_LABEL, the agent retried the same
call three more times — issuing four paid labels for one pallet. Support
tickets spiked; finance traced $2,400 in void fees to “tool error
handling” that was really no handling at all.
Tool error handling is the layer between your orchestrator and the LLM that decides what happens when a function call fails: how errors are classified, whether to retry, what observation the model sees next, and when to escalate to a human. It is distinct from LLM provider retry logic (429s on OpenAI) and from output parsing (malformed JSON from the model). This guide covers tool-side error taxonomy, structured observations for ReAct loops, idempotency and retry policy, alternate-tool fallbacks, the Harbor Logistics refactor, a technique decision table, pitfalls, and a production checklist.
Tool error taxonomy
Not every tool failure should look the same to the model. Classify errors at the orchestrator before serializing an observation:
| Class | Typical cause | Retry? | Model should… |
|---|---|---|---|
| Validation | Bad args, schema mismatch, unknown enum | No | Fix arguments or ask user for missing fields |
| Not found | SKU, user ID, or record absent | No | Try alternate lookup or clarify with user |
| Auth / permission | Expired token, RBAC denial | After refresh only | Stop mutating; escalate or use read-only path |
| Transient | Timeout, 502/503, connection reset | Yes (bounded) | Wait for orchestrator retry or pick backup tool |
| Rate limit | 429 from downstream API | Yes with backoff | Defer or batch; do not spam identical calls |
| Conflict | 409 duplicate, optimistic-lock failure | No (idempotent read first) | Fetch current state, reconcile, or abort |
| Partial success | Batch API: 8/10 rows OK | Per-item | Retry only failed IDs; never replay successes |
Map HTTP status codes and vendor error codes into this taxonomy in one place
— not scattered across tool implementations. The model receives
error_class, a short message, and optional
retry_after_ms or suggested_action fields.
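A central classifier along these lines could be sketched as follows. This is a minimal illustration, not a definitive mapping: the vendor codes other than DUPLICATE_LABEL, the `unknown` fallback class, and the choice to treat unlisted 5xx responses as transient are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical vendor codes; real values depend on the downstream API.
VENDOR_CODE_CLASSES = {
    "DUPLICATE_LABEL": "conflict",
    "INVALID_SKU_FORMAT": "validation",
}

# HTTP status -> taxonomy class, kept in one table instead of per tool.
STATUS_CLASSES = {
    400: "validation",
    422: "validation",
    401: "auth",
    403: "auth",
    404: "not_found",
    408: "transient",
    409: "conflict",
    429: "rate_limit",
    502: "transient",
    503: "transient",
    504: "transient",
}

# Only these classes are eligible for orchestrator retries.
RETRYABLE = {"transient", "rate_limit"}


@dataclass(frozen=True)
class Classified:
    error_class: str
    retryable: bool
    retry_after_ms: Optional[int] = None


def classify(
    status: int,
    vendor_code: Optional[str] = None,
    retry_after_s: Optional[float] = None,
) -> Classified:
    """Map a downstream failure to the taxonomy. Vendor codes win over status."""
    if vendor_code in VENDOR_CODE_CLASSES:
        error_class = VENDOR_CODE_CLASSES[vendor_code]
    elif status in STATUS_CLASSES:
        error_class = STATUS_CLASSES[status]
    elif 500 <= status < 600:
        # Assumption: unlisted 5xx responses are treated as transient.
        error_class = "transient"
    else:
        # Assumption: anything unrecognized is non-retryable by default.
        error_class = "unknown"

    retryable = error_class in RETRYABLE
    retry_after_ms = None
    if retryable and retry_after_s is not None:
        retry_after_ms = int(retry_after_s * 1000)
    return Classified(error_class, retryable, retry_after_ms)
```

Partial success is left out of the sketch because it usually has to be detected from the response body rather than from a status code.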
Structured observations: what the model must see
Raw stack traces and HTML error pages poison agent reasoning. Serialize tool failures as compact JSON observations the model can act on:
```json
{
  "status": "error",
  "tool": "lookup_inventory",
  "error_class": "not_found",
  "message": "SKU WIDGET-99X not in warehouse EAST-01",
  "retryable": false,
  "hints": ["check_sku_format", "try_search_products"]
}
```
Success observations use the same envelope ("status": "ok") so the scratchpad stays uniform. For function-calling providers, return errors as tool result messages, never as assistant text the model authored. Truncate large error bodies; keep the first 500 characters of vendor detail in logs only.
Hints are optional machine-readable tags your policy layer
adds: refresh_oauth, human_required,
use_fallback_tool:geocode_nominatim. They steer recovery without
expanding the tool catalog every turn.
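One way to produce such envelopes is a small serializer that sanitizes the message before it reaches the model. The sketch below is illustrative only: the 200-character message budget, the redaction patterns, and the `result` field on success observations are assumptions, and a regex-based tag stripper is a rough filter rather than a real HTML parser.

```python
import json
import re
from typing import Any, Sequence

MAX_MESSAGE_CHARS = 200  # assumed context budget for the message field

# Illustrative patterns only; extend for the secrets your stack can leak.
_SECRET_PATTERNS = [
    re.compile(r"(?i)bearer\s+[a-z0-9._\-]+"),
    re.compile(r"(?i)(api[_-]?key|token|secret)=[^\s&]+"),
]
_TAG = re.compile(r"<[^>]+>")


def sanitize(text: str) -> str:
    """Strip markup, redact secrets, collapse whitespace, then truncate."""
    text = _TAG.sub(" ", text)
    for pattern in _SECRET_PATTERNS:
        text = pattern.sub("[redacted]", text)
    text = " ".join(text.split())
    if len(text) > MAX_MESSAGE_CHARS:
        text = text[: MAX_MESSAGE_CHARS - 3] + "..."
    return text


def error_observation(
    tool: str,
    error_class: str,
    message: str,
    retryable: bool,
    hints: Sequence[str] = (),
) -> str:
    """Serialize a tool failure into the compact JSON envelope."""
    return json.dumps(
        {
            "status": "error",
            "tool": tool,
            "error_class": error_class,
            "message": sanitize(message),
            "retryable": retryable,
            "hints": list(hints),
        }
    )


def ok_observation(tool: str, result: Any) -> str:
    """Success uses the same envelope shape; `result` is an assumed field name."""
    return json.dumps({"status": "ok", "tool": tool, "result": result})
```

The unsanitized message would still go to logs and traces; only the sanitized envelope is returned as the tool result.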
Retry policy and idempotency
The orchestrator, not the LLM, should own retries for transient tool failures. Letting the model “try again” on its own duplicates side effects on mutating calls and wastes steps on validation errors that cannot succeed until the arguments change.
- Idempotency keys: pass a stable key derived from (session_id, tool_name, canonical_args_hash) on every mutating call. Downstream APIs that honor Idempotency-Key headers make retries safe.
- Retry budget: cap at 2–3 attempts for transient errors with exponential backoff; never retry validation or conflict errors without changed inputs.
- Duplicate-action detection: if the model emits the same mutating action twice with identical args within one task, block the second call and return a conflict observation.
- Timeout hierarchy: tool timeout < step timeout < user-facing deadline. Cancel in-flight HTTP requests when the step budget expires.
Read-only tools (search, GET) can retry aggressively. POST/PUT/DELETE require idempotency or explicit human approval after any failure.
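The key derivation, retry budget, and duplicate-action check above could be sketched roughly as follows. The `ToolError` exception, the function names, and the default delays are hypothetical choices for illustration; the sketch also assumes tool arguments are JSON-serializable.

```python
import hashlib
import json
import random
import time
from typing import Any, Callable, Dict, Set

RETRYABLE_CLASSES = {"transient", "rate_limit"}


class ToolError(Exception):
    """Hypothetical exception carrying the taxonomy class of a tool failure."""

    def __init__(self, error_class: str, message: str = ""):
        super().__init__(message)
        self.error_class = error_class


def idempotency_key(session_id: str, tool_name: str, args: Dict[str, Any]) -> str:
    """Stable key from (session_id, tool_name, canonical_args_hash)."""
    # Sorted keys and fixed separators make the hash independent of arg order.
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    args_hash = hashlib.sha256(canonical.encode("utf-8")).hexdigest()
    raw = f"{session_id}:{tool_name}:{args_hash}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


def call_with_retry(
    fn: Callable[[], Any],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> Any:
    """Retry only retryable classes, with exponential backoff plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ToolError as exc:
            out_of_budget = attempt == max_attempts
            if exc.error_class not in RETRYABLE_CLASSES or out_of_budget:
                raise  # validation, conflict, auth, etc. surface immediately
            delay = base_delay_s * (2 ** (attempt - 1))
            sleep(delay + random.uniform(0, delay * 0.1))


class DuplicateActionGuard:
    """Tracks mutating calls already attempted within one task."""

    def __init__(self) -> None:
        self._seen: Set[str] = set()

    def check(self, key: str) -> bool:
        """Return True the first time a key is seen, False on a repeat."""
        if key in self._seen:
            return False
        self._seen.add(key)
        return True
```

In this arrangement the same key can serve both purposes: it is sent as the Idempotency-Key header on the mutating request and checked against the guard before the call is dispatched, so a repeated action is answered with a conflict observation instead of a second request.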
Alternate tools and graceful degradation
When a primary tool fails, route to a backup before asking the model to improvise:
| Primary failure | Fallback pattern |
|---|---|
| Geocoding API down | Secondary provider or cached ZIP centroid |
| SQL warehouse timeout | Pre-aggregated summary table or cached dashboard snapshot |
| Payment capture error | Hold order in pending; never double-charge via retry |
| RAG retrieval empty | Broader hybrid search or “I don't have docs on that” template |
Register fallbacks in tool metadata so the orchestrator can inject
suggested_action without exposing ten tools to the model each turn.
Pair with guardrails that block mutating fallbacks when auth errors occur.
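A fallback registry with that guardrail might look like the sketch below. All tool names except geocode_nominatim and create_shipment_label are hypothetical, and the registry is shown as a plain dictionary where a real system would likely read it from tool metadata.

```python
from typing import Dict, Optional

# Hypothetical registry: primary tool -> fallback tool.
FALLBACKS: Dict[str, str] = {
    "geocode_primary": "geocode_nominatim",
    "query_warehouse": "read_summary_table",
    "create_shipment_label": "create_label_backup_carrier",
}

# Tools with side effects; fallbacks in this set are blocked on auth errors.
MUTATING_TOOLS = {"create_shipment_label", "create_label_backup_carrier"}


def suggest_fallback(tool: str, error_class: str) -> Optional[str]:
    """Return a suggested_action hint for a failed tool, or None if blocked."""
    fallback = FALLBACKS.get(tool)
    if fallback is None:
        return None
    # Guardrail: never route to a mutating fallback after an auth failure.
    if error_class == "auth" and fallback in MUTATING_TOOLS:
        return None
    return f"use_fallback_tool:{fallback}"
```

The returned string can be placed in the observation's hints or suggested_action field, so the model learns about the backup tool only when it becomes relevant. A mutating fallback reached this way would still need its own idempotency key.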
Harbor Logistics refactor (worked example)
The shipment assistant ran a standard ReAct loop: lookup inventory, reserve stock, create label, notify customer. Failures were logged but observations were inconsistent — sometimes empty, sometimes full HTTP bodies.
Changes shipped:
- Error envelope middleware wrapped every tool executor; all failures returned the JSON schema above.
- Classifier table mapped 4xx/5xx and vendor codes to error_class; 404 on inventory became not_found with SKU and warehouse in message.
- Retry gate: only transient and rate_limit triggered orchestrator retries; the model never saw retryable=true on 409 conflicts.
- Idempotency-Key on create_shipment_label keyed by order_id; duplicate agent attempts returned the existing label URL.
- Escalation hook: two consecutive tool failures in one task opened a human queue ticket with full scratchpad attached.
Label duplication incidents dropped to zero over six weeks; mean agent steps per successful shipment fell from 5.2 to 3.8 because the model stopped thrashing on unrecoverable errors.
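The escalation hook from the list above could be approximated with a per-task failure streak. This is a sketch under assumptions, not Harbor's actual code: the `EscalationTracker` class, the `open_ticket` callback, and the choice to escalate each task at most once are all hypothetical.

```python
from typing import Callable, Dict, List, Set


class EscalationTracker:
    """Opens a human ticket after N consecutive tool failures in one task."""

    def __init__(
        self,
        open_ticket: Callable[[str, List[str]], None],
        threshold: int = 2,
    ) -> None:
        self._open_ticket = open_ticket  # hypothetical ticketing callback
        self._threshold = threshold
        self._streak: Dict[str, int] = {}
        self._escalated: Set[str] = set()

    def record(self, task_id: str, status: str, scratchpad: List[str]) -> bool:
        """Record one tool result; return True if this call escalated."""
        if status == "ok":
            self._streak[task_id] = 0  # any success resets the streak
            return False
        self._streak[task_id] = self._streak.get(task_id, 0) + 1
        should_escalate = (
            self._streak[task_id] >= self._threshold
            and task_id not in self._escalated
        )
        if should_escalate:
            self._escalated.add(task_id)
            # Attach a copy so later scratchpad edits do not alter the ticket.
            self._open_ticket(task_id, list(scratchpad))
            return True
        return False
```

The orchestrator would call `record` after every tool result and stop the agent loop for that task when it returns True.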
Technique decision table
| Approach | Use when | Avoid when |
|---|---|---|
| Structured error observations | Any multi-step agent with tools | Single-shot completions with no tools |
| Orchestrator-owned retry | Transient downstream failures | Validation errors the model must fix |
| Let model retry freely | Read-only search with no quota cost | Mutating APIs, payments, inventory |
| Fail-fast to human | Auth failures, policy blocks, high-value transactions | Benign lookup misses where alternate tools exist |
| Circuit breaker on tool | Dependency in sustained outage | Rare one-off timeouts |
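
For the circuit-breaker row, a minimal per-tool breaker might look like the sketch below. The thresholds, the class name, and the injectable clock are assumptions for illustration; production breakers usually add per-tool metrics and limit the half-open state to a single trial call.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Stops calling a tool after repeated failures until a cooldown passes."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._threshold = failure_threshold
        self._cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """True if a call may proceed (closed, or cooldown has elapsed)."""
        if self._opened_at is None:
            return True
        # After the cooldown, let a trial call through (half-open state).
        return self._clock() - self._opened_at >= self._cooldown_s

    def record_success(self) -> None:
        """A success closes the breaker and clears the failure count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) once the threshold is reached."""
        self._failures += 1
        if self._failures >= self._threshold:
            self._opened_at = self._clock()
```

While the breaker is open, the orchestrator can skip the call entirely and return a transient error observation with a fallback hint, which avoids spending the retry budget on a dependency that is known to be down.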
Common pitfalls
- Raw HTML in observations — models parse noise as data; Harbor's 404 pages looked like product descriptions.
- Retrying 409 conflicts — duplicates labels, charges, and tickets; always reconcile state first.
- Hiding errors from the model — returning empty observations makes the agent guess; be explicit about failure.
- Same retry logic as LLM provider — tool APIs have different idempotency and rate limits.
- Leaking secrets in error messages — strip API keys and internal hostnames before observations hit context.
- Unbounded tool thrashing — no max steps per task; agents loop on permanent failures until budget exhaustion.
- Evaluating only final answers — dangerous recovery paths (wrong retry on payment) never show up in string-match evals.
Production checklist
- Central error classifier maps vendor codes to retryable taxonomy.
- All tool results use a uniform JSON envelope (ok vs error).
- Observations exclude stack traces, HTML, and secrets.
- Idempotency keys on every mutating tool call.
- Orchestrator retries transient errors; model cannot retry validation failures blindly.
- Duplicate-action detection for identical mutating calls in one task.
- Fallback tools registered with automatic or hint-driven routing.
- Human escalation after N consecutive tool failures or auth errors.
- Metrics: error rate by tool, retry count, observation truncation rate.
- Golden tests assert error observations and recovery paths, not just happy paths.
- Timeouts aligned: tool < step < user deadline.
- Full error detail in logs/traces via observability tooling; compressed view in LLM context.
Key takeaways
- Tool errors are observations, not exceptions you swallow.
- Classify before retry; mutating retries need idempotency.
- The orchestrator owns retry policy, not the LLM.
- Structured envelopes beat raw API dumps for agent reasoning.
- Test failure paths as rigorously as success paths.
Related reading
- LLM function calling explained — schemas, tool results, and the multi-turn call loop
- LLM ReAct agent loop explained — where observations land in the scratchpad
- LLM retry and fallback explained — provider-side resilience (distinct from tool errors)
- AI agents and tool use explained — broader patterns and guardrails