LLM tool error handling explained

Harbor Logistics' shipment assistant looked healthy in demos. In production, when lookup_inventory returned HTTP 404 for a mistyped SKU, the runtime passed the raw HTML error page into the observation field. The model hallucinated stock on hand, called create_shipment_label, and when that API returned 409 DUPLICATE_LABEL, the agent retried the same call three more times — issuing four paid labels for one pallet. Support tickets spiked; finance traced $2,400 in void fees to “tool error handling” that was really no handling at all.

Tool error handling is the layer between your orchestrator and the LLM that decides what happens when a function call fails: how errors are classified, whether to retry, what observation the model sees next, and when to escalate to a human. It is distinct from LLM provider retry logic (429s on OpenAI) and from output parsing (malformed JSON from the model). This guide covers tool-side error taxonomy, structured observations for ReAct loops, idempotency and retry policy, alternate-tool fallbacks, the Harbor Logistics refactor, a technique decision table, pitfalls, and a production checklist.

Tool error taxonomy

Not every tool failure should look the same to the model. Classify errors at the orchestrator before serializing an observation:

Class	Typical cause	Retry?	Model should…
Validation	Bad args, schema mismatch, unknown enum	No	Fix arguments or ask user for missing fields
Not found	SKU, user ID, or record absent	No	Try alternate lookup or clarify with user
Auth / permission	Expired token, RBAC denial	After refresh only	Stop mutating; escalate or use read-only path
Transient	Timeout, 502/503, connection reset	Yes (bounded)	Wait for orchestrator retry or pick backup tool
Rate limit	429 from downstream API	Yes with backoff	Defer or batch; do not spam identical calls
Conflict	409 duplicate, optimistic-lock failure	No (idempotent read first)	Fetch current state, reconcile, or abort
Partial success	Batch API: 8/10 rows OK	Per-item	Retry only failed IDs; never replay successes

Map HTTP status codes and vendor error codes into this taxonomy in one place — not scattered across tool implementations. The model receives error_class, a short message, and optional retry_after_ms or suggested_action fields.
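As a rough illustration of a single classification point, the sketch below maps HTTP statuses and vendor codes onto the taxonomy above. The specific status list, the vendor codes other than DUPLICATE_LABEL, the "unknown" catch-all class, and the function name classify are all assumptions made for the example, not part of any particular framework.

```python
from dataclasses import dataclass
from typing import Optional

# Assumed mapping from HTTP status to the taxonomy classes in the table above.
STATUS_TO_CLASS = {
    400: "validation",
    422: "validation",
    401: "auth",
    403: "auth",
    404: "not_found",
    408: "transient",
    409: "conflict",
    429: "rate_limit",
    502: "transient",
    503: "transient",
    504: "transient",
}

# Vendor codes take precedence over the HTTP status (illustrative values).
VENDOR_TO_CLASS = {
    "DUPLICATE_LABEL": "conflict",
    "TOKEN_EXPIRED": "auth",
}

# Only these classes are eligible for orchestrator retries.
RETRYABLE = {"transient", "rate_limit"}


@dataclass(frozen=True)
class Classified:
    error_class: str
    retryable: bool
    retry_after_ms: Optional[int] = None


def classify(
    status: int,
    vendor_code: Optional[str] = None,
    retry_after_s: Optional[float] = None,
) -> Classified:
    """Map one failed tool response onto the shared taxonomy."""
    error_class = VENDOR_TO_CLASS.get(vendor_code or "")
    if error_class is None:
        # Anything unmapped becomes "unknown" and is treated as non-retryable.
        error_class = STATUS_TO_CLASS.get(status, "unknown")
    retry_after_ms = None
    if error_class == "rate_limit" and retry_after_s is not None:
        retry_after_ms = int(retry_after_s * 1000)
    return Classified(error_class, error_class in RETRYABLE, retry_after_ms)
```

Treating unmapped codes as non-retryable is a deliberately conservative choice here: an unrecognized failure on a mutating tool is safer to escalate than to replay.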

Structured observations: what the model must see

Raw stack traces and HTML error pages poison agent reasoning. Serialize tool failures as compact JSON observations the model can act on:

{
  "status": "error",
  "tool": "lookup_inventory",
  "error_class": "not_found",
  "message": "SKU WIDGET-99X not in warehouse EAST-01",
  "retryable": false,
  "hints": ["check_sku_format", "try_search_products"]
}

Contrast with success observations that use the same envelope ("status": "ok") so the scratchpad stays uniform. For function-calling providers, return errors as tool result messages, never as assistant text the model authored. Truncate large error bodies: the model sees only the short message, and vendor detail (the first 500 characters) is kept in logs only.

Hints are optional machine-readable tags your policy layer adds: refresh_oauth, human_required, use_fallback_tool:geocode_nominatim. They steer recovery without expanding the tool catalog every turn.
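One way to keep the envelope uniform is to build every observation through a pair of small helpers. This is a minimal sketch: the field names follow the JSON example above, while the function names, the "result" key on success, and the 200-character message budget are assumptions chosen for illustration.

```python
import json
from typing import Any, Optional, Sequence

MAX_MESSAGE_CHARS = 200  # assumed budget for the message shown to the model


def error_observation(
    tool: str,
    error_class: str,
    message: str,
    retryable: bool,
    hints: Sequence[str] = (),
    retry_after_ms: Optional[int] = None,
) -> str:
    """Serialize a tool failure using the same envelope as the example above."""
    obs = {
        "status": "error",
        "tool": tool,
        "error_class": error_class,
        "message": message[:MAX_MESSAGE_CHARS],  # truncate before it hits context
        "retryable": retryable,
    }
    # Optional fields are omitted rather than sent as null to keep it compact.
    if hints:
        obs["hints"] = list(hints)
    if retry_after_ms is not None:
        obs["retry_after_ms"] = retry_after_ms
    return json.dumps(obs)


def ok_observation(tool: str, result: Any) -> str:
    """Serialize a success with the matching envelope ("status": "ok")."""
    return json.dumps({"status": "ok", "tool": tool, "result": result})
```

The returned string would then be sent back as the tool result message for the corresponding call, whatever shape the provider's API gives that message.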

Retry policy and idempotency

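The orchestrator, not the LLM, should own retries for transient tool failures. Letting the model “try again” after a timed-out or failed mutating call risks duplicating side effects, and retrying a validation error with unchanged arguments only burns steps.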

Idempotency keys — pass a stable key derived from (session_id, tool_name, canonical_args_hash) on every mutating call. Downstream APIs that honor Idempotency-Key headers make retries safe.
Retry budget — cap at 2–3 attempts for transient errors with exponential backoff; never retry validation or conflict without changed inputs.
Duplicate-action detection — if the model emits the same mutating action twice with identical args within one task, block the second call and return a conflict observation.
Timeout hierarchy — tool timeout < step timeout < user-facing deadline. Cancel in-flight HTTP when the step budget expires.
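The idempotency-key and duplicate-action items above could be sketched roughly as follows. The hashing scheme (SHA-256 over canonical JSON) and the names canonical_args_hash, idempotency_key, and DuplicateActionGuard are assumptions for illustration; any stable, collision-resistant derivation would serve the same purpose.

```python
import hashlib
import json
from typing import Any, Dict, Set, Tuple


def canonical_args_hash(args: Dict[str, Any]) -> str:
    """Hash arguments in a canonical form so key order does not matter."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def idempotency_key(session_id: str, tool_name: str, args: Dict[str, Any]) -> str:
    """Derive a stable key from (session_id, tool_name, canonical_args_hash)."""
    raw = f"{session_id}:{tool_name}:{canonical_args_hash(args)}"
    return hashlib.sha256(raw.encode("utf-8")).hexdigest()


class DuplicateActionGuard:
    """Tracks mutating calls within one task and flags exact repeats."""

    def __init__(self) -> None:
        self._seen: Set[Tuple[str, str]] = set()

    def check_and_record(self, tool_name: str, args: Dict[str, Any]) -> bool:
        """Return True if the call may proceed, False if it is a duplicate."""
        fingerprint = (tool_name, canonical_args_hash(args))
        if fingerprint in self._seen:
            return False
        self._seen.add(fingerprint)
        return True
```

In use, the orchestrator would create one guard per task, call check_and_record before each mutating tool, and return a conflict observation instead of executing when it gets False. The derived key would go into the Idempotency-Key header for downstream APIs that honor it.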

Read-only tools (search, GET) can retry more aggressively, within downstream rate limits. POST/PUT/DELETE require idempotency or explicit human approval after any failure.
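A bounded, orchestrator-owned retry loop might look like the sketch below. It assumes tool executors raise a hypothetical ToolError carrying an error_class; the default of three attempts, the base delay, and the jitter range are illustrative values rather than recommendations from any specific library.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Only these classes trigger an orchestrator retry.
RETRYABLE_CLASSES = {"transient", "rate_limit"}


class ToolError(Exception):
    """Hypothetical exception raised by tool executors after classification."""

    def __init__(self, error_class: str, message: str = "") -> None:
        super().__init__(message)
        self.error_class = error_class


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay_s: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run fn, retrying only retryable classes with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except ToolError as err:
            out_of_budget = attempt == max_attempts
            if err.error_class not in RETRYABLE_CLASSES or out_of_budget:
                raise  # validation, conflict, etc. surface immediately
            delay = base_delay_s * 2 ** (attempt - 1)
            # Add jitter so parallel agents do not retry in lockstep.
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

The sleep parameter is injectable so that tests can run without real delays. For mutating tools, fn should send the same idempotency key on every attempt.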

Alternate tools and graceful degradation

When a primary tool fails, route to a backup before asking the model to improvise:

Primary failure	Fallback pattern
Geocoding API down	Secondary provider or cached ZIP centroid
SQL warehouse timeout	Pre-aggregated summary table or cached dashboard snapshot
Payment capture error	Hold order in pending; never double-charge via retry
RAG retrieval empty	Broader hybrid search or “I don't have docs on that” template

Register fallbacks in tool metadata so the orchestrator can inject suggested_action without exposing ten tools to the model each turn. Pair with guardrails that block mutating fallbacks when auth errors occur.
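To illustrate fallback registration, here is a small sketch in which tool metadata carries a fallback name and a mutating flag, and the policy layer turns that into a suggested_action hint. The registry contents, the tool names other than geocode_nominatim, and the set of error classes eligible for fallback are all assumptions for the example.

```python
from typing import Dict, Optional

# Hypothetical tool metadata registry; names and flags are illustrative only.
TOOL_METADATA: Dict[str, Dict[str, object]] = {
    "geocode_primary": {"mutating": False, "fallback": "geocode_nominatim"},
    "geocode_nominatim": {"mutating": False, "fallback": None},
    "lookup_inventory": {"mutating": False, "fallback": "search_products"},
    "search_products": {"mutating": False, "fallback": None},
    "update_address": {"mutating": True, "fallback": "update_address_v1"},
    "update_address_v1": {"mutating": True, "fallback": None},
}

# Classes where switching tools can help. Validation and conflict errors need
# different inputs or a state check, not a different tool.
FALLBACK_ELIGIBLE = {"transient", "rate_limit", "not_found", "auth"}


def suggested_action(tool: str, error_class: str) -> Optional[str]:
    """Return a hint such as 'use_fallback_tool:<name>' or None."""
    fallback = TOOL_METADATA.get(tool, {}).get("fallback")
    if fallback is None or error_class not in FALLBACK_ELIGIBLE:
        return None
    # Unknown fallbacks are assumed to mutate, which is the safer default.
    fallback_mutates = TOOL_METADATA.get(str(fallback), {}).get("mutating", True)
    if error_class == "auth" and fallback_mutates:
        return None  # guardrail: no mutating fallback after an auth error
    return f"use_fallback_tool:{fallback}"
```

Because the hint is computed from metadata, the model only ever sees the one backup tool that applies to the current failure instead of the whole catalog.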

Harbor Logistics refactor (worked example)

The shipment assistant ran a standard ReAct loop: lookup inventory, reserve stock, create label, notify customer. Failures were logged but observations were inconsistent — sometimes empty, sometimes full HTTP bodies.

Changes shipped:

Error envelope middleware wrapped every tool executor; all failures returned the JSON schema above.
Classifier table mapped 4xx/5xx and vendor codes to error_class; 404 on inventory became not_found with SKU and warehouse in message.
Retry gate — only transient and rate_limit triggered orchestrator retries; the model never saw retryable=true on 409 conflicts.
Idempotency-Key on create_shipment_label keyed by order_id; duplicate agent attempts returned the existing label URL.
Escalation hook — two consecutive tool failures in one task opened a human queue ticket with full scratchpad attached.
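The source does not show Harbor's actual code, so the following is only a sketch of how an error envelope middleware and an escalation counter of this kind could be written. ToolFailure, with_error_envelope, EscalationTracker, and the toy lookup_inventory body are hypothetical names and behavior invented for the example.

```python
import functools
from typing import Any, Callable, Dict


class ToolFailure(Exception):
    """Hypothetical classified failure raised inside a tool executor."""

    def __init__(self, error_class: str, message: str, retryable: bool = False):
        super().__init__(message)
        self.error_class = error_class
        self.retryable = retryable


def with_error_envelope(tool_name: str) -> Callable:
    """Wrap a tool executor so every outcome uses the uniform envelope."""

    def decorator(fn: Callable[..., Any]) -> Callable[..., Dict[str, Any]]:
        @functools.wraps(fn)
        def wrapper(*args: Any, **kwargs: Any) -> Dict[str, Any]:
            try:
                result = fn(*args, **kwargs)
                return {"status": "ok", "tool": tool_name, "result": result}
            except ToolFailure as err:
                return {"status": "error", "tool": tool_name,
                        "error_class": err.error_class, "message": str(err),
                        "retryable": err.retryable}
            except Exception:
                # Unclassified failure: keep the traceback out of the context.
                return {"status": "error", "tool": tool_name,
                        "error_class": "unknown",
                        "message": "unexpected tool failure",
                        "retryable": False}
        return wrapper
    return decorator


class EscalationTracker:
    """Signals escalation after N consecutive error observations in a task."""

    def __init__(self, threshold: int = 2) -> None:
        self.threshold = threshold
        self.consecutive = 0

    def record(self, observation: Dict[str, Any]) -> bool:
        """Update the streak and return True when a human should be paged."""
        if observation["status"] == "error":
            self.consecutive += 1
        else:
            self.consecutive = 0
        return self.consecutive >= self.threshold


@with_error_envelope("lookup_inventory")
def lookup_inventory(sku: str, warehouse: str) -> Dict[str, Any]:
    # Toy body standing in for the real inventory API call.
    if sku != "WIDGET-99":
        raise ToolFailure("not_found", f"SKU {sku} not in warehouse {warehouse}")
    return {"sku": sku, "on_hand": 12}
```

In a real orchestrator the unclassified branch would also log the full exception to traces, so the detail is preserved for operators while the model sees only the compact message.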

Label duplication incidents dropped to zero over six weeks; mean agent steps per successful shipment fell from 5.2 to 3.8 because the model stopped thrashing on unrecoverable errors.

Technique decision table

Approach	Use when	Avoid when
Structured error observations	Any multi-step agent with tools	Single-shot completions with no tools
Orchestrator-owned retry	Transient downstream failures	Validation errors the model must fix
Let model retry freely	Read-only search with no quota cost	Mutating APIs, payments, inventory
Fail-fast to human	Auth failures, policy blocks, high-value transactions	Benign lookup misses where alternate tools exist
Circuit breaker on tool	Dependency in sustained outage	Rare one-off timeouts
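For the circuit breaker row, a minimal per-tool breaker could be sketched as below. The thresholds, the cooldown, and the class and method names are assumptions. This is a simplified variant of the pattern that lets calls through again once the cooldown elapses, and does not limit how many probes run concurrently.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Opens after repeated failures and allows probes after a cooldown."""

    def __init__(
        self,
        failure_threshold: int = 5,
        cooldown_s: float = 30.0,
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self.failure_threshold = failure_threshold
        self.cooldown_s = cooldown_s
        self._clock = clock
        self._failures = 0
        self._opened_at: Optional[float] = None

    def allow(self) -> bool:
        """Return True if the tool may be called right now."""
        if self._opened_at is None:
            return True  # closed: normal operation
        # Open: block until the cooldown has elapsed, then allow a probe.
        return self._clock() - self._opened_at >= self.cooldown_s

    def record_success(self) -> None:
        """A successful call closes the breaker and resets the count."""
        self._failures = 0
        self._opened_at = None

    def record_failure(self) -> None:
        """Count a failure; open (or re-open) once the threshold is reached."""
        self._failures += 1
        if self._failures >= self.failure_threshold:
            self._opened_at = self._clock()
```

When allow() returns False, the orchestrator can skip the call entirely and hand the model a transient error observation with a fallback hint, which avoids spending step budget on a dependency that is known to be down.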

Common pitfalls

Raw HTML in observations — models parse noise as data; Harbor's 404 pages looked like product descriptions.
Retrying 409 conflicts — duplicates labels, charges, and tickets; always reconcile state first.
Hiding errors from the model — returning empty observations makes the agent guess; be explicit about failure.
Same retry logic as LLM provider — tool APIs have different idempotency and rate limits.
Leaking secrets in error messages — strip API keys and internal hostnames before observations hit context.
Unbounded tool thrashing — no max steps per task; agents loop on permanent failures until budget exhaustion.
Evaluating only final answers — dangerous recovery paths (wrong retry on payment) never show up in string-match evals.
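The secrets pitfall can be reduced with a sanitizing pass before any message reaches the context. The sketch below is illustrative only: the regular expressions, the ".internal" hostname convention, and the 200-character cap are assumptions, and a real deployment would need patterns matched to its own credential and hostname formats.

```python
import re

# Illustrative patterns only. The Bearer rule runs first so that a value like
# "token: Bearer abc" has the credential itself redacted, not just the scheme.
_PATTERNS = [
    (re.compile(r"(?i)\bBearer\s+[A-Za-z0-9._~+/=-]+"), "Bearer [REDACTED]"),
    (
        re.compile(r"(?i)\b(api[_-]?key|token|secret|password)\b(\s*[=:]\s*)\S+"),
        r"\1\2[REDACTED]",
    ),
    # Assumes internal hosts end in ".internal"; adjust to the local scheme.
    (re.compile(r"\b[\w-]+(?:\.[\w-]+)*\.internal\b"), "[INTERNAL_HOST]"),
]


def sanitize(message: str, max_chars: int = 200) -> str:
    """Redact likely secrets and internal hostnames, then truncate."""
    for pattern, replacement in _PATTERNS:
        message = pattern.sub(replacement, message)
    return message[:max_chars]
```

Pattern-based redaction is a backstop rather than a guarantee, so it works best alongside tools that build their messages from known-safe fields in the first place.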

Production checklist

Central error classifier maps HTTP status and vendor codes to the error taxonomy, including whether each class is retryable.
All tool results use a uniform JSON envelope (ok vs error).
Observations exclude stack traces, HTML, and secrets.
Idempotency keys on every mutating tool call.
Orchestrator retries transient errors; model cannot retry validation failures blindly.
Duplicate-action detection for identical mutating calls in one task.
Fallback tools registered with automatic or hint-driven routing.
Human escalation after N consecutive tool failures or auth errors.
Metrics: error rate by tool, retry count, observation truncation rate.
Golden tests assert error observations and recovery paths, not just happy paths.
Timeouts aligned: tool < step < user deadline.
Full error detail in logs/traces via observability tooling; compressed view in LLM context.

Key takeaways

Tool errors are observations, not exceptions you swallow.
Classify before retry; mutating retries need idempotency.
The orchestrator owns retry policy, not the LLM.
Structured envelopes beat raw API dumps for agent reasoning.
Test failure paths as rigorously as success paths.