Guide

LLM agent tool call parsing and validation production systems explained

Harbor Ops’ internal support agents invoked forty-seven CRM and ticketing tools through a standard ReAct loop. On paper the stack was correct: schemas in the prompt, native tool_calls from the provider API, and a router that forwarded arguments to integrations. In production, 38% of invocations never reached a tool handler — the model hallucinated tool names outside the run manifest, emitted JSON with trailing commas and single-quoted keys, passed string amounts where integers were required, and occasionally wrapped arguments in markdown fences the parser did not strip.

Worse, 11% of “successful” parses executed with silently wrong types: a ticket id sent as a float, a boolean "true" string that bypassed strict mode, and enum values one character off that still deserialized. FinOps attributed 22% of wasted tool latency to parse-retry storms and unstructured error messages the model could not recover from. Engineers shipped a dedicated parse-and-validate pipeline with streaming fragment assembly, JSON Schema gates, argument sanitization, and bounded repair loops. Parse failures fell to 4.2%; wrong-type executions dropped below 0.3%. This guide explains how production agents turn model output into safe, executable tool calls — beyond the function-calling primer.

Why parsing is a production gate, not a JSON.parse one-liner

Function calling looks deterministic in docs: the model returns a tool name and arguments object; your runtime executes it. Reality is messier across providers, model tiers, and streaming modes. Small models invent tools that sound plausible; large models over-nest arguments; streaming APIs deliver tool call fragments across dozens of chunks; and fallback models may switch from native tool blocks to inline JSON when the primary endpoint degrades.

What the parse-and-validate layer must guarantee

Name resolution — map model output to a tool in the run-scoped manifest; reject unknown names without executing.
Structural validity — produce a typed object matching the tool’s JSON Schema before any side effect.
Semantic safety — clamp ranges, strip injection payloads, and coerce only where coercion is explicitly allowed.
Streaming completeness — assemble partial tool call deltas without executing until the finish reason confirms completion.
Actionable errors — return structured validation failures the model can fix in one repair turn.
Auditability — log raw model output, normalized arguments, and validation verdict for replay.

Treat this layer like request validation at an API gateway: nothing hits your integrations until it passes.

Architecture: extractor, validator, sanitizer, executor

A production pipeline has four stages with clear failure boundaries:

1. Extraction (model output → raw call)

Prefer native tool call blocks from provider APIs when available — they avoid markdown ambiguity and pair naturally with structured output modes. Maintain a fallback extractor for legacy text paths: regex or lightweight parsers for <tool> XML tags, fenced JSON blocks, and ReAct Action: lines. Never execute from the first successful regex match without validation.

For streaming responses, buffer tool call fragments by index or id until the stream signals tool_calls complete. Partial arguments are useful for UI progress bars but must not reach the validator early.

2. Validation (raw call → typed arguments)

Validate against the run-scoped schema snapshot, not live registry definitions that may have changed mid-run. Use JSON Schema with additionalProperties: false on write tools, required-field enforcement, and enum strictness. Reject type coercion surprises: if the schema says integer, do not accept "42" unless your policy explicitly allows string-to-number coercion and logs it.

3. Sanitization (typed arguments → safe arguments)

Even valid JSON can be dangerous. Strip HTML from string fields destined for email bodies, cap array lengths on bulk-update tools, normalize Unicode homoglyphs in identifiers, and run injection heuristics on free-text arguments that will be concatenated into downstream prompts. Map tenant-scoped ids server-side — never trust the model to supply a customer id without cross-checking session context.

4. Execution handoff

Only after validation + sanitization does the call enter the tool router: permission gates, parallelism policy, idempotency keys, and finally handler execution. Return validation errors as structured observations, not stack traces.

Native tool calls vs text extraction

Native provider tool_calls

Best default for production. Advantages: schema binding at the API layer, cleaner streaming assembly, fewer fence-and-prose failures. Still validate — providers do not guarantee enum compliance on every model tier. Watch for parallel tool call arrays: validate each element independently and apply batch policies from your error-handling guide.

Text / ReAct extraction

Needed for smaller open-weight models, custom hosts, or emergency fallback when native tools are disabled. Use a strict grammar where possible; when using free-form JSON extraction, run a repair pass (close braces, normalize quotes) before schema validation. Cap repair attempts — infinite repair loops burn tokens without improving outcomes.

Hybrid routing

Production fleets often route by model capability: native path for GPT-4 class models, text extractor for local Llama endpoints. Unify both behind the same validator so integrations see identical argument shapes regardless of extraction path.

Repair loops without runaway cost

When validation fails, feed the model a compact error envelope:

{
  "error": "tool_validation_failed",
  "tool": "update_ticket",
  "field": "priority",
  "expected": "enum: low | normal | high | urgent",
  "received": "critial",
  "hint": "Fix the tool call and retry. Do not apologize."
}

Bound repairs: Harbor Ops allows two validation repair turns per tool step before escalating to a smaller schema subset or human review. Pair with model fallback only for transport failures, not schema ignorance — falling back to a weaker model rarely fixes JSON Schema compliance.

Track metrics: validation failure rate by tool, by model, by schema version. Spikes after a prompt template rollout usually mean the model was not shown updated enum values in the manifest snapshot.

Security and abuse resistance

Manifest allowlisting — reject any tool name not in the run snapshot; never fuzzy-match delete_all_tickets to delete_ticket.
Write-tool stricter schemas — narrower enums, required confirmation tokens for destructive operations.
Argument size limits — reject multi-megabyte JSON blobs the model pasted from context.
Cross-field consistency — validate that account_id in arguments matches the authenticated tenant session.
No prompt leakage via args — block tool arguments that embed system prompt fragments or credential patterns.

Route high-risk validated calls through human approval workflows before execution — validation proves shape, not intent.

Observability

Instrument each stage with spans: extract, validate, sanitize, execute. Tag with tool name, model id, schema version, extraction path (native vs text), and repair attempt count. Alert when:

Validation failure rate exceeds 8% over 15 minutes for a single tool.
Repair loop average exceeds 1.4 attempts (schema or prompt drift).
Unknown tool name rate climbs (manifest exposure bug or model drift).
Streaming assembly timeouts grow (client disconnects mid-tool-call).

Store redacted raw tool call payloads for deterministic replay when debugging parse regressions across model upgrades.

Harbor Ops refactor

Harbor Ops shipped five changes in one release:

Unified parse router — native and text extractors emit a common ParsedToolCall struct.
Run-scoped schema snapshots — AJV validation with additionalProperties: false on all write tools.
Streaming assembler — per-index buffers; execute only on finish_reason: tool_calls.
Bounded repair prompts — max two validation repairs; structured error envelopes only.
Sanitization profile per tool — string length caps, HTML strip, tenant id cross-check on CRM mutations.

Parse failures dropped from 38% to 4.2%. Wrong-type silent executions fell from 11% to 0.3%. Average tool step latency improved 19% because fewer tokens were wasted on unrecoverable retry loops.

Technique decision table

Approach	Strength	Weakness	Best for
Trust model JSON, execute immediately	Lowest latency	Hallucinated tools, type bugs, injection	Prototypes only
Native tool_calls + JSON Schema validate	Reliable, provider-aligned	Still needs sanitization on writes	Default production path
Text extract + repair loop	Works on models without native tools	Higher failure rate; token cost	Fallback / open-weight hosts
Strict schema + zero repair	Predictable cost ceiling	More user-visible failures	High-risk financial tools
LLM-based “fix my JSON”	Handles exotic malformations	Extra model call; may hallucinate fixes	Offline batch repair only
Constrained decoding / grammar	Prevents invalid syntax at source	Provider support varies; schema limits	High-volume single-tool endpoints

Pitfalls

Validating against live registry — mid-run schema changes break in-flight runs; pin snapshots at run start.
Executing on partial streams — half a JSON object reaches CRM before the closing brace arrives.
Fuzzy tool name matching — maps destructive tools to benign handlers or vice versa.
Coercing everything — silent string-to-number fixes hide model quality regressions.
Generic error messages — invalid JSON without field paths forces blind retries.
Unbounded repair loops — one bad enum burns an entire context window.
Skipping sanitization on read tools — search arguments still reach SQL builders and log pipelines.
Giant tool manifests without routing — models confuse similar tool names; use dynamic tool selection to shrink exposure.

Production checklist

Pin JSON Schema snapshots per run; version schemas in the manifest.
Prefer native tool_calls; keep a tested text extractor for fallback models.
Buffer streaming tool fragments until the provider signals completion.
Validate with strict schemas; set additionalProperties false on write tools.
Sanitize strings, cap array sizes, and cross-check tenant-scoped ids.
Return structured validation errors with field paths and expected types.
Cap validation repair attempts per tool step (two is a common default).
Reject unknown tool names; never fuzzy-match against the manifest.
Log raw, normalized, and redacted argument forms for audit replay.
Metric validation failure rate by tool, model, and schema version.
Route destructive validated calls through approval gates where required.
Regression-test parse pipeline on every model upgrade and prompt rollout.

Key takeaways

Tool call parsing is a production safety gate — validate and sanitize before any side effect, even with native provider tool blocks.
Pin schema snapshots per run so registry changes do not corrupt in-flight agent loops.
Streaming requires explicit assembly; never execute partial tool call fragments.
Structured validation errors enable bounded repair loops without token-wasting blind retries.
Harbor Ops cut parse failures from 38% to 4.2% with a unified extractor, strict AJV gates, and per-tool sanitization profiles.