Guide
LLM agent tool call parsing and validation production systems explained
Harbor Ops’ internal support agents invoked forty-seven CRM and ticketing
tools through a standard
ReAct loop.
On paper the stack was correct: schemas in the prompt, native
tool_calls from the provider API, and a router that forwarded
arguments to integrations. In production, 38% of invocations
never reached a tool handler — the model hallucinated tool names outside
the run manifest, emitted JSON with trailing commas and single-quoted keys,
passed string amounts where integers were required, and occasionally wrapped
arguments in markdown fences the parser did not strip.
Worse, 11% of “successful” parses executed with
silently wrong types: a ticket id sent as a float, a boolean
"true" string that bypassed strict mode, and enum values one
character off that still deserialized. FinOps attributed
22% of wasted tool latency to parse-retry storms and
unstructured error messages
the model could not recover from. Engineers shipped a dedicated
parse-and-validate pipeline with streaming fragment assembly,
JSON Schema gates, argument sanitization, and bounded repair loops. Parse
failures fell to 4.2%; wrong-type executions dropped below
0.3%. This guide explains how production agents turn model
output into safe, executable tool calls — beyond the
function-calling primer.
Why parsing is a production gate, not a JSON.parse one-liner
Function calling looks deterministic in docs: the model returns a tool name and arguments object; your runtime executes it. Reality is messier across providers, model tiers, and streaming modes. Small models invent tools that sound plausible; large models over-nest arguments; streaming APIs deliver tool call fragments across dozens of chunks; and fallback models may switch from native tool blocks to inline JSON when the primary endpoint degrades.
What the parse-and-validate layer must guarantee
- Name resolution — map model output to a tool in the run-scoped manifest; reject unknown names without executing.
- Structural validity — produce a typed object matching the tool’s JSON Schema before any side effect.
- Semantic safety — clamp ranges, strip injection payloads, and coerce only where coercion is explicitly allowed.
- Streaming completeness — assemble partial tool call deltas without executing until the finish reason confirms completion.
- Actionable errors — return structured validation failures the model can fix in one repair turn.
- Auditability — log raw model output, normalized arguments, and validation verdict for replay.
Treat this layer like request validation at an API gateway: nothing hits your integrations until it passes.
Architecture: extractor, validator, sanitizer, executor
A production pipeline has four stages with clear failure boundaries:
1. Extraction (model output → raw call)
Prefer native tool call blocks from provider APIs when
available — they avoid markdown ambiguity and pair naturally with
structured output modes.
Maintain a fallback extractor for legacy text paths: regex or lightweight
parsers for <tool> XML tags, fenced JSON blocks, and
ReAct Action: lines. Never execute from the first successful
regex match without validation.
For
streaming responses,
buffer tool call fragments by index or id until
the stream signals tool_calls complete. Partial arguments are
useful for UI progress bars but must not reach the validator early.
2. Validation (raw call → typed arguments)
Validate against the run-scoped schema snapshot, not live
registry definitions that may have changed mid-run. Use JSON Schema with
additionalProperties: false on write tools, required-field
enforcement, and enum strictness. Reject type coercion surprises: if the
schema says integer, do not accept "42" unless
your policy explicitly allows string-to-number coercion and logs it.
3. Sanitization (typed arguments → safe arguments)
Even valid JSON can be dangerous. Strip HTML from string fields destined for email bodies, cap array lengths on bulk-update tools, normalize Unicode homoglyphs in identifiers, and run injection heuristics on free-text arguments that will be concatenated into downstream prompts. Map tenant-scoped ids server-side — never trust the model to supply a customer id without cross-checking session context.
4. Execution handoff
Only after validation + sanitization does the call enter the tool router: permission gates, parallelism policy, idempotency keys, and finally handler execution. Return validation errors as structured observations, not stack traces.
Native tool calls vs text extraction
Native provider tool_calls
Best default for production. Advantages: schema binding at the API layer, cleaner streaming assembly, fewer fence-and-prose failures. Still validate — providers do not guarantee enum compliance on every model tier. Watch for parallel tool call arrays: validate each element independently and apply batch policies from your error-handling guide.
Text / ReAct extraction
Needed for smaller open-weight models, custom hosts, or emergency fallback when native tools are disabled. Use a strict grammar where possible; when using free-form JSON extraction, run a repair pass (close braces, normalize quotes) before schema validation. Cap repair attempts — infinite repair loops burn tokens without improving outcomes.
Hybrid routing
Production fleets often route by model capability: native path for GPT-4 class models, text extractor for local Llama endpoints. Unify both behind the same validator so integrations see identical argument shapes regardless of extraction path.
Repair loops without runaway cost
When validation fails, feed the model a compact error envelope:
{
"error": "tool_validation_failed",
"tool": "update_ticket",
"field": "priority",
"expected": "enum: low | normal | high | urgent",
"received": "critial",
"hint": "Fix the tool call and retry. Do not apologize."
}
Bound repairs: Harbor Ops allows two validation repair turns per tool step before escalating to a smaller schema subset or human review. Pair with model fallback only for transport failures, not schema ignorance — falling back to a weaker model rarely fixes JSON Schema compliance.
Track metrics: validation failure rate by tool, by model, by schema version. Spikes after a prompt template rollout usually mean the model was not shown updated enum values in the manifest snapshot.
Security and abuse resistance
- Manifest allowlisting — reject any tool name not in
the run snapshot; never fuzzy-match
delete_all_ticketstodelete_ticket. - Write-tool stricter schemas — narrower enums, required confirmation tokens for destructive operations.
- Argument size limits — reject multi-megabyte JSON blobs the model pasted from context.
- Cross-field consistency — validate that
account_idin arguments matches the authenticated tenant session. - No prompt leakage via args — block tool arguments that embed system prompt fragments or credential patterns.
Route high-risk validated calls through human approval workflows before execution — validation proves shape, not intent.
Observability
Instrument each stage with spans: extract,
validate, sanitize, execute. Tag with
tool name, model id, schema version, extraction path (native vs text), and
repair attempt count. Alert when:
- Validation failure rate exceeds 8% over 15 minutes for a single tool.
- Repair loop average exceeds 1.4 attempts (schema or prompt drift).
- Unknown tool name rate climbs (manifest exposure bug or model drift).
- Streaming assembly timeouts grow (client disconnects mid-tool-call).
Store redacted raw tool call payloads for deterministic replay when debugging parse regressions across model upgrades.
Harbor Ops refactor
Harbor Ops shipped five changes in one release:
- Unified parse router — native and text extractors
emit a common
ParsedToolCallstruct. - Run-scoped schema snapshots — AJV validation with
additionalProperties: falseon all write tools. - Streaming assembler — per-index buffers; execute
only on
finish_reason: tool_calls. - Bounded repair prompts — max two validation repairs; structured error envelopes only.
- Sanitization profile per tool — string length caps, HTML strip, tenant id cross-check on CRM mutations.
Parse failures dropped from 38% to 4.2%. Wrong-type silent executions fell from 11% to 0.3%. Average tool step latency improved 19% because fewer tokens were wasted on unrecoverable retry loops.
Technique decision table
| Approach | Strength | Weakness | Best for |
|---|---|---|---|
| Trust model JSON, execute immediately | Lowest latency | Hallucinated tools, type bugs, injection | Prototypes only |
| Native tool_calls + JSON Schema validate | Reliable, provider-aligned | Still needs sanitization on writes | Default production path |
| Text extract + repair loop | Works on models without native tools | Higher failure rate; token cost | Fallback / open-weight hosts |
| Strict schema + zero repair | Predictable cost ceiling | More user-visible failures | High-risk financial tools |
| LLM-based “fix my JSON” | Handles exotic malformations | Extra model call; may hallucinate fixes | Offline batch repair only |
| Constrained decoding / grammar | Prevents invalid syntax at source | Provider support varies; schema limits | High-volume single-tool endpoints |
Pitfalls
- Validating against live registry — mid-run schema changes break in-flight runs; pin snapshots at run start.
- Executing on partial streams — half a JSON object reaches CRM before the closing brace arrives.
- Fuzzy tool name matching — maps destructive tools to benign handlers or vice versa.
- Coercing everything — silent string-to-number fixes hide model quality regressions.
- Generic error messages —
invalid JSONwithout field paths forces blind retries. - Unbounded repair loops — one bad enum burns an entire context window.
- Skipping sanitization on read tools — search arguments still reach SQL builders and log pipelines.
- Giant tool manifests without routing — models confuse similar tool names; use dynamic tool selection to shrink exposure.
Production checklist
- Pin JSON Schema snapshots per run; version schemas in the manifest.
- Prefer native tool_calls; keep a tested text extractor for fallback models.
- Buffer streaming tool fragments until the provider signals completion.
- Validate with strict schemas; set additionalProperties false on write tools.
- Sanitize strings, cap array sizes, and cross-check tenant-scoped ids.
- Return structured validation errors with field paths and expected types.
- Cap validation repair attempts per tool step (two is a common default).
- Reject unknown tool names; never fuzzy-match against the manifest.
- Log raw, normalized, and redacted argument forms for audit replay.
- Metric validation failure rate by tool, model, and schema version.
- Route destructive validated calls through approval gates where required.
- Regression-test parse pipeline on every model upgrade and prompt rollout.
Key takeaways
- Tool call parsing is a production safety gate — validate and sanitize before any side effect, even with native provider tool blocks.
- Pin schema snapshots per run so registry changes do not corrupt in-flight agent loops.
- Streaming requires explicit assembly; never execute partial tool call fragments.
- Structured validation errors enable bounded repair loops without token-wasting blind retries.
- Harbor Ops cut parse failures from 38% to 4.2% with a unified extractor, strict AJV gates, and per-tool sanitization profiles.
Related reading
- LLM function calling explained — primer on tool definitions, schemas, and provider APIs
- Structured output and JSON Schema pipeline explained — constrain model output at generation time
- Tool error handling and partial failure recovery explained — what happens after validation passes but execution fails
- ReAct agent loop explained — where parse-and-validate sits in the think-act-observe cycle