
LLM agent structured output and JSON schema pipeline systems explained

Harbor CRM shipped a sales assistant that drafted emails, logged calls, and created contacts through a function-calling loop. Demos were smooth: the model returned polite prose and a JSON blob with first_name, last_name, and account_id. Two weeks after launch, Salesforce rolled out API v58 and renamed AccountId to Account__c on a custom object path the agent used. Nobody updated the prompt examples. Tool invocations began failing with HTTP 400: 42% of create_contact steps failed in the first sprint. Support blamed “the AI hallucinating,” but traces showed the model was consistent: it emitted JSON that matched the old schema copy-pasted into the system prompt. The runtime had no single source of truth, no version pin, and no repair path when parse or schema validation failed. After Harbor rebuilt around a structured output pipeline (registry, provider mode selection, normalize stage, and capped repair loops), hard tool failures fell to 3.1% and the mean time to adapt to a vendor schema change dropped from nine days to under four hours.

Agent structured output is not the same as asking the model to “respond in JSON.” Production pipelines treat schemas as contracts: one registry feeds the prompt, the provider API, the parser, and downstream output validation gates. This guide covers registry design, delivery modes (native JSON schema, tool envelopes, grammar constraints), the parse-normalize-repair chain, multi-step schema handoffs, the Harbor CRM refactor, a decision table versus prompt-only JSON and post-hoc regex extraction, implementation pitfalls, and a checklist for teams wiring tool schemas into reliable agent runtimes.

What a schema pipeline does in an agent loop

Every agent step that triggers code needs machine-readable output with predictable shape. That includes explicit tool calls, implicit “action JSON” blocks, planner DAGs, and intermediate plans passed between subagents. A schema pipeline answers four questions on every generation:

Which schema applies? Tool name, step type, tenant, API version.
How is it conveyed to the model? Provider-native structured mode, tool definition, or appended JSON Schema in the user message.
How is raw text turned into a typed object? Extract, parse, normalize types, reject duplicates and trailing prose.
What happens on failure? Structured error back to the model, fallback model, or human escalation — never silent coercion.

Without a pipeline, teams duplicate schema fragments across prompts, OpenAPI stubs, and Pydantic models. Drift is inevitable. The pipeline’s core artifact is a schema registry: versioned JSON Schema (or equivalent) documents keyed by (tool_id, api_version, tenant_tier), rendered into whatever each provider expects at call time.
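A registry keyed this way can be sketched in a few lines. This is a minimal in-memory illustration, not Harbor's implementation: the SchemaKey and SchemaRegistry names, the "standard" tier, and the sample schema are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SchemaKey:
    """Lookup key from the text: (tool_id, api_version, tenant_tier)."""
    tool_id: str
    api_version: int
    tenant_tier: str


class SchemaRegistry:
    """Hypothetical in-memory registry; a real one would load from git."""

    def __init__(self) -> None:
        self._schemas = {}

    def register(self, key: SchemaKey, schema: dict) -> None:
        # Refuse silent overwrites so two teams cannot redefine one version.
        if key in self._schemas:
            raise ValueError(f"duplicate registration: {key}")
        self._schemas[key] = schema

    def resolve(self, tool_id: str, api_version: int, tenant_tier: str) -> dict:
        key = SchemaKey(tool_id, api_version, tenant_tier)
        try:
            return self._schemas[key]
        except KeyError:
            # Fail loudly: a missing schema must never fall back to a guess.
            raise LookupError(f"no schema registered for {key}") from None


registry = SchemaRegistry()
registry.register(
    SchemaKey("crm.create_contact", 58, "standard"),
    {"type": "object", "required": ["first_name", "last_name", "Account__c"]},
)
```

Every consumer (prompt renderer, provider adapter, parser) would call resolve with the same key, which is what removes the copy-paste drift described above.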

Schema registry design

Treat the registry like an API catalog, not prompt documentation:

Stable IDs — crm.create_contact@v3 survives field renames inside the document via version bumps.
Provider views — one canonical schema; adapters emit OpenAI response_format, Anthropic tool blocks, or local Zod types.
Compatibility matrix — which models support strict JSON schema mode vs grammar-only vs post-hoc validation.
Changelog — machine-readable diff between v2 and v3 injected into repair prompts when validation fails after a deploy.
Size budgets — large nested schemas blow context; project fields per step via dynamic tool exposure.
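The "provider views" idea can be illustrated with two small adapters over one canonical entry. This is a sketch under assumptions: the entry layout and helper names are hypothetical, and the output shapes follow the publicly documented OpenAI function-tool and Anthropic tool formats as commonly described, so verify them against current provider documentation before relying on them.

```python
import re

# Hypothetical canonical registry entry; the field set is illustrative.
CANONICAL = {
    "id": "crm.create_contact@v3",
    "description": "Create a CRM contact linked to an existing account.",
    "schema": {
        "type": "object",
        "additionalProperties": False,
        "required": ["first_name", "last_name", "Account__c"],
        "properties": {
            "first_name": {"type": "string"},
            "last_name": {"type": "string"},
            "Account__c": {"type": "string"},
        },
    },
}


def provider_safe_name(schema_id: str) -> str:
    """Replace characters such as '.' and '@' that tool names often reject."""
    return re.sub(r"[^a-zA-Z0-9_-]", "_", schema_id)


def to_openai_tool(entry: dict) -> dict:
    """Render the entry as an OpenAI-style function tool definition."""
    return {
        "type": "function",
        "function": {
            "name": provider_safe_name(entry["id"]),
            "description": entry["description"],
            "parameters": entry["schema"],
            "strict": True,
        },
    }


def to_anthropic_tool(entry: dict) -> dict:
    """Render the same entry as an Anthropic-style tool block."""
    return {
        "name": provider_safe_name(entry["id"]),
        "description": entry["description"],
        "input_schema": entry["schema"],
    }
```

Because both views are derived from the same dictionary, a field rename lands in every provider payload at once.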

Harbor pinned salesforce_api=58 on each run record. When v59 shipped, they added crm.create_contact@v4, ran shadow traffic for 48 hours, and flipped the pin only after parse success rate matched v3 on golden transcripts. Prompt writers no longer edit JSON by hand.
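The flip criterion described here (parse success on the new version must match the pinned one) could be expressed as a simple gate. The function names and the max_regression parameter below are hypothetical; each list holds one boolean per golden transcript, True when the completion parsed and validated.

```python
def success_rate(results: list) -> float:
    """Fraction of transcripts that parsed and validated."""
    return sum(results) / len(results) if results else 0.0


def ready_to_flip(pinned: list, candidate: list, max_regression: float = 0.0) -> bool:
    """True when the candidate schema version does at least as well as the pin."""
    if not candidate:
        # No shadow data yet: never flip on an empty sample.
        return False
    return success_rate(candidate) >= success_rate(pinned) - max_regression
```

A stricter gate might also require a minimum sample size or a per-tenant breakdown; that choice is outside what the text specifies.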

Schema authoring rules

Follow the same discipline as tool schema design: flat objects where possible, explicit enums instead of free strings, required arrays for list fields, additionalProperties: false on strict paths, and descriptions that double as repair hints (“must be 18-char Salesforce ID from search_accounts result”). Avoid polymorphic oneOf trees the model confuses unless you split into separate tools.
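As an illustration of these rules, here is one possible schema written as a Python dictionary. The lead_source enum values, the tags field, and the ID pattern are assumptions for the example; "required arrays for list fields" is read here as list fields being typed arrays that appear in required.

```python
import json

CREATE_CONTACT_V3 = {
    "$id": "crm.create_contact@v3",
    "type": "object",
    # Strict path: unknown keys are rejected rather than ignored.
    "additionalProperties": False,
    "required": ["first_name", "last_name", "Account__c", "tags"],
    "properties": {
        "first_name": {"type": "string", "minLength": 1},
        "last_name": {"type": "string", "minLength": 1},
        "Account__c": {
            "type": "string",
            "pattern": "^[A-Za-z0-9]{18}$",
            # The description doubles as a repair hint for the model.
            "description": "must be 18-char Salesforce ID from search_accounts result",
        },
        # Explicit enum instead of a free string (values are hypothetical).
        "lead_source": {"type": "string", "enum": ["web", "referral", "event"]},
        # List field as a required, typed array.
        "tags": {"type": "array", "items": {"type": "string"}},
    },
}

print(json.dumps(CREATE_CONTACT_V3, indent=2))
```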

Provider delivery modes

Three families dominate production agents in 2026. Pick per step based on model support and latency budget:

Mode	Mechanism	Strength	Weakness
Native JSON / structured outputs	API constrains tokens to schema (logits mask or grammar)	Highest parse success; fewer repair tokens	Model-specific; schema size limits; not every schema feature is available on every model tier
Tool / function envelope	Model emits `tool_calls[]` with `arguments` string	Natural fit for multi-tool loops; parallel calls	Arguments still need parse + validate; parallel partial failures
Prompt + post-hoc validation	Schema in prompt; regex or JSON.parse after	Works on any model	Lowest reliability; expensive repair; markdown leakage

See structured outputs explained for provider feature comparison. Agent pipelines often combine modes: planner step uses strict JSON mode for a small plan object; executor step uses tool envelopes for side effects. Never mix two schemas in one unconstrained generation — the model will blend keys.
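Per-step mode selection can be sketched as a lookup against the compatibility matrix. The model names and the SUPPORT table are placeholders, and the rule that side-effecting steps prefer tool envelopes is one reading of the planner/executor split described above.

```python
from enum import Enum


class Mode(Enum):
    NATIVE_STRUCTURED = "native_structured"
    TOOL_ENVELOPE = "tool_envelope"
    PROMPT_ONLY = "prompt_only"


# Hypothetical compatibility matrix; real entries come from the registry.
SUPPORT = {
    "model-a": {Mode.NATIVE_STRUCTURED, Mode.TOOL_ENVELOPE, Mode.PROMPT_ONLY},
    "model-b": {Mode.TOOL_ENVELOPE, Mode.PROMPT_ONLY},
    "model-c": {Mode.PROMPT_ONLY},
}

# Default preference: native structured > tool envelope > prompt-only.
PREFERENCE = (Mode.NATIVE_STRUCTURED, Mode.TOOL_ENVELOPE, Mode.PROMPT_ONLY)


def select_mode(model: str, has_side_effects: bool) -> Mode:
    """Pick the delivery mode for one step."""
    supported = SUPPORT.get(model, {Mode.PROMPT_ONLY})
    # Executor steps with side effects use tool envelopes when available.
    if has_side_effects and Mode.TOOL_ENVELOPE in supported:
        return Mode.TOOL_ENVELOPE
    for mode in PREFERENCE:
        if mode in supported:
            return mode
    return Mode.PROMPT_ONLY
```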

Grammar and constrained decoding

When native schema mode is unavailable, grammar-based decoders (GBNF, regex automata) guarantee syntactic validity but not semantic correctness. Still run registry validation after decode. Grammar shines for fixed micro-formats (ISO dates, ticket IDs) embedded in larger objects.
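The gap between syntactic and semantic validity is easy to see with an ISO date micro-format. In this small sketch, the regular expression stands in for what a grammar could enforce, and the standard library date parser stands in for the post-decode check; the function names are illustrative.

```python
import re
from datetime import date

# Shape a grammar can guarantee: four digits, two digits, two digits.
ISO_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def syntactically_valid(value: str) -> bool:
    """What constrained decoding can promise: the right shape."""
    return ISO_DATE_SHAPE.fullmatch(value) is not None


def semantically_valid(value: str) -> bool:
    """What validation after decode adds: the date must actually exist."""
    try:
        date.fromisoformat(value)
        return True
    except ValueError:
        return False
```

The string "2026-02-31" passes the shape check and fails the calendar check, which is why validation still runs after a grammar-constrained decode.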

Parse, normalize, and repair

Even with constrained decoding, production paths need a deterministic extraction stage:

raw_completion → extract_payload → json_parse → normalize → schema_validate → [commit | repair | escalate]

Extract

Priority order: tool-call arguments if present; else provider parsed field; else fenced ```json block; else first balanced {...} with depth counter. Reject multiple JSON roots. Log extraction method per span for quality dashboards.
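The priority order above might look like the following. The completion dictionary layout (tool_arguments, parsed, text) is an assumption for this sketch rather than any provider's response format, and the fence marker is built programmatically only to keep the example easy to embed.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built in code to keep this example fence-safe
FENCE_RE = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def balanced_objects(text: str) -> list:
    """Return every top-level {...} span using a depth counter.

    Braces inside JSON strings are skipped so they do not affect depth.
    """
    spans, depth, start = [], 0, None
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                spans.append(text[start : i + 1])
    return spans


def extract_payload(completion: dict) -> tuple:
    """Return (extraction_method, raw_json_text); raise on ambiguity."""
    if completion.get("tool_arguments") is not None:
        return "tool_arguments", completion["tool_arguments"]
    if completion.get("parsed") is not None:
        return "provider_parsed", json.dumps(completion["parsed"])
    text = completion.get("text", "")
    fenced = FENCE_RE.findall(text)
    if len(fenced) > 1:
        raise ValueError("multiple JSON roots in fenced blocks")
    if len(fenced) == 1:
        return "markdown_fence", fenced[0].strip()
    roots = balanced_objects(text)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one JSON root, found {len(roots)}")
    return "balanced_braces", roots[0]
```

The returned method string is what would be logged per span for the quality dashboards mentioned above.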

Normalize

Coerce only when schema explicitly allows: trim strings, map legacy aliases (account_id → Account__c) via registry migration rules, drop unknown keys if policy permits. Do not silently fill required fields with null — that hides model failure until the API 400s.
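A normalize stage along these lines could be sketched as follows. The ALIASES table and the drop_unknown flag are illustrative stand-ins for registry migration rules and policy; note that the function deliberately leaves missing required fields missing so the validator can report them.

```python
# Hypothetical migration rule from the registry: legacy key -> current key.
ALIASES = {"account_id": "Account__c"}


def normalize(payload: dict, schema: dict, drop_unknown: bool = False) -> dict:
    """Apply only the coercions the schema and policy explicitly allow."""
    known = set(schema.get("properties", {}))
    out = {}
    for key, value in payload.items():
        key = ALIASES.get(key, key)
        if key in out:
            # Both the legacy and the current key were sent: ambiguous.
            raise ValueError(f"duplicate key after alias mapping: {key}")
        if key not in known and drop_unknown:
            continue
        out[key] = value.strip() if isinstance(value, str) else value
    # No defaults are injected here: a missing required field stays missing.
    return out
```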

Repair loop

On schema_validate failure, inject the validator errors verbatim (JSON Pointer paths help). Cap at 2–3 attempts per step; charge repair tokens against the context budget. For tool argument errors, prefer re-issuing the tool schema snippet from the registry over paraphrasing in natural language. If the repair attempts are exhausted, return a structured observation to the agent loop, following tool error handling patterns; do not retry the same prompt at the same temperature and hope for luck.
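A capped repair loop might be structured like this. The validate function is a deliberately tiny stand-in for a real validator such as AJV or Pydantic (it checks only required, enum, and additionalProperties), and generate is a hypothetical callable that receives the previous attempt's errors and returns the model's next raw completion.

```python
import json


def validate(payload: dict, schema: dict) -> list:
    """Minimal validator returning errors prefixed with JSON Pointer paths."""
    errors = []
    props = schema.get("properties", {})
    for name in schema.get("required", []):
        if name not in payload:
            errors.append(f"/{name}: required property is missing")
    for name, value in payload.items():
        if name not in props:
            if schema.get("additionalProperties") is False:
                errors.append(f"/{name}: additional property not allowed")
            continue
        allowed = props[name].get("enum")
        if allowed is not None and value not in allowed:
            errors.append(f"/{name}: must be one of {allowed}")
    return errors


def run_with_repair(generate, schema: dict, max_attempts: int = 3) -> dict:
    """Call generate(errors) at most max_attempts times, then give up."""
    errors = []
    for attempt in range(1, max_attempts + 1):
        raw = generate(errors)
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"(root): invalid JSON at char {exc.pos}: {exc.msg}"]
            continue
        if not isinstance(payload, dict):
            errors = ["(root): expected a JSON object"]
            continue
        errors = validate(payload, schema)
        if not errors:
            return {"status": "ok", "payload": payload, "attempts": attempt}
    # Exhausted: hand a structured observation back to the agent loop.
    return {"status": "schema_error", "errors": errors, "attempts": max_attempts}
```

The error strings are passed through unchanged, so the model sees the same paths an engineer would see in the trace.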

Multi-step schema chaining

Complex workflows pass typed objects between steps: classifier → router → executor. Each hop should have its own registry entry and a smaller schema. Anti-patterns:

One mega-schema with 40 optional fields — the model omits required nested objects randomly.
Re-serializing tool results as unstructured prose between steps — re-parse errors compound.
Subagents returning free-text summaries to a parent that expects JSON — force structured handoff packages.

Harbor split create_contact into plan_contact (strict JSON: fields to collect) and commit_contact (tool call only after plan validates). Plan steps never touch Salesforce; commit steps never invent fields not in the plan object. Failure isolation improved because repair prompts referenced half-sized schemas.
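The plan/commit split can be sketched as a gate that refuses any commit argument the plan did not list. The plan layout (a fields object) and the status strings are assumptions for illustration, not Harbor's actual schemas.

```python
def validate_plan(plan: dict) -> list:
    """Stand-in for registry validation of the plan_contact output."""
    fields = plan.get("fields")
    if not isinstance(fields, dict) or not fields:
        return ["/fields: must be a non-empty object"]
    return []


def commit_contact(plan: dict, arguments: dict) -> dict:
    """Gate a commit on a valid plan and on arguments the plan listed."""
    plan_errors = validate_plan(plan)
    if plan_errors:
        return {"status": "plan_invalid", "errors": plan_errors}
    # The commit step may not invent fields that the plan never mentioned.
    invented = sorted(set(arguments) - set(plan["fields"]))
    if invented:
        return {
            "status": "rejected",
            "errors": [f"/{name}: not present in plan" for name in invented],
        }
    # Only at this point would a real executor call the external API.
    return {"status": "ready_to_commit", "arguments": arguments}
```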

Harbor CRM refactor

Before: schemas lived in a Notion doc, copied into prompts weekly. After:

Git-backed registry with CI tests: every schema must round-trip through provider adapters and validate golden completions.
Run metadata: a schema_bundle_hash recorded on each trace for replay.
Shadow mode on schema bumps: validate production completions against vNext without executing tools.
Semantic policy layer after schema gate (account ID must exist in session cache) — delegated to guardrails, not duplicated in prompts.
Alert when extract_method=markdown_fence rate rises — signals model drift off tool mode.
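One way to compute a schema_bundle_hash is to hash a canonical serialization of every schema in the bundle. The choice of SHA-256 and of sorted-key compact JSON is an assumption for this sketch; the text does not say how Harbor derived the value.

```python
import hashlib
import json


def schema_bundle_hash(schemas: dict) -> str:
    """Hash a bundle of schemas keyed by versioned ID.

    Sorting keys and fixing separators makes the hash independent of
    dictionary ordering and whitespace, so equal bundles hash equally.
    """
    canonical = json.dumps(schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Storing this value on each trace lets a replay load exactly the schema set that was live when the run happened.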

Hard failures (HTTP 4xx from tools due to shape mismatch) went from 42% to 3.1% of commit steps. Remaining failures were true business rule rejections caught by policy, not parse noise.

Decision table: pipeline vs alternatives

Approach	When it fits	Production risk
Centralized schema pipeline (this guide)	Multi-tool agents, vendor API integrations, regulated writes	Low when registry is CI-tested and version-pinned
Prompt-only “return JSON”	Prototypes, read-only demos, single-field extractions	High — markdown wrappers, key drift, no version story
Regex / LLM-as-parser second pass	Legacy models without structured modes	Medium — doubles latency and cost; parser can hallucinate
Guardrails only (no registry)	Adding policy on top of existing ad-hoc JSON	Medium — validation without a single schema source still drifts

Pipelines complement guardrails: schema proves shape; policy proves truth. Pipelines also complement reflection loops: reflection critiques prose quality, while schemas gate executable payloads.

Common pitfalls

Duplicating schema in prompt and tool definition — they will diverge; generate prompt snippets from the registry.
Over-strict additionalProperties: false on evolving APIs — breaks forward compatibility; version instead of forbidding all extras.
Repair loops without error pointers — “invalid JSON” wastes tokens; pass AJV / Pydantic paths.
Coercing invalid enums to defaults — masks hallucination; reject and repair.
Logging full payloads with PII — redact before trace export; schema paths are enough for debug.
Skipping extraction metrics — rising fence-extraction rate means tool mode regressed before users notice.
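The extraction metric in the last pitfall can be tracked with a very small check. The method label markdown_fence matches the alert described earlier; the baseline and the max_increase threshold are hypothetical tuning values.

```python
from collections import Counter


def fence_rate(methods: list) -> float:
    """Share of spans whose payload was recovered from a markdown fence."""
    if not methods:
        return 0.0
    return Counter(methods)["markdown_fence"] / len(methods)


def fence_alert(methods: list, baseline_rate: float, max_increase: float = 0.05) -> bool:
    """True when the fence rate has risen past the baseline plus tolerance."""
    return fence_rate(methods) > baseline_rate + max_increase
```

Running this per tool over a sliding window of logged extraction methods would surface a regression out of tool mode before it shows up as user-visible failures.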

Production checklist

Define canonical JSON Schema per tool/step with versioned IDs in a git-backed registry.
Generate provider tool definitions and prompt fragments from the registry in CI.
Pin external API version on each run; document bump procedure with shadow validation.
Select delivery mode per step: native structured > tool envelope > prompt-only.
Implement extract → parse → normalize → schema_validate with logged extraction method.
Cap repair attempts; feed validator errors with JSON Pointer paths.
Wire schema failures into structured tool-error observations for the agent loop.
Attach schema_bundle_hash to traces for replay and incident triage.
Alert on parse-failure rate and markdown-fence extraction spikes per tool.
Hand off validated objects to semantic policy gates before irreversible commits.

Key takeaways

Agent structured output needs a pipeline, not a prompt footnote — registry, delivery mode, extract, validate, and repair are separate stages.
One canonical schema per tool version feeds prompts, provider APIs, parsers, and guardrails; duplication is the main source of production drift.
Native JSON schema modes reduce repair cost, but tool envelopes and post-hoc validation still need the same registry and normalize rules.
Harbor CRM cut hard tool failures from 42% to 3.1% by version-pinning vendor APIs and CI-testing schema adapters.
Split large actions into smaller typed steps; mega-schemas and prose handoffs between subagents compound parse failure rates.