LLM agent structured output and JSON schema pipeline systems explained
Harbor CRM shipped a sales assistant that drafted emails, logged calls, and
created contacts through a function-calling loop. Demos were smooth: the model
returned polite prose and a JSON blob with first_name, last_name, and account_id.
Two weeks after launch, Salesforce rolled API v58 and renamed
AccountId to Account__c on a custom object path the
agent used. Nobody updated the prompt examples. Tool invocations began failing
with HTTP 400 — 42% of create_contact steps
in the first sprint. Support blamed “the AI hallucinating,” but
traces showed the model was consistent: it emitted JSON that matched the
old schema copy-pasted into the system prompt. The runtime had no
single source of truth, no version pin, and no repair path when parse or
schema validation failed. After Harbor rebuilt around a
structured output pipeline — registry, provider mode
selection, normalize stage, and capped repair loops — hard tool failures
fell to 3.1% and mean time to adapt a vendor schema change
dropped from nine days to under four hours.
Agent structured output is not the same as asking the model to “respond in JSON.” Production pipelines treat schemas as contracts: one registry feeds the prompt, the provider API, the parser, and downstream output validation gates. This guide covers registry design, delivery modes (native JSON schema, tool envelopes, grammar constraints), the parse-normalize-repair chain, multi-step schema handoffs, the Harbor CRM refactor, a decision table versus prompt-only JSON and post-hoc regex extraction, implementation pitfalls, and a checklist for teams wiring tool schemas into reliable agent runtimes.
What a schema pipeline does in an agent loop
Every agent step that triggers code needs machine-readable output with predictable shape. That includes explicit tool calls, implicit “action JSON” blocks, planner DAGs, and intermediate plans passed between subagents. A schema pipeline answers four questions on every generation:
- Which schema applies? Tool name, step type, tenant, API version.
- How is it conveyed to the model? Provider-native structured mode, tool definition, or appended JSON Schema in the user message.
- How is raw text turned into a typed object? Extract, parse, normalize types, reject duplicates and trailing prose.
- What happens on failure? Structured error back to the model, fallback model, or human escalation — never silent coercion.
Without a pipeline, teams duplicate schema fragments across prompts, OpenAPI
stubs, and Pydantic models. Drift is inevitable. The pipeline’s core artifact
is a schema registry: versioned JSON Schema (or equivalent)
documents keyed by (tool_id, api_version, tenant_tier), rendered
into whatever each provider expects at call time.
Schema registry design
Treat the registry like an API catalog, not prompt documentation:
- Stable IDs: crm.create_contact@v3 survives field renames inside the document via version bumps.
- Provider views: one canonical schema; adapters emit OpenAI response_format, Anthropic tool blocks, or local Zod types.
- Compatibility matrix: which models support strict JSON schema mode vs grammar-only vs post-hoc validation.
- Changelog: machine-readable diff between v2 and v3 injected into repair prompts when validation fails after a deploy.
- Size budgets: large nested schemas blow context; project fields per step via dynamic tool exposure.
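A minimal sketch of the stable-ID and provider-view ideas, assuming an in-memory registry. SchemaEntry, SchemaRegistry, and the two adapter functions are hypothetical names, and the payload shapes follow the publicly documented tool-definition formats as understood here; check them against current provider docs before relying on them.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class SchemaEntry:
    tool_id: str                 # e.g. "crm.create_contact"
    version: int                 # bumped when fields are renamed
    description: str
    json_schema: dict[str, Any]  # the single canonical schema

    @property
    def stable_id(self) -> str:
        return f"{self.tool_id}@v{self.version}"


class SchemaRegistry:
    """In-memory stand-in for a git-backed registry (hypothetical)."""

    def __init__(self) -> None:
        self._entries: dict[str, SchemaEntry] = {}

    def register(self, entry: SchemaEntry) -> None:
        # Published versions are immutable: a change means a new version.
        if entry.stable_id in self._entries:
            raise ValueError(f"{entry.stable_id} already registered; bump the version")
        self._entries[entry.stable_id] = entry

    def get(self, stable_id: str) -> SchemaEntry:
        return self._entries[stable_id]


def to_openai_tool(entry: SchemaEntry) -> dict[str, Any]:
    # Assumption: dots are flattened because function names are restricted
    # to letters, digits, underscores, and dashes.
    return {
        "type": "function",
        "function": {
            "name": entry.tool_id.replace(".", "_"),
            "description": entry.description,
            "parameters": entry.json_schema,
        },
    }


def to_anthropic_tool(entry: SchemaEntry) -> dict[str, Any]:
    return {
        "name": entry.tool_id.replace(".", "_"),
        "description": entry.description,
        "input_schema": entry.json_schema,
    }
```

Both adapters read the same json_schema object, so a field rename lands in every provider view at once instead of being copied by hand.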
Harbor pinned salesforce_api=58 on each run record. When v59
shipped, they added crm.create_contact@v4, ran shadow traffic
for 48 hours, and flipped the pin only after parse success rate matched v3
on golden transcripts. Prompt writers no longer edit JSON by hand.
Schema authoring rules
Follow the same discipline as tool schema design: flat objects where possible,
explicit enums instead of free strings, required arrays for list fields,
additionalProperties: false on strict paths, and descriptions that double as
repair hints (“must be 18-char Salesforce ID from search_accounts result”).
Avoid polymorphic oneOf trees, which models tend to confuse; split them into
separate tools instead.
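To make the rules concrete, here is an illustrative schema plus a small lint pass. The lead_source field, its enum values, and the lint_schema helper are hypothetical; only the account_id description is taken from the text above.

```python
from typing import Any

CREATE_CONTACT_V3: dict[str, Any] = {
    "type": "object",
    "additionalProperties": False,  # strict path: no extra keys
    "required": ["first_name", "last_name", "account_id", "lead_source"],
    "properties": {
        "first_name": {"type": "string", "description": "Given name, no titles"},
        "last_name": {"type": "string", "description": "Family name"},
        "account_id": {
            "type": "string",
            "description": "must be 18-char Salesforce ID from search_accounts result",
        },
        "lead_source": {
            "type": "string",
            "enum": ["web", "referral", "event"],  # hypothetical values
            "description": "One of the listed sources; do not invent new ones",
        },
    },
}


def lint_schema(schema: dict[str, Any]) -> list[str]:
    """Return human-readable violations of the authoring rules above."""
    problems: list[str] = []
    if schema.get("additionalProperties") is not False:
        problems.append("additionalProperties must be false on strict paths")
    for name, prop in schema.get("properties", {}).items():
        if "description" not in prop:
            problems.append(f"{name}: missing description (used as repair hint)")
        if "oneOf" in prop or "anyOf" in prop:
            problems.append(f"{name}: polymorphic field; split into separate tools")
        if prop.get("type") == "object":
            problems.append(f"{name}: nested object; prefer a flat shape")
    return problems
```

A check like this could run in CI next to the registry so that authoring rules are enforced rather than remembered.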
Provider delivery modes
Three families dominate production agents in 2026. Pick per step based on model support and latency budget:
| Mode | Mechanism | Strength | Weakness |
|---|---|---|---|
| Native JSON / structured outputs | API constrains tokens to schema (logits mask or grammar) | Highest parse success; fewer repair tokens | Model-specific; schema size limits; not all fields on all tiers |
| Tool / function envelope | Model emits tool_calls[] with arguments string | Natural fit for multi-tool loops; parallel calls | Arguments still need parse + validate; parallel partial failures |
| Prompt + post-hoc validation | Schema in prompt; regex or JSON.parse after | Works on any model | Lowest reliability; expensive repair; markdown leakage |
See structured outputs explained for provider feature comparison. Agent pipelines often combine modes: planner step uses strict JSON mode for a small plan object; executor step uses tool envelopes for side effects. Never mix two schemas in one unconstrained generation — the model will blend keys.
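Per-step mode selection can be sketched as a lookup against a compatibility matrix. The model names and the COMPAT table below are placeholders, not real provider capabilities; the preference order mirrors the native, then tool envelope, then prompt-only ranking used in this guide.

```python
from enum import IntEnum


class DeliveryMode(IntEnum):
    # Higher value means preferred when the model supports it.
    PROMPT_ONLY = 1
    TOOL_ENVELOPE = 2
    NATIVE_STRUCTURED = 3


# Hypothetical compatibility matrix; real entries would come from the registry.
COMPAT: dict[str, set[DeliveryMode]] = {
    "model-a": {DeliveryMode.NATIVE_STRUCTURED, DeliveryMode.TOOL_ENVELOPE, DeliveryMode.PROMPT_ONLY},
    "model-b": {DeliveryMode.TOOL_ENVELOPE, DeliveryMode.PROMPT_ONLY},
    "model-c": {DeliveryMode.PROMPT_ONLY},
}


def select_mode(model: str, has_side_effects: bool) -> DeliveryMode:
    """Pick a delivery mode for one step.

    Steps with side effects (executor steps) prefer tool envelopes;
    pure planning steps take the strictest mode the model supports.
    """
    supported = COMPAT.get(model, {DeliveryMode.PROMPT_ONLY})
    if has_side_effects and DeliveryMode.TOOL_ENVELOPE in supported:
        return DeliveryMode.TOOL_ENVELOPE
    return max(supported)
```

Unknown models fall back to prompt-only here, which is the conservative default: the post-hoc validation path works on any model.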
Grammar and constrained decoding
When native schema mode is unavailable, grammar-based decoders (GBNF, regex automata) guarantee syntactic validity but not semantic correctness. Still run registry validation after decode. Grammar shines for fixed micro-formats (ISO dates, ticket IDs) embedded in larger objects.
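The gap between syntactic and semantic validity is easy to see with a date micro-format. In this sketch a plain regex stands in for a grammar constraint (an assumption; a real decoder would enforce it at the token level), and the standard library performs the semantic check afterwards.

```python
import re
from datetime import date

# Shape-only constraint, similar to what a grammar can guarantee.
ISO_DATE_SHAPE = re.compile(r"\d{4}-\d{2}-\d{2}")


def is_syntactically_valid(value: str) -> bool:
    return ISO_DATE_SHAPE.fullmatch(value) is not None


def is_semantically_valid(value: str) -> bool:
    # date.fromisoformat rejects impossible calendar dates.
    try:
        date.fromisoformat(value)
    except ValueError:
        return False
    return True


candidate = "2026-02-31"  # right shape, but February has no 31st
shape_ok = is_syntactically_valid(candidate)
meaning_ok = is_semantically_valid(candidate)
```

Here shape_ok is True while meaning_ok is False, which is why registry validation still runs after a constrained decode.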
Parse, normalize, and repair
Even with constrained decoding, production paths need a deterministic extraction stage:
raw_completion → extract_payload → json_parse → schema_validate → [commit | repair | escalate]
Extract
Priority order: tool-call arguments if present; else provider
parsed field; else fenced ```json block; else first
balanced {...} with depth counter. Reject multiple JSON roots.
Log extraction method per span for quality dashboards.
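The priority order above might look like the following. The completion dict is a hypothetical normalized provider response with optional tool_calls, parsed, and text keys, not any specific SDK object; for brevity only the first tool call is read, whereas a real loop would iterate over parallel calls.

```python
import json
import re
from typing import Any

FENCE = "`" * 3  # built at runtime to avoid a literal fence in this example
FENCED_JSON = re.compile(FENCE + r"json\s*(.*?)" + FENCE, re.DOTALL)


def find_json_roots(text: str) -> list[str]:
    """Return every top-level balanced {...} span using a depth counter."""
    roots: list[str] = []
    depth, start = 0, -1
    in_string = escaped = False
    for i, ch in enumerate(text):
        if in_string:  # braces inside JSON strings do not count
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
            continue
        if ch == '"' and depth > 0:
            in_string = True
        elif ch == "{":
            if depth == 0:
                start = i
            depth += 1
        elif ch == "}" and depth > 0:
            depth -= 1
            if depth == 0:
                roots.append(text[start : i + 1])
    return roots


def extract_payload(completion: dict[str, Any]) -> tuple[str, str]:
    """Return (extract_method, raw_json_text) following the priority order."""
    tool_calls = completion.get("tool_calls") or []
    if tool_calls:
        return "tool_call", tool_calls[0]["arguments"]
    if completion.get("parsed") is not None:
        return "provider_parsed", json.dumps(completion["parsed"])
    text = completion.get("text", "")
    fenced = FENCED_JSON.findall(text)
    if len(fenced) > 1:
        raise ValueError("multiple fenced JSON blocks")
    if fenced:
        return "markdown_fence", fenced[0].strip()
    roots = find_json_roots(text)
    if len(roots) != 1:
        raise ValueError(f"expected exactly one JSON root, found {len(roots)}")
    return "balanced_braces", roots[0]
```

The returned method string is what would be logged per span, so dashboards can track how often the pipeline falls back to fences or brace scanning.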
Normalize
Coerce only when schema explicitly allows: trim strings, map legacy aliases
(account_id → Account__c) via registry migration
rules, drop unknown keys if policy permits. Do not silently fill
required fields with null — that hides model failure until the API 400s.
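A sketch of a normalize stage that follows these rules. ALIASES and ALLOWED_KEYS are hypothetical stand-ins for migration rules and field lists that would be loaded from the registry.

```python
from typing import Any

# Hypothetical registry migration rules: legacy key -> current key.
ALIASES = {"account_id": "Account__c"}
ALLOWED_KEYS = {"first_name", "last_name", "Account__c"}


def normalize(payload: dict[str, Any], drop_unknown: bool = False) -> dict[str, Any]:
    """Apply only explicitly allowed coercions; never invent values."""
    out: dict[str, Any] = {}
    for key, value in payload.items():
        key = ALIASES.get(key, key)  # map legacy aliases
        if key not in ALLOWED_KEYS and drop_unknown:
            continue  # policy permits dropping extras
        if key in out:
            raise ValueError(f"duplicate key after alias mapping: {key}")
        out[key] = value.strip() if isinstance(value, str) else value
    # Missing required fields are deliberately left missing so that
    # schema validation reports them instead of the vendor API.
    return out
```

Unknown keys that policy does not allow dropping are passed through untouched, so the validator rejects them with a precise path rather than the normalizer hiding them.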
Repair loop
On schema_validate failure, inject validator errors verbatim
(JSON Pointer paths help). Cap at 2–3 attempts per step;
charge repair tokens against the context budget.
For tool argument errors, prefer re-issuing the tool schema snippet from the
registry over paraphrasing in natural language. If repair attempts are
exhausted, return a structured observation to the agent loop per tool error
handling patterns; do not retry at the same temperature hoping for luck.
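A capped repair loop might be structured like this. The validate function is a deliberately tiny stand-in for AJV or Pydantic, and generate is a hypothetical callable wrapping the model call; both are assumptions made for illustration.

```python
import json
from typing import Any, Callable

MAX_REPAIRS = 2  # repairs allowed after the first attempt


def validate(payload: Any, schema: dict[str, Any]) -> list[str]:
    """Check object type, required keys, extras, and enums only."""
    if not isinstance(payload, dict):
        return ["(root): expected a JSON object"]
    props = schema.get("properties", {})
    errors = [
        f"/{key}: required property is missing"
        for key in schema.get("required", [])
        if key not in payload
    ]
    for key, value in payload.items():
        if key not in props:
            errors.append(f"/{key}: additional property not allowed")
        elif "enum" in props[key] and value not in props[key]["enum"]:
            errors.append(f"/{key}: must be one of {props[key]['enum']}")
    return errors


def run_step(generate: Callable[[list[str]], str], schema: dict[str, Any]) -> dict[str, Any]:
    """Call the model, validate, and feed pointer-style errors back on failure."""
    errors: list[str] = []
    attempts = 1 + MAX_REPAIRS
    for attempt in range(1, attempts + 1):
        raw = generate(errors)  # previous errors are passed back verbatim
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError as exc:
            errors = [f"(root): invalid JSON at char {exc.pos}: {exc.msg}"]
            continue
        errors = validate(payload, schema)
        if not errors:
            return {"status": "ok", "payload": payload, "attempts": attempt}
    # Exhausted: hand a structured observation back to the agent loop.
    return {"status": "schema_error", "errors": errors, "attempts": attempts}
```

The exhausted branch returns data rather than raising, so the agent loop can treat it like any other tool error observation.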
Multi-step schema chaining
Complex workflows pass typed objects between steps: classifier → router → executor. Each hop should have its own registry entry and smaller schema. Anti-patterns:
- One mega-schema with 40 optional fields — the model omits required nested objects randomly.
- Re-serializing tool results as unstructured prose between steps — re-parse errors compound.
- Subagents returning free-text summaries to a parent that expects JSON — force structured handoff packages.
Harbor split create_contact into plan_contact
(strict JSON: fields to collect) and commit_contact (tool call
only after plan validates). Plan steps never touch Salesforce; commit steps
never invent fields not in the plan object. Failure isolation improved because
repair prompts referenced half-sized schemas.
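The plan and commit split can be sketched as two small checks. The field list and function names are hypothetical; the point is that the commit arguments are compared against the validated plan, not against a single large schema.

```python
from typing import Any

# Hypothetical set of fields a plan is allowed to request.
PLAN_FIELDS_ALLOWED = {"first_name", "last_name", "Account__c", "email"}


def validate_plan(plan: dict[str, Any]) -> list[str]:
    """plan_contact output: a list of field names to collect."""
    fields = plan.get("fields")
    if not isinstance(fields, list) or not fields:
        return ["/fields: must be a non-empty array"]
    return [
        f"/fields/{i}: unknown field {name!r}"
        for i, name in enumerate(fields)
        if name not in PLAN_FIELDS_ALLOWED
    ]


def check_commit(plan: dict[str, Any], commit_args: dict[str, Any]) -> list[str]:
    """commit_contact arguments must match the validated plan exactly."""
    planned = set(plan["fields"])
    provided = set(commit_args)
    extra = sorted(provided - planned)
    missing = sorted(planned - provided)
    return [f"/{key}: not in plan" for key in extra] + [
        f"/{key}: planned but missing" for key in missing
    ]
```

Because each check covers one small object, a repair prompt only needs to carry the errors and schema for the step that failed.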
Harbor CRM refactor
Before: schemas lived in a Notion doc, copied into prompts weekly. After:
- Git-backed registry with CI tests: every schema must round-trip through provider adapters and validate golden completions.
- Run metadata: schema_bundle_hash on each trace for replay.
- Shadow mode on schema bumps: validate production completions against vNext without executing tools.
- Semantic policy layer after schema gate (account ID must exist in session cache), delegated to guardrails, not duplicated in prompts.
- Alert when the extract_method=markdown_fence rate rises; this signals model drift off tool mode.
Hard failures (HTTP 4xx from tools due to shape mismatch) went from 42% to 3.1% of commit steps. Remaining failures were true business rule rejections caught by policy, not parse noise.
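Two of the refactor items lend themselves to a short sketch: a deterministic schema_bundle_hash and a shadow pass rate. How Harbor actually computed either is not stated, so the hashing scheme (SHA-256 over canonical JSON) and the helper names are assumptions.

```python
import hashlib
import json
from typing import Any, Callable


def schema_bundle_hash(schemas: dict[str, dict[str, Any]]) -> str:
    """Hash every schema active for a run, independent of key order."""
    canonical = json.dumps(schemas, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def shadow_pass_rate(
    completions: list[dict[str, Any]],
    validate_next: Callable[[dict[str, Any]], list[str]],
) -> float:
    """Validate recorded payloads against the next schema version.

    Nothing is executed; the result is only the share of payloads
    for which the validator returned no errors.
    """
    if not completions:
        return 0.0
    passed = sum(1 for payload in completions if not validate_next(payload))
    return passed / len(completions)
```

Storing the hash on each trace lets a replay load exactly the schemas that were live during an incident, and the pass rate gives a number to compare before flipping a version pin.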
Decision table: pipeline vs alternatives
| Approach | When it fits | Production risk |
|---|---|---|
| Centralized schema pipeline (this guide) | Multi-tool agents, vendor API integrations, regulated writes | Low when registry is CI-tested and version-pinned |
| Prompt-only “return JSON” | Prototypes, read-only demos, single-field extractions | High — markdown wrappers, key drift, no version story |
| Regex / LLM-as-parser second pass | Legacy models without structured modes | Medium — doubles latency and cost; parser can hallucinate |
| Guardrails only (no registry) | Adding policy on top of existing ad-hoc JSON | Medium — validation without a single schema source still drifts |
Pipelines complement guardrails: schema proves shape; policy proves truth. They also complement reflection loops: reflection critiques prose quality, while schemas gate executable payloads.
Common pitfalls
- Duplicating schema in prompt and tool definition: they will diverge; generate prompt snippets from the registry.
- Over-strict schemas (additionalProperties: false) on evolving APIs: breaks forward compatibility; version instead of forbidding all extras.
- Repair loops without error pointers: “invalid JSON” wastes tokens; pass AJV / Pydantic paths.
- Coercing invalid enums to defaults: masks hallucination; reject and repair.
- Logging full payloads with PII: redact before trace export; schema paths are enough for debug.
- Skipping extraction metrics: rising fence-extraction rate means tool mode regressed before users notice.
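The extraction-metrics pitfall is cheap to avoid. Below is a sketch of a per-tool counter that flags a rising markdown-fence rate; the class name, the threshold, and the minimum sample size are hypothetical and would need tuning against real traffic.

```python
from collections import Counter, defaultdict


class ExtractionMetrics:
    """Counts extraction methods per tool and flags fence-rate spikes."""

    def __init__(self, max_fence_rate: float = 0.05, min_samples: int = 20) -> None:
        self.max_fence_rate = max_fence_rate
        self.min_samples = min_samples  # avoid alerting on tiny samples
        self._counts: dict[str, Counter] = defaultdict(Counter)

    def record(self, tool_id: str, extract_method: str) -> None:
        self._counts[tool_id][extract_method] += 1

    def fence_rate(self, tool_id: str) -> float:
        counts = self._counts.get(tool_id)
        if not counts:
            return 0.0
        return counts["markdown_fence"] / sum(counts.values())

    def tools_to_alert(self) -> list[str]:
        return sorted(
            tool
            for tool, counts in self._counts.items()
            if sum(counts.values()) >= self.min_samples
            and self.fence_rate(tool) > self.max_fence_rate
        )
```

In practice the same counts would be exported to whatever metrics system already backs the quality dashboards; the in-process counter only illustrates the calculation.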
Production checklist
- Define canonical JSON Schema per tool/step with versioned IDs in a git-backed registry.
- Generate provider tool definitions and prompt fragments from the registry in CI.
- Pin external API version on each run; document bump procedure with shadow validation.
- Select delivery mode per step: native structured > tool envelope > prompt-only.
- Implement extract → parse → schema_validate with logged extraction method.
- Cap repair attempts; feed validator errors with JSON Pointer paths.
- Wire schema failures into structured tool-error observations for the agent loop.
- Attach schema_bundle_hash to traces for replay and incident triage.
- Alert on parse-failure rate and markdown-fence extraction spikes per tool.
- Hand off validated objects to semantic policy gates before irreversible commits.
Key takeaways
- Agent structured output needs a pipeline, not a prompt footnote — registry, delivery mode, extract, validate, and repair are separate stages.
- One canonical schema per tool version feeds prompts, provider APIs, parsers, and guardrails; duplication is the main source of production drift.
- Native JSON schema modes reduce repair cost, but tool envelopes and post-hoc validation still need the same registry and normalize rules.
- Harbor CRM cut hard tool failures from 42% to 3.1% by version-pinning vendor APIs and CI-testing schema adapters.
- Split large actions into smaller typed steps; mega-schemas and prose handoffs between subagents compound parse failure rates.
Related reading
- LLM agent guardrails and output validation explained — semantic policy after schema gates
- LLM function calling explained — tool envelopes and parallel invocation
- LLM structured outputs explained — provider-native JSON modes and constraints
- LLM tool schema design explained — authoring schemas models can follow