Guide
LLM output parsing and validation explained
Harbor Legal's contract review beta asked GPT-4 to return vendor invoices as JSON:
line items, tax IDs, payment terms. The prompt said “respond with valid JSON
only.” Engineering wired JSON.parse(response) on the raw completion
string. Eleven percent of production requests threw SyntaxError —
trailing commas, markdown fences, a preamble (“Here is the extracted data:”),
or a hallucinated € inside a numeric field. Retries doubled cost; finance
analysts re-keyed failed invoices by hand. The model was often semantically
right; the pipeline treated formatting noise as total failure.
Output parsing and validation is the boundary between probabilistic text generation and deterministic application code. Native structured-output modes, grammar constraints, and tool schemas reduce but do not eliminate parse risk — you still need fence stripping, schema gates, semantic checks, and repair loops. This guide covers output format taxonomy, multi-layer validation, streaming partial JSON, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist. Pair with structured outputs, function calling, and retry and fallback for the full production stack.
Output format taxonomy
Teams often assume “JSON mode” means parse-once-and-done. In practice, models emit several overlapping shapes. Classify what you expect before building parsers:
| Format | Typical use | Parse risk |
|---|---|---|
| Raw JSON object/array | Extraction, classification labels | Trailing commas, unquoted keys, comments |
| Fenced JSON block | Chat completions with markdown habit | Multiple fences, language tag typos, prose outside fence |
| Tool / function call payload | Agents, API integrations | Wrong tool chosen, partial arguments mid-stream |
| XML / YAML / CSV | Legacy integrations, tabular exports | Entity escaping, mixed delimiters |
| Key-value lines | Simple slot filling | Extra colons in values, multiline fields |
| Structured prose + embedded JSON | Reasoning models, chain-of-thought | JSON buried after explanation paragraphs |
Each format needs a dedicated extractor. A single regex for
```json ... ``` fails when the model omits the tag or nests two blocks.
Prefer a small state machine: scan for first { or [, track
brace depth, slice the balanced substring, then parse.
Validation layers: syntax, schema, semantics
Parsing succeeds when JSON.parse returns an object. Production
validation stacks three gates:
Layer 1 — syntactic parse
Strip BOM, normalize smart quotes, remove markdown fences, extract balanced JSON substring. On failure, try a lenient repair library (trailing comma removal, single-quote to double-quote) before escalating. Log the raw completion on every syntax failure — patterns cluster by model version and prompt drift.
Layer 2 — schema validation
Apply JSON Schema, Zod, or Pydantic models. Enforce types, required fields, enums,
string formats (ISO dates, ISO 4217 currency codes), and numeric ranges. Reject
coercions you did not explicitly allow: a string "12.5%" is not
a float without a normalizer. Use additionalProperties: false in strict
pipelines so surprise keys surface in QA instead of silently dropping.
Layer 3 — semantic and business rules
Schema validity does not imply correctness. Cross-check: line items sum to stated total within tolerance; tax ID checksum; dates not in the future for historical docs; enum values map to your database IDs. These rules belong in application code, not the prompt — models forget constraints under load.
// Illustrative pipeline (TypeScript)
const raw = completion.choices[0].message.content;
const slice = extractBalancedJson(raw);
const parsed = tryParse(slice) ?? tryRepair(slice);
const dto = InvoiceSchema.parse(parsed); // Zod / JSON Schema
assertBusinessRules(dto, sourceDocument); // totals, dates, IDs
Repair loops without burning budget
When validation fails, blind full retries on the same prompt waste tokens. A tiered repair strategy costs less and recovers more:
- Deterministic repair — fix fences, trailing commas, strip preamble with heuristics; re-validate. Often recovers 60–80% of syntax errors.
- Constrained re-ask — send the invalid output back:
“Your JSON failed schema:
line_items[2].amountmust be number. Return corrected JSON only.” Use a smaller/cheaper model for repair passes. - Field-level fallback — if one array element fails, drop or flag that row instead of rejecting the entire document.
- Human queue — after N repair attempts, route to HITL with raw text, parse error, and source PDF attached.
Cap repair attempts (typically 1–2) and total latency budget. Instrument
parse_ok, repair_attempted, repair_ok, and
hitl_routed per model and prompt version so regressions show up before
users file tickets.
Streaming and partial output
Streaming UX wants tokens on screen; downstream systems want complete objects. Strategies:
- Buffer-until-valid — accumulate tokens, attempt parse only when brace balance closes or stream ends. Simplest; no partial UI of structured fields.
- Incremental JSON parsers — libraries that emit partial
objects as keys complete (e.g.
partial-json). Useful for progressive form fill; watch for revoked keys if the model self-corrects mid-stream. - Split channels — stream natural-language explanation on
SSE channel A; deliver validated JSON on a final
data:event or separate non-streaming call. Harbor Legal adopted this: analysts read rationale live; ERP ingests only schema-passing payloads.
Never call JSON.parse on every token delta — it throws and floods
logs. Debounce parse attempts or gate on depth balance.
Harbor Legal contract extraction refactor
Problem: JSON.parse on raw completions; no fence stripping; amounts
sometimes returned as localized strings; 11% hard failures; retries duplicated full
8k-token context.
Changes shipped:
extractBalancedJson()preprocessor shared across all Legal prompts.- Pydantic
InvoiceExtractionmodel with strict types; currency fields as integer minor units (cents) only. - Post-schema rule:
sum(line_items.amount) == total_amountwithin 1 cent. - Repair pass using a small model with only the invalid JSON + error message (avg 400 tokens vs 8k full retry).
- Provider native JSON Schema mode for the primary extraction call; parser kept as defense in depth when vendors drift.
- Dead-letter queue in Harbor ops with raw completion, schema diff, and PDF page thumbnails for analyst correction.
- Dashboards: parse success rate by model pin, prompt hash, and document locale.
Outcome: hard parse failures fell from 11% to 0.4%; median repair cost dropped 72%; analyst re-keying hours fell from ~14/week to under 2; false schema rejects (valid invoices flagged) tracked separately at 1.1% and fed prompt versioning evals.
Technique decision table
| Approach | Best when | Weak when |
|---|---|---|
| Native structured output / JSON Schema mode | Supported models, stable schemas, greenfield APIs | Vendor gaps, nested optional arrays, rapid schema churn |
| Grammar-constrained decoding | Self-hosted inference, must guarantee syntax | Ops complexity, latency, limited grammar expressiveness |
| Function / tool calling | Agents with multiple actions, router picks tool | Single static schema; tool selection errors |
| Prompt + post-hoc parse/repair | Multi-provider portability, legacy models | Highest tail failure rate without repair tiers |
| Regex / line parsers | Fixed templates, 3–5 flat fields | Nested structures, multilingual documents |
| Blind JSON.parse only | Never in production | Always — demos only |
Common pitfalls
- Trusting the fence regex — models add prose, multiple blocks, or wrong language tags.
- Implicit type coercion —
"1,234.56"and"12.5%"pass string type then break math. - Schema drift without versioning — new optional field in prompt but not in Zod model silently drops data.
- Full-context retry on parse fail — doubles cost; use targeted repair prompts.
- No raw logging — cannot debug whether failure is syntax, schema, or semantics.
- Parsing mid-stream — throws on every chunk; use balance gating.
- Mixing explanation and data in one channel — split streams or require a dedicated
resultkey. - Ignoring unicode and locale — narrow no-break spaces, fullwidth digits, and decimal commas break naive parsers.
Production checklist
- Document expected output format per endpoint (JSON schema ID, tool name).
- Shared
extractBalancedJson(or equivalent) preprocessor in codebase. - Schema validation library with strict types and
additionalPropertiespolicy. - Business-rule validation layer separate from schema.
- Tiered repair: deterministic fix, then small-model re-ask, then HITL queue.
- Cap repair attempts and wall-clock timeout per request.
- Log
raw_completion_hash, parse stage, schema errors (no PII in logs). - Metrics: parse_ok, repair_ok, schema_reject, semantic_reject, hitl_routed.
- Streaming policy: buffer-until-valid or incremental parser with tests.
- Golden-file eval set with intentionally dirty outputs (fences, commas, preamble).
- Prompt version pin tied to schema version in observability traces.
- Fallback path when provider structured mode unavailable (feature flag).
Key takeaways
- Structured output modes reduce but do not eliminate parse risk — keep a defense-in-depth parser.
- Validate in layers: syntax, schema, then business semantics.
- Repair loops should be cheap and targeted, not full prompt retries.
- Log raw failures (safely) so prompt and model regressions are visible.
- Streaming and structured data need an explicit contract — do not parse every token delta.
Related reading
- LLM structured outputs explained — JSON Schema modes and provider APIs
- LLM grammar-constrained decoding explained — syntax guarantees at generation time
- LLM function calling explained — tool argument payloads and routing
- LLM observability explained — tracing parse failures in production