Guide

LLM output parsing and validation explained

Harbor Legal's contract review beta asked GPT-4 to return vendor invoices as JSON: line items, tax IDs, payment terms. The prompt said “respond with valid JSON only.” Engineering wired JSON.parse(response) on the raw completion string. Eleven percent of production requests threw SyntaxError — trailing commas, markdown fences, a preamble (“Here is the extracted data:”), or a hallucinated € inside a numeric field. Retries doubled cost; finance analysts re-keyed failed invoices by hand. The model was often semantically right; the pipeline treated formatting noise as total failure.

Output parsing and validation is the boundary between probabilistic text generation and deterministic application code. Native structured-output modes, grammar constraints, and tool schemas reduce but do not eliminate parse risk — you still need fence stripping, schema gates, semantic checks, and repair loops. This guide covers output format taxonomy, multi-layer validation, streaming partial JSON, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist. Pair with structured outputs, function calling, and retry and fallback for the full production stack.

Output format taxonomy

Teams often assume “JSON mode” means parse-once-and-done. In practice, models emit several overlapping shapes. Classify what you expect before building parsers:

Format	Typical use	Parse risk
Raw JSON object/array	Extraction, classification labels	Trailing commas, unquoted keys, comments
Fenced JSON block	Chat completions with markdown habit	Multiple fences, language tag typos, prose outside fence
Tool / function call payload	Agents, API integrations	Wrong tool chosen, partial arguments mid-stream
XML / YAML / CSV	Legacy integrations, tabular exports	Entity escaping, mixed delimiters
Key-value lines	Simple slot filling	Extra colons in values, multiline fields
Structured prose + embedded JSON	Reasoning models, chain-of-thought	JSON buried after explanation paragraphs

Each format needs a dedicated extractor. A single regex for ```json ... ``` fails when the model omits the tag or nests two blocks. Prefer a small state machine: scan for first { or [, track brace depth, slice the balanced substring, then parse.

Validation layers: syntax, schema, semantics

Parsing succeeds when JSON.parse returns an object. Production validation stacks three gates:

Layer 1 — syntactic parse

Strip BOM, normalize smart quotes, remove markdown fences, extract balanced JSON substring. On failure, try a lenient repair library (trailing comma removal, single-quote to double-quote) before escalating. Log the raw completion on every syntax failure — patterns cluster by model version and prompt drift.

Layer 2 — schema validation

Apply JSON Schema, Zod, or Pydantic models. Enforce types, required fields, enums, string formats (ISO dates, ISO 4217 currency codes), and numeric ranges. Reject coercions you did not explicitly allow: a string "12.5%" is not a float without a normalizer. Use additionalProperties: false in strict pipelines so surprise keys surface in QA instead of silently dropping.

Layer 3 — semantic and business rules

Schema validity does not imply correctness. Cross-check: line items sum to stated total within tolerance; tax ID checksum; dates not in the future for historical docs; enum values map to your database IDs. These rules belong in application code, not the prompt — models forget constraints under load.

// Illustrative pipeline (TypeScript)
const raw = completion.choices[0].message.content;
const slice = extractBalancedJson(raw);
const parsed = tryParse(slice) ?? tryRepair(slice);
const dto = InvoiceSchema.parse(parsed);      // Zod / JSON Schema
assertBusinessRules(dto, sourceDocument);     // totals, dates, IDs

Repair loops without burning budget

When validation fails, blind full retries on the same prompt waste tokens. A tiered repair strategy costs less and recovers more:

Deterministic repair — fix fences, trailing commas, strip preamble with heuristics; re-validate. Often recovers 60–80% of syntax errors.
Constrained re-ask — send the invalid output back: “Your JSON failed schema: line_items[2].amount must be number. Return corrected JSON only.” Use a smaller/cheaper model for repair passes.
Field-level fallback — if one array element fails, drop or flag that row instead of rejecting the entire document.
Human queue — after N repair attempts, route to HITL with raw text, parse error, and source PDF attached.

Cap repair attempts (typically 1–2) and total latency budget. Instrument parse_ok, repair_attempted, repair_ok, and hitl_routed per model and prompt version so regressions show up before users file tickets.

Streaming and partial output

Streaming UX wants tokens on screen; downstream systems want complete objects. Strategies:

Buffer-until-valid — accumulate tokens, attempt parse only when brace balance closes or stream ends. Simplest; no partial UI of structured fields.
Incremental JSON parsers — libraries that emit partial objects as keys complete (e.g. partial-json). Useful for progressive form fill; watch for revoked keys if the model self-corrects mid-stream.
Split channels — stream natural-language explanation on SSE channel A; deliver validated JSON on a final data: event or separate non-streaming call. Harbor Legal adopted this: analysts read rationale live; ERP ingests only schema-passing payloads.

Never call JSON.parse on every token delta — it throws and floods logs. Debounce parse attempts or gate on depth balance.

Harbor Legal contract extraction refactor

Problem: JSON.parse on raw completions; no fence stripping; amounts sometimes returned as localized strings; 11% hard failures; retries duplicated full 8k-token context.

Changes shipped:

extractBalancedJson() preprocessor shared across all Legal prompts.
Pydantic InvoiceExtraction model with strict types; currency fields as integer minor units (cents) only.
Post-schema rule: sum(line_items.amount) == total_amount within 1 cent.
Repair pass using a small model with only the invalid JSON + error message (avg 400 tokens vs 8k full retry).
Provider native JSON Schema mode for the primary extraction call; parser kept as defense in depth when vendors drift.
Dead-letter queue in Harbor ops with raw completion, schema diff, and PDF page thumbnails for analyst correction.
Dashboards: parse success rate by model pin, prompt hash, and document locale.

Outcome: hard parse failures fell from 11% to 0.4%; median repair cost dropped 72%; analyst re-keying hours fell from ~14/week to under 2; false schema rejects (valid invoices flagged) tracked separately at 1.1% and fed prompt versioning evals.

Technique decision table

Approach	Best when	Weak when
Native structured output / JSON Schema mode	Supported models, stable schemas, greenfield APIs	Vendor gaps, nested optional arrays, rapid schema churn
Grammar-constrained decoding	Self-hosted inference, must guarantee syntax	Ops complexity, latency, limited grammar expressiveness
Function / tool calling	Agents with multiple actions, router picks tool	Single static schema; tool selection errors
Prompt + post-hoc parse/repair	Multi-provider portability, legacy models	Highest tail failure rate without repair tiers
Regex / line parsers	Fixed templates, 3–5 flat fields	Nested structures, multilingual documents
Blind JSON.parse only	Never in production	Always — demos only

Common pitfalls

Trusting the fence regex — models add prose, multiple blocks, or wrong language tags.
Implicit type coercion — "1,234.56" and "12.5%" pass string type then break math.
Schema drift without versioning — new optional field in prompt but not in Zod model silently drops data.
Full-context retry on parse fail — doubles cost; use targeted repair prompts.
No raw logging — cannot debug whether failure is syntax, schema, or semantics.
Parsing mid-stream — throws on every chunk; use balance gating.
Mixing explanation and data in one channel — split streams or require a dedicated result key.
Ignoring unicode and locale — narrow no-break spaces, fullwidth digits, and decimal commas break naive parsers.

Production checklist

Document expected output format per endpoint (JSON schema ID, tool name).
Shared extractBalancedJson (or equivalent) preprocessor in codebase.
Schema validation library with strict types and additionalProperties policy.
Business-rule validation layer separate from schema.
Tiered repair: deterministic fix, then small-model re-ask, then HITL queue.
Cap repair attempts and wall-clock timeout per request.
Log raw_completion_hash, parse stage, schema errors (no PII in logs).
Metrics: parse_ok, repair_ok, schema_reject, semantic_reject, hitl_routed.
Streaming policy: buffer-until-valid or incremental parser with tests.
Golden-file eval set with intentionally dirty outputs (fences, commas, preamble).
Prompt version pin tied to schema version in observability traces.
Fallback path when provider structured mode unavailable (feature flag).

Key takeaways

Structured output modes reduce but do not eliminate parse risk — keep a defense-in-depth parser.
Validate in layers: syntax, schema, then business semantics.
Repair loops should be cheap and targeted, not full prompt retries.
Log raw failures (safely) so prompt and model regressions are visible.
Streaming and structured data need an explicit contract — do not parse every token delta.