Guide

LLM agent guardrails and output validation explained

Harbor Benefits launched an enrollment agent to help employees pick health plans during open season. The ReAct loop called enroll_member with JSON arguments parsed from model output. QA reviewers loved the conversational tone: empathetic, clear, on-brand. What nobody tested systematically was whether the structured fields matched carrier rate tables. In the first production week, 19% of completed enrollments carried at least one incorrect deductible, out-of-pocket maximum, or premium amount. HR fielded 340 correction tickets, and two members had in-network claims processed with the wrong cost-share until their plans were manually amended. Root cause: the runtime trusted the model’s JSON because it “looked valid.” Keys were present and types were roughly right, but the values were hallucinated from stale training priors. Prompts said “use only the rate sheet tool,” yet nothing enforced that linkage before the write committed.

Output validation guardrails sit between model generation and irreversible side effects. They combine syntactic checks (JSON Schema, required fields), semantic rules (deductible must match plan ID from the rate API), content safety filters, and escalation when repair fails. This guide covers the validation stack, validate-repair loops, speculative vs committed execution, integration with permission scoping and structured outputs, the Harbor Benefits refactor, a technique decision table, pitfalls, and a checklist.

What agent guardrails validate

A tool-using agent emits two output classes: natural language shown to users and machine-readable payloads that drive tools, APIs, and databases. Guardrails must cover both, but production incidents cluster on the second — because code executes structured output even when prose sounds confident.

Typical validation layers, from cheapest to most expensive:

  1. Syntax — valid JSON, UTF-8, size caps, no trailing garbage after the payload.
  2. Schema — types, enums, required keys, numeric ranges per tool schema design.
  3. Semantic policy — business rules the schema cannot express: “premium must equal rate_table[plan_id].employee_contribution,” “refund cannot exceed original charge,” “diagnosis code must exist in ICD-10 subset.”
  4. Provenance — field values must cite a tool result ID or database row the agent fetched this session.
  5. Content safety — PII leakage, toxic content, jailbreak patterns in user-visible text.
  6. Downstream simulation — dry-run the mutation in a sandbox and diff expected state.

Run layers in order. Fail fast on syntax before invoking a heavy policy engine. Log which layer rejected output — that metric drives prompt and schema iteration faster than aggregate “agent failed” counters.
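The ordering and fail-fast behavior above can be sketched as a small runner. This is a minimal illustration, not a reference implementation: the two layers, the 4096-byte cap, and the names check_syntax, check_schema, and run_layers are all assumptions made for the example.

```python
import json
from typing import Callable, Optional

# Each layer returns None on success or a short error string on failure.
Layer = Callable[[str], Optional[str]]

def check_syntax(raw: str) -> Optional[str]:
    if len(raw.encode("utf-8")) > 4096:  # illustrative size cap
        return "payload too large"
    try:
        json.loads(raw)  # also rejects trailing garbage after the payload
    except json.JSONDecodeError as exc:
        return f"invalid JSON: {exc.msg}"
    return None

def check_schema(raw: str) -> Optional[str]:
    payload = json.loads(raw)
    if not isinstance(payload, dict) or "plan_id" not in payload:
        return "missing required key: plan_id"
    return None

# Cheapest layer first; heavier policy layers would be appended after these.
LAYERS = [("syntax", check_syntax), ("schema", check_schema)]

def run_layers(raw: str, layers=LAYERS) -> dict:
    """Run layers in order and stop at the first rejection."""
    for name, layer in layers:
        error = layer(raw)
        if error is not None:
            # The layer name is what you would log as the rejection metric.
            return {"ok": False, "layer": name, "error": error}
    return {"ok": True, "layer": None, "error": None}
```

Because the runner returns the name of the rejecting layer, the caller can emit it as a metric without parsing error strings.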

Validation pipeline architecture

Treat validation as a state machine on every agent step, not a one-off post-processor:

generate → parse → schema_validate → policy_validate → [repair | commit | escalate]

Parse and normalize

Extract JSON from model output (raw JSON mode, fenced blocks, or function-call envelopes). Normalize: strip whitespace, coerce string numbers only when schema allows, reject duplicate keys. If parsing fails, return a structured error observation to the tool error channel — never silently retry with regex hacks on production paths.
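A rough sketch of the parse step using only the standard library follows. The ParseError class and the parse_payload name are hypothetical, and the fence pattern handles only a single fenced block wrapping the whole output; function-call envelopes would need provider-specific handling.

```python
import json
import re

class ParseError(ValueError):
    """Carries a message suitable for the tool error channel."""

TICKS = "`" * 3  # fence marker, built this way to keep the example readable
FENCE = re.compile(rf"^{TICKS}(?:json)?\s*\n(.*?)\n{TICKS}\s*$", re.DOTALL)

def _reject_duplicates(pairs):
    # json.loads silently keeps the last duplicate key; fail closed instead.
    result = {}
    for key, value in pairs:
        if key in result:
            raise ParseError(f"duplicate key: {key}")
        result[key] = value
    return result

def parse_payload(model_output: str) -> dict:
    text = model_output.strip()
    match = FENCE.match(text)
    if match:
        text = match.group(1).strip()
    try:
        payload = json.loads(text, object_pairs_hook=_reject_duplicates)
    except json.JSONDecodeError as exc:
        # Covers malformed JSON and trailing garbage ("Extra data").
        raise ParseError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(payload, dict):
        raise ParseError("top-level value must be an object")
    return payload
```

The caller would catch ParseError and return its message as a structured observation rather than retrying silently.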

Schema gate

Validate against the same JSON Schema exposed to the model (OpenAI structured outputs, Anthropic tool schemas, or local Pydantic/Zod models). Typical mismatches are a wrong enum value, a missing nested object, or a float where an integer is required. See structured outputs explained for provider-native vs post-hoc validation trade-offs.
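To make the schema gate concrete without depending on a particular library, here is a stdlib-only stand-in for what Pydantic or a JSON Schema validator would do. The field list for enroll_member, including the tier enum, is invented for illustration and is not Harbor's actual schema.

```python
# Hypothetical field spec: name -> (expected type, allowed values or None).
ENROLL_MEMBER_FIELDS = {
    "member_id": (str, None),
    "plan_id": (str, None),
    "tier": (str, {"employee", "employee_spouse", "family"}),
    "deductible": (int, None),
}

def schema_errors(payload: dict, fields=ENROLL_MEMBER_FIELDS) -> list:
    """Return a list of human-readable schema errors (empty means valid)."""
    errors = []
    for name, (expected, allowed) in fields.items():
        if name not in payload:
            errors.append(f"{name}: required field missing")
            continue
        value = payload[name]
        # bool is a subclass of int in Python, so exclude it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            errors.append(
                f"{name}: expected {expected.__name__}, got {type(value).__name__}"
            )
        elif allowed is not None and value not in allowed:
            errors.append(f"{name}: {value!r} not in {sorted(allowed)}")
    # Reject unknown keys, similar in spirit to additionalProperties: false.
    for name in sorted(payload.keys() - fields.keys()):
        errors.append(f"{name}: unexpected field")
    return errors
```

Note that the float 1500.0 is rejected rather than coerced; whether to coerce is a policy decision the schema should state explicitly.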

Policy gate

Schema proves shape; policy proves truth. Implement as pure functions over (payload, session_context, tool_results_cache). Example Harbor rule:

assert payload.deductible == cache.rate_sheet[payload.plan_id].deductible

Policy functions should be unit-tested without the LLM — same discipline as payment validation in any backend service.
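As a sketch of that discipline, the rule above can be written as a pure function that returns rule IDs instead of raising, so it is easy to unit-test and to log. The RateRow shape, the oop_max and premium field names, and the PLAN_NOT_IN_CACHE rule ID are assumptions for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class RateRow:
    """One cached row from the rate sheet tool (hypothetical shape)."""
    deductible: int
    oop_max: int
    employee_contribution: int

def policy_violations(payload: dict, rate_sheet: dict) -> list:
    """Compare every financial field to the cached rate sheet row."""
    row = rate_sheet.get(payload["plan_id"])
    if row is None:
        # The agent never fetched this plan, so nothing can be verified.
        return ["PLAN_NOT_IN_CACHE"]
    expected = {
        "deductible": row.deductible,
        "oop_max": row.oop_max,
        "premium": row.employee_contribution,
    }
    return [
        f"RATE_MISMATCH:{field}"
        for field, want in expected.items()
        if payload.get(field) != want
    ]

# Fixture-style usage, no LLM involved:
sheet = {"PPO-500": RateRow(deductible=500, oop_max=3000, employee_contribution=120)}
payload = {"plan_id": "PPO-500", "deductible": 1500, "oop_max": 3000, "premium": 120}
print(policy_violations(payload, sheet))
```

Returning a list rather than asserting keeps the check active under optimized Python runs and lets one pass report every mismatched field.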

Repair loop

On schema failure, feed the validator error message back to the model along with the invalid payload (max 2–3 attempts). On policy failure, prefer re-fetching source data via a tool over asking the model to guess the correct number. Cap total repair tokens per step in your context budget policy to prevent infinite fix loops.
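A capped repair loop might look like the following sketch. The model call is abstracted as a callable named ask_model_to_fix, which is a placeholder and not any provider's API; token budgeting is left out for brevity.

```python
from typing import Callable

def validate_with_repair(
    payload: dict,
    validate: Callable[[dict], list],
    ask_model_to_fix: Callable[[dict, list], dict],
    max_attempts: int = 2,
) -> dict:
    """Re-prompt with validator errors until valid or the cap is reached."""
    attempts = 0
    errors = validate(payload)
    while errors and attempts < max_attempts:
        attempts += 1
        # The model sees both the invalid payload and the exact errors.
        payload = ask_model_to_fix(payload, errors)
        errors = validate(payload)
    return {
        "ok": not errors,
        "payload": payload,
        "errors": errors,
        "repair_attempts": attempts,
    }
```

When ok is false after the cap, the caller should block the side effect instead of committing the last candidate.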

Commit vs escalate

If repair attempts are exhausted, block the side effect and escalate: human queue, safer fallback response, or read-only mode. Never “best effort commit” on financial, medical, or infra mutations.

Guardrails vs permissions vs sandboxing

Control | Question it answers | Example
Permission scoping | May the agent call this tool at all? | enroll_member not visible to read-only support bot
Approval gates | Must a human approve before execution? | Tier-2 gate on any payroll mutation
Output validation | Is this specific payload correct and safe? | Deductible matches rate sheet for plan_id
Sandbox execution | What happens if we dry-run first? | Staging enrollment API with fake member IDs

All four stack. Harbor’s incident had permissions (only HR role could enroll) but no output validation — authorized agents still shipped bad data. Pair validation with sandbox execution for new tools: validate in shadow mode until block rate stabilizes below your SLO.
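One way to read "shadow mode" is that the validator runs and records what it would have blocked, but does not stop execution until it is switched to enforcing. The sketch below assumes that interpretation; guarded_call and its parameters are illustrative names.

```python
def guarded_call(payload, validate, execute, shadow: bool, log: list) -> dict:
    """Validate before executing; in shadow mode, log but do not block."""
    errors = validate(payload)
    if errors:
        # Recorded in both modes so block rate can be measured before enforcing.
        log.append({"would_block": True, "errors": errors, "shadow": shadow})
        if not shadow:
            return {"executed": False, "result": None, "errors": errors}
    return {"executed": True, "result": execute(payload), "errors": errors}
```

Shadow mode is a rollout aid for measuring false positives; for a tool already known to ship bad writes, enforcing mode is the safer default.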

Content and injection guardrails

Structured validation does not stop prompt injection in user-visible replies or tool arguments derived from untrusted text. Layer defenses:

  • Input isolation — wrap user content in clear delimiters; never concatenate into system prompts without encoding.
  • Output classifiers — lightweight models or regex for PII patterns, instruction-leak phrases, markdown exfiltration.
  • Tool argument allowlists — URLs must match internal domain list; SQL fragments rejected by parser.
  • Canary tokens in system prompts; alert if they appear in outbound text.

See prompt injection defense for input-side patterns. Output guardrails are the last line of defense when injection steers the model toward malicious tool args.
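Two of the defenses listed above, URL allowlists and canary tokens, are simple enough to sketch with the standard library. The host names and the canary value below are made up for the example.

```python
from urllib.parse import urlparse

# Hypothetical internal hosts and canary string.
ALLOWED_HOSTS = {"benefits.internal.example", "rates.internal.example"}
CANARY = "cnry-7f3a9e"

def url_allowed(url: str, allowed_hosts=ALLOWED_HOSTS) -> bool:
    """Accept only https URLs whose exact host is on the allowlist."""
    parsed = urlparse(url)
    # Compare the parsed hostname, never a substring of the raw URL.
    return parsed.scheme == "https" and parsed.hostname in allowed_hosts

def leaks_canary(outbound_text: str, canary: str = CANARY) -> bool:
    """True if the system-prompt canary appears in user-visible text."""
    return canary in outbound_text
```

Matching on the parsed hostname rather than the raw string is what defeats lookalike URLs that embed an allowed host in the path, query, or userinfo.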

Observability and metrics

Instrument every rejection with:

  • validation_layer (syntax | schema | policy | content)
  • rule_id for policy failures
  • repair_attempt count
  • final_outcome (committed | repaired | blocked | escalated)

Chart block rate by tool name. A spike in RATE_MISMATCH policy rejections on enroll_member signals a stale cache or carrier file drift, not necessarily a bad model. Tie spans to agent tracing so on-call can deterministically replay the session.
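The fields listed above map naturally onto a small event record. In this sketch the record type and the aggregation helper are assumptions; a real system would emit these events to its metrics or tracing backend instead of aggregating in memory.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class ValidationEvent:
    tool: str
    validation_layer: Optional[str]  # syntax | schema | policy | content, or None
    rule_id: Optional[str]           # set for policy failures
    repair_attempt: int
    final_outcome: str               # committed | repaired | blocked | escalated

def block_rate_by_tool(events) -> dict:
    """Fraction of events per tool that ended blocked or escalated."""
    totals, blocked = Counter(), Counter()
    for event in events:
        totals[event.tool] += 1
        if event.final_outcome in {"blocked", "escalated"}:
            blocked[event.tool] += 1
    return {tool: blocked[tool] / totals[tool] for tool in totals}
```

Grouping the same events by rule_id instead of tool gives the per-rule view that separates data drift from model regressions.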

Red flags when guardrails are missing or weak

  • High user satisfaction, high backend correction rate — prose quality masks structured errors.
  • Prompt-only safety language (“always verify with the API”) with no runtime check.
  • Schema validates types but not cross-field constraints — start_date after end_date passes JSON Schema.
  • Repair loops without attempt caps — cost blowups and latency tail.
  • Validation only in offline eval — production tool args bypass the same code path.
  • User-visible text committed before tool validation — employee told wrong deductible before write blocked.
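The cross-field gap mentioned in the list is cheap to close in code. A minimal sketch, assuming start_date and end_date arrive as ISO 8601 date strings:

```python
from datetime import date

def date_order_errors(payload: dict) -> list:
    """Cross-field rule that type-only schema checks do not cover."""
    try:
        start = date.fromisoformat(payload["start_date"])
        end = date.fromisoformat(payload["end_date"])
    except (KeyError, TypeError, ValueError):
        return ["start_date and end_date must be ISO dates (YYYY-MM-DD)"]
    if start > end:
        return ["start_date must not be after end_date"]
    return []
```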

Harbor Benefits refactor: 19% to 0.8% bad enrollments

Harbor’s fix was runtime enforcement, not prompt tuning alone:

  • Week 1: froze auto-commit; all enrollments queued for human review while building validators.
  • Week 2: strict JSON Schema on enroll_member aligned with carrier API; added policy module comparing every financial field to get_rate_sheet tool results cached by plan_id.
  • Week 3: provenance rule requiring each monetary field to carry a source_quote_id from the rate tool response.
  • Week 4: validate-repair loop (max 2 attempts) with explicit validator errors; block-and-escalate to HR queue on failure.
  • Week 5–6: shadow mode on 20% traffic, then full rollout with weekly policy unit tests when carriers publish new tables.
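The Week 3 provenance rule could be checked along these lines. The payload shape here, where each monetary field is an object with value and source_quote_id, is an assumption for illustration; the source only states that each monetary field required a source_quote_id.

```python
# Hypothetical names for the monetary fields on enroll_member.
MONETARY_FIELDS = ("deductible", "oop_max", "premium")

def provenance_errors(payload: dict, session_quotes: dict) -> list:
    """Check each monetary field cites a quote fetched in this session.

    session_quotes maps quote ID -> rate tool response (a dict of amounts).
    """
    errors = []
    for field in MONETARY_FIELDS:
        entry = payload.get(field)
        if not isinstance(entry, dict) or "source_quote_id" not in entry:
            errors.append(f"{field}: missing source_quote_id")
            continue
        quote = session_quotes.get(entry["source_quote_id"])
        if quote is None:
            errors.append(f"{field}: cited quote was not fetched this session")
        elif quote.get(field) != entry.get("value"):
            errors.append(f"{field}: value does not match cited quote")
    return errors
```

A check like this rejects both uncited numbers and numbers that cite a real quote but disagree with it.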

Outcomes: bad enrollment rate 19% → 0.8% (remaining cases were carrier file lag, caught by provenance); median step latency +180 ms (acceptable for enrollment); repair success on first retry 74%; escalation queue 3.2% of sessions. Correction tickets fell from 340/week to under 15.

Technique decision table

Approach | Best for | Weak when
Prompt-only instructions | Low-stakes copy, internal drafts | Any financial, medical, or infra mutation
JSON Schema / structured outputs | Type-safe tool args, parser elimination | Cross-field business rules and provenance
Policy engine on tool args | Enrollment, billing, access control writes | Free-form creative text quality
Validate-repair loop | Recoverable schema typos, enum slips | Policy violations needing new data fetch
Sandbox dry-run | New tools, complex multi-table writes | Latency-sensitive chat paths
Human escalation on block | Regulated domains, low-volume high-risk | High-throughput automation without staff

Common pitfalls

  • Duplicated schemas — model sees one schema, validator uses another; drift guaranteed.
  • Validating after side effect — guardrails must run pre-commit on the hot path.
  • Over-trusting provider structured output — reduces syntax errors, not semantic truth.
  • Silent coercion — parsing "1500" as number hides model confusion; fail closed when strict types matter.
  • No block metric — teams disable validators when they “slow things down” without seeing what they caught.
  • Announcing success before validation — user hears “you’re enrolled” then rollback creates distrust.
  • Policy logic in prompts — untestable, changes every model version.

Production checklist

  • Map every write tool to a validation pipeline (parse → schema → policy).
  • Single source of truth for JSON Schema shared by model API and runtime.
  • Unit-test policy rules with fixtures — no LLM in CI for rule coverage.
  • Cap repair attempts and token budget per step.
  • Block-and-escalate path tested end-to-end monthly.
  • Provenance fields for any value that must match an external system.
  • Dashboard: block rate, layer, rule_id, repair success, escalation volume.
  • Shadow validation on new carrier/API versions before cutover.
  • User-visible messaging only after commit succeeds (or clearly mark pending).
  • Replay failing sessions with deterministic cassettes for regression.

Key takeaways

  • Guardrails validate machine-readable output before side effects — polite prose does not substitute.
  • Stack syntax, schema, policy, and provenance layers — each catches errors the previous misses.
  • Repair loops help with schema slips; policy failures need fresh data or a human, so do not retry guesses indefinitely.
  • Permissions and validation are complementary — authorized agents still need correct payloads.
  • Instrument block rate by rule — Harbor cut bad enrollments from 19% to 0.8% with runtime policy, not prompts alone.
