Guide
LLM agent guardrails and output validation explained
Harbor Benefits launched an enrollment agent to help employees pick health plans during open season. The ReAct loop called enroll_member with JSON arguments parsed from model output. QA reviewers loved the conversational tone: empathetic, clear, on-brand. What nobody tested systematically was whether the structured fields matched carrier rate tables. In the first production week, 19% of completed enrollments carried at least one incorrect deductible, out-of-pocket maximum, or premium amount. HR fielded 340 correction tickets; two members hit in-network claims with wrong cost-share until plans were manually amended. Root cause: the runtime trusted the model’s JSON because it “looked valid.” Keys were present and types were roughly right, but values were hallucinated off stale training priors. Prompts said “use only the rate sheet tool,” yet nothing enforced that linkage before the write committed.
Output validation guardrails sit between model generation and irreversible side effects. They combine syntactic checks (JSON Schema, required fields), semantic rules (deductible must match plan ID from the rate API), content safety filters, and escalation when repair fails. This guide covers the validation stack, validate-repair loops, speculative vs committed execution, integration with permission scoping and structured outputs, the Harbor Benefits refactor, a technique decision table, pitfalls, and a checklist.
What agent guardrails validate
A tool-using agent emits two output classes: natural language shown to users and machine-readable payloads that drive tools, APIs, and databases. Guardrails must cover both, but production incidents cluster on the second — because code executes structured output even when prose sounds confident.
Typical validation layers, from cheapest to most expensive:
- Syntax — valid JSON, UTF-8, size caps, no trailing garbage after the payload.
- Schema — types, enums, required keys, numeric ranges per tool schema design.
- Semantic policy — business rules the schema cannot express: “premium must equal rate_table[plan_id].employee_contribution,” “refund cannot exceed original charge,” “diagnosis code must exist in ICD-10 subset.”
- Provenance — field values must cite a tool result ID or database row the agent fetched this session.
- Content safety — PII leakage, toxic content, jailbreak patterns in user-visible text.
- Downstream simulation — dry-run the mutation in a sandbox and diff expected state.
Run layers in order. Fail fast on syntax before invoking a heavy policy engine. Log which layer rejected output — that metric drives prompt and schema iteration faster than aggregate “agent failed” counters.
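The ordering and fail-fast behavior described above can be sketched as follows. This is a minimal illustration, not a reference implementation: `MAX_BYTES`, `RATE_TABLE`, the two check functions, and the payload fields are assumptions made for the example, and only three of the six layers are shown.

```python
from __future__ import annotations

import json
from dataclasses import dataclass
from typing import Any, Callable, Optional

MAX_BYTES = 4096              # assumed size cap for a tool payload
RATE_TABLE = {"PPO-1": 1500}  # hypothetical plan_id -> deductible


@dataclass
class Rejection:
    layer: str   # which layer rejected the output (the metric worth logging)
    reason: str


# A check returns an error string, or None when the payload passes.
Check = Callable[[Any], Optional[str]]


def check_schema(payload: Any) -> Optional[str]:
    if not isinstance(payload, dict):
        return "payload must be an object"
    if not isinstance(payload.get("plan_id"), str):
        return "plan_id must be a string"
    deductible = payload.get("deductible")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(deductible, bool) or not isinstance(deductible, int):
        return "deductible must be an integer"
    return None


def check_policy(payload: Any) -> Optional[str]:
    if RATE_TABLE.get(payload["plan_id"]) != payload["deductible"]:
        return "deductible does not match rate table"
    return None


# Cheapest layers first; later layers can assume earlier ones passed.
LAYERS = [("schema", check_schema), ("policy", check_policy)]


def run_layers(raw: str, layers=LAYERS) -> tuple[Optional[dict], Optional[Rejection]]:
    # Syntax layer: size cap, then strict JSON (json.loads rejects trailing garbage).
    if len(raw.encode("utf-8")) > MAX_BYTES:
        return None, Rejection("syntax", "payload exceeds size cap")
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, Rejection("syntax", exc.msg)
    for name, check in layers:
        error = check(payload)
        if error is not None:
            return None, Rejection(name, error)  # stop at the first failing layer
    return payload, None
```

Because each rejection carries its layer name, counting rejections per layer needs no extra bookkeeping in the checks themselves.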
Validation pipeline architecture
Treat validation as a state machine on every agent step, not a one-off post-processor:
generate → parse → schema_validate → policy_validate → [repair | commit | escalate]
Parse and normalize
Extract JSON from model output (raw JSON mode, fenced blocks, or function-call envelopes). Normalize: strip whitespace, coerce string numbers only when schema allows, reject duplicate keys. If parsing fails, return a structured error observation to the tool error channel — never silently retry with regex hacks on production paths.
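One way to implement the parse step with only the standard library is shown below. It is a sketch under assumptions: the function names and the shape of the error observation are hypothetical, and it handles just raw JSON and a single fenced block, not function-call envelopes.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the example readable
_FENCED = re.compile(re.escape(FENCE) + r"(?:json)?\s*\n(.*?)" + re.escape(FENCE), re.DOTALL)


class ParseError(ValueError):
    """Raised when model output cannot be turned into a single JSON object."""


def _no_duplicates(pairs):
    # json.loads passes every object's key/value pairs here, so duplicates
    # are visible before the default dict construction would drop them.
    seen = {}
    for key, value in pairs:
        if key in seen:
            raise ParseError(f"duplicate key: {key}")
        seen[key] = value
    return seen


def parse_payload(model_output: str) -> dict:
    match = _FENCED.search(model_output)
    text = match.group(1) if match else model_output
    try:
        payload = json.loads(text.strip(), object_pairs_hook=_no_duplicates)
    except json.JSONDecodeError as exc:
        raise ParseError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(payload, dict):
        raise ParseError("top-level value must be an object")
    return payload


def parse_observation(model_output: str) -> dict:
    """Return the payload, or a structured error for the tool error channel."""
    try:
        return {"ok": True, "payload": parse_payload(model_output)}
    except ParseError as exc:
        return {"ok": False, "error": {"layer": "syntax", "message": str(exc)}}
```

The parser never guesses: anything that is not exactly one JSON object becomes an error observation the agent loop can act on.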
Schema gate
Validate against the same JSON Schema exposed to the model (OpenAI structured outputs, Anthropic tool schemas, or local Pydantic/Zod models). Mismatches include wrong enum, missing nested object, float where integer required. See structured outputs explained for provider-native vs post-hoc validation trade-offs.
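To make the schema gate concrete, here is a deliberately small checker that supports only a handful of JSON Schema keywords (`type`, `enum`, `minimum`, `required`, `properties`). It stands in for a real validator such as the jsonschema package or a Pydantic model, which a production system would more likely use; the `enroll_member` schema and its fields are hypothetical.

```python
from __future__ import annotations

from typing import Any

# Hypothetical schema; the same dict would be handed to the model API
# so the model and the runtime validator cannot drift apart.
ENROLL_MEMBER_SCHEMA = {
    "type": "object",
    "required": ["plan_id", "tier", "deductible"],
    "properties": {
        "plan_id": {"type": "string"},
        "tier": {"type": "string", "enum": ["employee", "family"]},
        "deductible": {"type": "integer", "minimum": 0},
    },
}

_TYPES = {"string": str, "integer": int, "number": (int, float), "object": dict}


def validate(value: Any, schema: dict, path: str = "$") -> list[str]:
    """Return a list of human-readable errors; an empty list means valid."""
    expected = _TYPES[schema["type"]]
    # bool is a subclass of int in Python, so reject it explicitly.
    if isinstance(value, bool) or not isinstance(value, expected):
        return [f"{path}: expected {schema['type']}"]
    errors = []
    if "enum" in schema and value not in schema["enum"]:
        errors.append(f"{path}: must be one of {schema['enum']}")
    if "minimum" in schema and value < schema["minimum"]:
        errors.append(f"{path}: must be >= {schema['minimum']}")
    for key in schema.get("required", []):
        if key not in value:
            errors.append(f"{path}.{key}: required")
    for key, sub in schema.get("properties", {}).items():
        if key in value:
            errors.extend(validate(value[key], sub, f"{path}.{key}"))
    return errors
```

The error strings include a path to the offending field, which is what makes them useful as repair feedback later in the pipeline.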
Policy gate
Schema proves shape; policy proves truth. Implement as pure functions over (payload, session_context, tool_results_cache). Example Harbor rule:
assert payload.deductible == cache.rate_sheet[payload.plan_id].deductible
Policy functions should be unit-tested without the LLM — same discipline as payment validation in any backend service.
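A fuller version of the Harbor rule might look like the sketch below. The `RateRow` and `PolicyViolation` types, the `RATE_SHEET_MISSING` rule ID, and the choice of whole-number amounts are assumptions for illustration; only the comparison against the cached rate sheet comes from the text.

```python
from __future__ import annotations

from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RateRow:
    deductible: int
    oop_max: int
    premium: int


@dataclass(frozen=True)
class PolicyViolation:
    rule_id: str
    field: str
    expected: Optional[int]
    actual: object


def check_rate_match(payload: dict, rate_sheet: dict[str, RateRow]) -> list[PolicyViolation]:
    """Pure function: no I/O and no model call, so it is trivial to unit-test."""
    row = rate_sheet.get(payload.get("plan_id"))
    if row is None:
        # The agent never fetched rates for this plan in the current session.
        return [PolicyViolation("RATE_SHEET_MISSING", "plan_id", None, payload.get("plan_id"))]
    violations = []
    for field in ("deductible", "oop_max", "premium"):
        expected = getattr(row, field)
        actual = payload.get(field)
        if actual != expected:
            violations.append(PolicyViolation("RATE_MISMATCH", field, expected, actual))
    return violations


# Fixture-style check, runnable in CI without any LLM involved.
sheet = {"PPO-1": RateRow(deductible=1500, oop_max=6000, premium=220)}
stale = {"plan_id": "PPO-1", "deductible": 1000, "oop_max": 6000, "premium": 220}
assert [v.field for v in check_rate_match(stale, sheet)] == ["deductible"]
```

Returning every violation, rather than raising on the first, lets the caller log a `rule_id` per field and decide between re-fetching data and escalating.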
Repair loop
On schema failure, feed the validator error message back to the model along with the invalid payload (max 2–3 attempts). On policy failure, prefer re-fetching source data via a tool over asking the model to guess the correct number. Cap total repair tokens per step in the context budget policy to prevent infinite fix loops.
Commit vs escalate
If repair attempts are exhausted, block the side effect and escalate: human queue, safer fallback response, or read-only mode. Never “best effort commit” on financial, medical, or infra mutations.
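The repair cap and the commit-or-escalate decision fit in one small driver function. In this sketch, `generate`, `validate`, `commit`, and `escalate` are hypothetical callables supplied by the runtime; the model call is whatever `generate` wraps, and token budgeting is left out for brevity.

```python
from typing import Callable, Optional

MAX_REPAIRS = 2  # assumed cap, within the 2-3 attempts suggested above


def run_step(
    generate: Callable[[Optional[dict]], dict],
    validate: Callable[[dict], Optional[str]],
    commit: Callable[[dict], None],
    escalate: Callable[[dict], None],
    max_repairs: int = MAX_REPAIRS,
) -> str:
    """Run one agent step: one initial attempt plus at most max_repairs retries."""
    feedback = None
    for attempt in range(max_repairs + 1):
        payload = generate(feedback)  # feedback is None on the first attempt
        error = validate(payload)
        if error is None:
            commit(payload)           # the side effect happens only after validation
            return "committed" if attempt == 0 else "repaired"
        # Give the model both the rejected payload and the exact validator error.
        feedback = {"invalid_payload": payload, "validator_error": error}
    escalate(feedback)                # never commit a payload that failed validation
    return "escalated"
```

The return values mirror a `final_outcome` field, so the same function can feed the metrics pipeline. A policy failure would typically replace the `generate` retry with a fresh tool fetch, which this sketch does not model.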
Guardrails vs permissions vs sandboxing
| Control | Question it answers | Example |
|---|---|---|
| Permission scoping | May the agent call this tool at all? | enroll_member not visible to read-only support bot |
| Approval gates | Must a human approve before execution? | Tier-2 gate on any payroll mutation |
| Output validation | Is this specific payload correct and safe? | Deductible matches rate sheet for plan_id |
| Sandbox execution | What happens if we dry-run first? | Staging enrollment API with fake member IDs |
All four stack. Harbor’s incident had permissions (only HR role could enroll) but no output validation — authorized agents still shipped bad data. Pair validation with sandbox execution for new tools: validate in shadow mode until block rate stabilizes below your SLO.
Content and injection guardrails
Structured validation does not stop prompt injection in user-visible replies or tool arguments derived from untrusted text. Layer defenses:
- Input isolation — wrap user content in clear delimiters; never concatenate into system prompts without encoding.
- Output classifiers — lightweight models or regex for PII patterns, instruction-leak phrases, markdown exfiltration.
- Tool argument allowlists — URLs must match internal domain list; SQL fragments rejected by parser.
- Canary tokens in system prompts; alert if they appear in outbound text.
Cross-read prompt injection defense for input-side patterns. Output guardrails are the last line when injection steers the model toward malicious tool args.
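Two of the defenses listed above, tool argument allowlists and canary tokens, are simple enough to sketch directly. The host names and the canary value below are made up for the example; a real deployment would load both from configuration and rotate the canary.

```python
from urllib.parse import urlsplit

# Hypothetical internal hosts the agent is allowed to reach.
ALLOWED_HOSTS = {"benefits.internal.example", "rates.internal.example"}
# Hypothetical canary string planted in the system prompt.
CANARY = "cnry-7f3a9e"


def url_allowed(url: str) -> bool:
    """Allow only https URLs whose host exactly matches the allowlist."""
    try:
        parts = urlsplit(url)
    except ValueError:  # malformed URL, for example a broken IPv6 literal
        return False
    # Exact host match avoids suffix tricks such as allowed.example.evil.com.
    return parts.scheme == "https" and parts.hostname in ALLOWED_HOSTS


def leaks_canary(outbound_text: str) -> bool:
    """True when system prompt content appears to have leaked into output."""
    return CANARY in outbound_text
```

Comparing the parsed hostname rather than searching the URL string matters: a substring test would accept a URL that merely mentions an allowed host in its path or userinfo.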
Observability and metrics
Instrument every rejection with:
- validation_layer (syntax | schema | policy | content)
- rule_id for policy failures
- repair_attempt count
- final_outcome (committed | repaired | blocked | escalated)
Dashboard block rate by tool name. A spike on enroll_member policy RATE_MISMATCH signals stale cache or carrier file drift, not necessarily a bad model. Tie spans to agent tracing so on-call can replay the session with deterministic replay.
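A minimal event record and the block-rate aggregation could look like this. The `ValidationEvent` type and the choice to count both blocked and escalated outcomes as blocks are assumptions; the field names follow the instrumentation list above, with one event recorded per validated step.

```python
from __future__ import annotations

from collections import Counter
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class ValidationEvent:
    tool: str
    validation_layer: Optional[str]  # None when every layer passed
    rule_id: Optional[str]           # set only for policy failures
    repair_attempt: int
    final_outcome: str               # committed | repaired | blocked | escalated


def block_rate_by_tool(events: Iterable[ValidationEvent]) -> dict[str, float]:
    """Share of validated steps per tool that did not reach a commit."""
    totals, blocked = Counter(), Counter()
    for event in events:
        totals[event.tool] += 1
        if event.final_outcome in ("blocked", "escalated"):
            blocked[event.tool] += 1
    return {tool: blocked[tool] / totals[tool] for tool in totals}


def top_rules(events: Iterable[ValidationEvent]) -> list[tuple[str, int]]:
    """Most frequent policy rule IDs, useful for spotting drift such as RATE_MISMATCH."""
    counts = Counter(e.rule_id for e in events if e.rule_id is not None)
    return counts.most_common()
```

Grouping by `rule_id` is what separates a data problem (one rule spiking across all sessions) from a model problem (failures spread across many rules and layers).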
Red flags when guardrails are missing or weak
- High user satisfaction, high backend correction rate — prose quality masks structured errors.
- Prompt-only safety language (“always verify with the API”) with no runtime check.
- Schema validates types but not cross-field constraints — start_date after end_date passes JSON Schema.
- Repair loops without attempt caps — cost blowups and latency tail.
- Validation only in offline eval — production tool args bypass the same code path.
- User-visible text committed before tool validation — employee told wrong deductible before write blocked.
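The cross-field gap flagged in the list above (start_date after end_date passing JSON Schema) is the kind of rule that belongs in the policy layer. A minimal sketch, assuming both fields arrive as ISO date strings; the function name is hypothetical:

```python
from datetime import date
from typing import Optional


def check_coverage_window(payload: dict) -> Optional[str]:
    """Return an error string, or None when the date range is coherent."""
    try:
        start = date.fromisoformat(payload["start_date"])
        end = date.fromisoformat(payload["end_date"])
    except (KeyError, TypeError, ValueError):
        # Missing keys, non-string values, and unparseable dates all fail closed.
        return "start_date and end_date must be ISO dates (YYYY-MM-DD)"
    if start > end:
        return "start_date must not be after end_date"
    return None
```

Each field can be individually valid while the pair is not, which is why a per-field schema cannot catch this class of error.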
Harbor Benefits refactor: 19% to 0.8% bad enrollments
Harbor’s fix was runtime enforcement, not prompt tuning alone.
- Week 1: froze auto-commit; all enrollments queued for human review while building validators.
- Week 2: strict JSON Schema on enroll_member aligned with carrier API; added policy module comparing every financial field to get_rate_sheet tool results cached by plan_id.
- Week 3: provenance rule requiring a source_quote_id from the rate tool response for each monetary field.
- Week 4: validate-repair loop (max 2 attempts) with explicit validator errors; block-and-escalate to HR queue on failure.
- Week 5–6: shadow mode on 20% traffic, then full rollout with weekly policy unit tests when carriers publish new tables.
Outcomes: bad enrollment rate 19% → 0.8% (remaining cases were carrier file lag, caught by provenance); median step latency +180 ms (acceptable for enrollment); repair success on first retry 74%; escalation queue 3.2% of sessions. Correction tickets fell from 340/week to under 15.
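The Week 3 provenance rule could be enforced along these lines. The payload shape (each monetary field as an object with `amount` and `source_quote_id`) and the structure of the session quote cache are assumptions made for the sketch; the source only states that each monetary field required a source_quote_id from the rate tool response.

```python
MONETARY_FIELDS = ("deductible", "oop_max", "premium")  # assumed field names


def check_provenance(payload: dict, session_quotes: dict) -> list:
    """Verify each monetary field cites a quote fetched in this session."""
    errors = []
    for field in MONETARY_FIELDS:
        entry = payload.get(field)
        if not isinstance(entry, dict):
            errors.append(f"{field}: missing amount/source_quote_id object")
            continue
        quote = session_quotes.get(entry.get("source_quote_id"))
        if quote is None:
            errors.append(f"{field}: source_quote_id not fetched this session")
        elif quote.get("plan_id") != payload.get("plan_id"):
            errors.append(f"{field}: quote belongs to a different plan")
        elif quote.get(field) != entry.get("amount"):
            errors.append(f"{field}: amount differs from quoted value")
    return errors
```

Because the check only accepts quote IDs present in the session cache, a value recalled from training data has nothing to cite and is rejected even when it happens to look plausible.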
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| Prompt-only instructions | Low-stakes copy, internal drafts | Any financial, medical, or infra mutation |
| JSON Schema / structured outputs | Type-safe tool args, parser elimination | Cross-field business rules and provenance |
| Policy engine on tool args | Enrollment, billing, access control writes | Free-form creative text quality |
| Validate-repair loop | Recoverable schema typos, enum slips | Policy violations needing new data fetch |
| Sandbox dry-run | New tools, complex multi-table writes | Latency-sensitive chat paths |
| Human escalation on block | Regulated domains, low-volume high-risk | High-throughput automation without staff |
Common pitfalls
- Duplicated schemas — model sees one schema, validator uses another; drift guaranteed.
- Validating after side effect — guardrails must run pre-commit on the hot path.
- Over-trusting provider structured output — reduces syntax errors, not semantic truth.
- Silent coercion: parsing "1500" as a number hides model confusion; fail closed when strict types matter.
- No block metric: teams disable validators when they “slow things down” without seeing what they caught.
- Announcing success before validation — user hears “you’re enrolled” then rollback creates distrust.
- Policy logic in prompts — untestable, changes every model version.
Production checklist
- Map every write tool to a validation pipeline (parse → schema → policy).
- Single source of truth for JSON Schema shared by model API and runtime.
- Unit-test policy rules with fixtures — no LLM in CI for rule coverage.
- Cap repair attempts and token budget per step.
- Block-and-escalate path tested end-to-end monthly.
- Provenance fields for any value that must match an external system.
- Dashboard: block rate, layer, rule_id, repair success, escalation volume.
- Shadow validation on new carrier/API versions before cutover.
- User-visible messaging only after commit succeeds (or clearly mark pending).
- Replay failing sessions with deterministic cassettes for regression.
Key takeaways
- Guardrails validate machine-readable output before side effects — polite prose does not substitute.
- Stack syntax, schema, policy, and provenance layers — each catches errors the previous misses.
- Repair loops help schema slips; policy failures need data or humans — do not infinite-retry guesses.
- Permissions and validation are complementary — authorized agents still need correct payloads.
- Instrument block rate by rule — Harbor cut bad enrollments from 19% to 0.8% with runtime policy, not prompts alone.
Related reading
- Structured outputs explained — JSON Schema, provider modes, validate-repair
- Agent permission scoping explained — least privilege and approval gates
- Tool error handling explained — structured observations and recovery
- Prompt injection defense explained — input-side hardening for agent apps