Guide

LLM agent reflection and self-critique verification loops explained

Harbor Compliance ships a contract-review agent that drafts redline summaries with clause citations pulled from uploaded PDFs. The first production build ran a single model pass: generate answer, stream to counsel, done. Auditors found that 29% of runs cited the wrong section number, paraphrased obligations that did not exist in the source, or mixed language from two exhibits. Legal escalations spiked because attorneys trusted the citations without re-reading every page. After engineering added a draft-critic verification loop — a structured critique pass with mandatory source-span checks, coordinated with output validation gates and grounding verifiers — citation-error rate fell to 3.1% while mean latency rose only 1.4× (two passes, not five).

Reflection loops are the middleware pattern where an agent (or a dedicated critic model) reviews its own draft output against explicit criteria before the user sees it — distinct from planning and decomposition (which decide what to do) and from offline evaluation suites (which score runs after the fact). Single-pass generation is fast but brittle on tasks where small errors are expensive: legal citations, financial figures, SQL that mutates production data, and outbound emails. This guide covers reflection taxonomy, pipeline architecture, iteration budgets, Harbor Compliance’s refactor, a technique decision table, pitfalls, and a production checklist.

Why single-pass agents fail on high-stakes outputs

Autoregressive models optimize for plausible continuation, not guaranteed correctness. Three failure modes dominate production agents where reflection pays off:

Local coherence, global error — each sentence reads well but the aggregate claim contradicts a source document, tool result, or prior turn. Users spot this only after acting on the answer.
Tool observation drift — the model summarizes a JSON payload incorrectly (swapped IDs, rounded currency, dropped null fields) because it never re-reads the raw observation after drafting.
Constraint amnesia — system prompts specify tone, format, and forbidden actions, but long context windows dilute them by the final token. A critique pass re-anchors constraints against the actual draft.

Reflection does not replace retrieval, permissions, or human approval on irreversible actions. It adds a quality gate inside the run so obvious mistakes never reach the channel.

Reflection taxonomy: patterns that scale

Production teams converge on a small set of loop shapes. Pick based on error cost, latency budget, and whether critique needs a separate model:

Self-refine (same model, two roles)

One model generates a draft, then receives a follow-up prompt: “List errors in the draft against these criteria; if any, rewrite.” Cheap to ship, but the same blind spots can persist — the model often rubber-stamps its own work. Use only for low-risk formatting fixes or when paired with deterministic checks.

Draft-critic (separate prompts or models)

A generator produces the draft; a critic (often a smaller, cheaper model with a rigid JSON schema) scores dimensions like factual grounding, policy compliance, and completeness. The generator revises only on failed dimensions. Harbor Compliance uses this pattern: generator cites clauses; critic must return { "span_ok": bool, "issues": [...] } with PDF page offsets.

Verifier tool loop (deterministic + LLM)

The critic is partly code: run SQL EXPLAIN, execute against a read replica, validate JSON Schema, diff cited spans against source text. The LLM revises only when deterministic checks fail. This hybrid gives the highest precision per dollar on structured outputs.

Chain-of-verification (CoVe)

The model drafts, generates verification questions, answers them independently (without seeing the draft), then reconciles contradictions into a final answer. Strong for open-domain factual Q&A; weaker when answers must cite private tool payloads unless those payloads are re-injected into each verification step.

Supervisor reflection (multi-agent)

A supervisor agent reviews subagent trajectories before merge — common in delegation graphs. The supervisor checks plan adherence, tool budget, and output policy rather than rewriting prose line by line.

Pipeline architecture: where reflection sits

Treat reflection as middleware in the agent hook pipeline, not an ad-hoc prompt at the end:

Pre-generation gates — intent classification, retrieval, tool routing (upstream of any draft).
Draft generation — first model pass or tool loop completion; store immutable draft_v1 artifact.
Critique pass — critic receives draft + grounding artifacts (source spans, tool JSON, policy checklist); returns structured pass/fail per dimension.
Revision or escalate — on fail: regenerate with critique injected; on repeated fail: degrade to safe template or HITL queue.
Final validation — schema and policy gates on the post-reflection artifact only.
Delivery — stream or persist; log full trace including drafts and critique JSON for audit.

Key invariant: users never see draft_v1 unless you explicitly offer a “show reasoning” debug mode. Partial streams during reflection confuse counsel and support agents alike.

Iteration budgets, stop rules and cost control

Unbounded reflection can burn 5× tokens and still loop. Production systems cap iteration explicitly:

Max critique rounds — typically 1–2; Harbor Compliance stops at 2 revisions then escalates to human review.
Per-dimension stop — if only tone fails, revise tone only; do not re-retrieve or re-call tools unless grounding failed.
Token ceiling — tie to per-run budgets; reflection shares the same envelope as planning and tools.
Latency SLO — for chat, run critique asynchronously and show “reviewing…” only if p95 stays under your product threshold; otherwise batch reflection post-stream for logging only (not ideal for legal, but acceptable for drafts).
Diminishing returns detector — if critique issue count does not decrease between rounds, abort and escalate.

Harbor Compliance refactor: from 29% to 3.1% citation errors

Harbor’s contract agent ingests PDFs, extracts text blocks with page anchors, and answers natural-language questions (“What is the termination notice period?”). The broken single-pass flow let the model invent section numbers that sounded right. The refactor added:

Grounding bundle — every draft must include citations[] with page, start_char, end_char, and quoted span text.
Deterministic span check — server verifies each quote is a substring of the extracted block; failures skip LLM critique and force immediate revision prompt.
Critic model (Haiku-class) — checks: (1) does the answer entail the cited span? (2) are uncited claims present? (3) does tone match policy (“no legal advice” disclaimer)? Returns JSON only.
Revision prompt — generator receives critic JSON + original spans; may not add new citations without new retrieval.
Escalation — two failed rounds → ticket to paralegal queue with full trace.

Citation errors dropped 29% → 3.1%. Mean cost per successful answer rose 38% (critic + one revision on ~40% of runs) — far cheaper than attorney rework. Escalation rate stabilized at 6%.

Technique decision table

Approach	Best when	Avoid when
Single-pass (no reflection)	Low-stakes chat, creative writing, internal brainstorming	Citations, money, SQL writes, regulated outbound comms
Self-refine (same model)	Format polish, short summaries with cheap retry	Systematic hallucination on facts the model never retrieved
Draft-critic (structured JSON)	Multi-criteria QA, policy + grounding checks, auditable traces	Sub-200ms latency requirements without async UX
Verifier tools + LLM revision	SQL, APIs, numeric proofs, span-level document grounding	Unstructured prose where deterministic oracles do not exist
Chain-of-verification	Open-domain factual Q&A with public knowledge	Private tool payloads unless re-injected per verification step
Supervisor over subagents	Multi-step workflows with delegation and plan drift risk	Simple single-tool Q&A

Common pitfalls

Critic rubber-stamping — critic uses the same model and temperature as generator; errors correlate. Use smaller model, lower temperature, or deterministic checks first.
Reflection without grounding artifacts — critic only sees the draft, not source PDFs or tool JSON; it invents plausible pass grades.
Infinite revise loops — no max rounds or diminishing-returns abort; runs hit loop termination limits or budget caps mid-stream.
Streaming partial drafts — users act on uncorrected text before critique finishes.
Critique prompt drift — criteria live only in a wiki; version critique schemas with deploy fingerprints like canary releases.
Replacing HITL with reflection — irreversible wire transfers and legal filings still need human gates; reflection reduces volume, not accountability.

Production checklist

Classify tasks by error cost; enable reflection only above threshold.
Store immutable draft artifacts and critique JSON per run.
Run deterministic verifiers before LLM critique when oracles exist.
Use structured critic output (JSON Schema), not free-text “looks good.”
Cap critique rounds (1–2) with escalation on persistent fail.
Share reflection token cost with per-run budgets and cost attribution.
Do not stream user-visible output until final pass completes (or label interim).
Inject grounding artifacts (spans, tool payloads) into every critique call.
Log pass/fail dimensions for offline eval regression suites.
Version critique rubrics alongside model and prompt deploys.
Measure error rate and cost/latency jointly — not accuracy alone.
Pair reflection with HITL for irreversible or regulated actions.

Key takeaways

Reflection is in-run quality control, not offline eval — it catches errors before delivery.
Draft-critic with deterministic verifiers beats same-model self-refine on high-stakes tasks.
Harbor Compliance cut citation errors 29% → 3.1% with span checks, JSON critic, and two-round cap.
Iteration budgets are mandatory — unbounded loops destroy latency and token economics.
Grounding artifacts must reach the critic or reflection becomes theater.