
LLM agent production feedback and correction loops explained

Harbor Legal shipped a contract-review agent that extracted key fields (parties, effective date, indemnity cap, governing law) into a review grid paralegals could edit before export. Demos scored well on a 200-contract holdout. In production, telemetry told a different story: in 34% of sessions someone manually corrected the indemnity_cap field, often on clause types the model had already mishandled in QA. Thumbs-up/down buttons existed but were decorative: clicks were logged to analytics, not wired to eval, prompts, or output validation rules. The same extraction mistake recurred week after week because the organization treated user edits as UI friction, not labeled signal.

A production feedback and correction loop closes the gap between live usage and durable quality. It captures structured correction events (what the model said, what the human changed, and why), routes high-value cases into review queues, promotes stable fixes into golden regression tests, and feeds curated pairs into offline alignment when volume justifies it. This guide covers correction event schemas, inline-edit diffing, promotion gates, stratified eval set maintenance, privacy scrubbing, integration with human-in-the-loop routing and agent benchmarking, the Harbor Legal refactor, a technique decision table, pitfalls, and a production checklist.

What production feedback is (and is not)

Production feedback is observed user reaction to a specific agent output in context — not a generic NPS survey and not an offline labeling project disconnected from real prompts, tools, and documents. Useful signals include:

Explicit ratings — thumbs, stars, “was this helpful?” on the final message or per tool step.
Implicit negatives — immediate regenerate, undo, abandon mid-flow, or escalate to human support within N seconds.
Inline corrections — edits to structured fields, rewritten paragraphs accepted after the model draft, or manual tool-arg overrides before commit.
Free-text rationale — optional “what was wrong?” when someone downvotes or edits; gold for taxonomy building.

The loop’s job is to turn those signals into actionable artifacts: regression tests that fail on the old behavior, validator rules that encode repeated mistakes, prompt diffs with measured lift, and eventually preference rows for DPO — not to accumulate a data lake nobody reads.

Correction event schema

Every correction should be a first-class event you can replay. Minimum fields:

run_id / span_id — tie to agent traces and deterministic replay cassettes.
step_index — which tool call or message was corrected.
field_path — JSON pointer for structured outputs (e.g. /clauses/3/indemnity_cap).
model_value and corrected_value — store hashes if raw text is sensitive; keep encrypted blobs for legal replay.
correction_type — enum: typo, wrong_source, policy_violation, missing_context, formatting, other.
user_role — paralegal vs admin; weight review priority.
document_fingerprint — clause type or template ID, not necessarily full PII.

Capture the before-state (prompt, retrieved chunks, tool results) at correction time or via trace lookup. Without that context, a corrected dollar amount cannot tell you whether retrieval or extraction failed.
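As a minimal sketch of the schema above, the fields can be modeled as a frozen dataclass that serializes to JSON. The class and helper names, the choice of SHA-256 for hashing, and the example IDs are illustrative assumptions, not Harbor Legal's actual implementation.

```python
from dataclasses import dataclass, asdict
from enum import Enum
import hashlib
import json


class CorrectionType(str, Enum):
    TYPO = "typo"
    WRONG_SOURCE = "wrong_source"
    POLICY_VIOLATION = "policy_violation"
    MISSING_CONTEXT = "missing_context"
    FORMATTING = "formatting"
    OTHER = "other"


def sha256_hex(text: str) -> str:
    """Hash a value so shared stores never hold sensitive raw text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


@dataclass(frozen=True)
class CorrectionEvent:
    run_id: str
    span_id: str
    step_index: int
    field_path: str            # JSON pointer, e.g. /clauses/3/indemnity_cap
    model_value_hash: str
    corrected_value_hash: str
    correction_type: CorrectionType
    user_role: str
    document_fingerprint: str  # clause type or template ID, not the document

    def to_json(self) -> str:
        payload = asdict(self)
        payload["correction_type"] = self.correction_type.value
        return json.dumps(payload, sort_keys=True)


# Hypothetical IDs, for illustration only.
event = CorrectionEvent(
    run_id="run-123",
    span_id="span-7",
    step_index=2,
    field_path="/clauses/3/indemnity_cap",
    model_value_hash=sha256_hex("2M"),
    corrected_value_hash=sha256_hex("USD 2,000,000"),
    correction_type=CorrectionType.FORMATTING,
    user_role="paralegal",
    document_fingerprint="msa-template-v4",
)
```

The encrypted raw values for legal replay would live in a separate store keyed by run_id and span_id; that part is omitted here.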

Diffing inline edits

For free-text corrections, store character-level or token-level diffs plus a normalized form (strip whitespace, unify currency formatting). Harbor Legal grouped diffs by clause template: “indemnity cap missing currency code” became a validator rule faster than “user changed 2M to USD 2,000,000” repeated 400 times as unique strings.
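A rough sketch of that normalize-then-diff step, using Python's standard difflib. The normalization rules here (collapse whitespace, drop thousands separators) are assumptions chosen for the currency example; a real pipeline would need rules per field type.

```python
import difflib
import re


def normalize_amount(text: str) -> str:
    """Collapse whitespace and drop thousands separators inside numbers."""
    collapsed = " ".join(text.split())
    return re.sub(r"(?<=\d),(?=\d{3})", "", collapsed)


def token_diff(model_value: str, corrected_value: str) -> list[tuple[str, str, str]]:
    """Return (op, old_tokens, new_tokens) for every span that changed."""
    old = normalize_amount(model_value).split()
    new = normalize_amount(corrected_value).split()
    matcher = difflib.SequenceMatcher(a=old, b=new, autojunk=False)
    return [
        (op, " ".join(old[i1:i2]), " ".join(new[j1:j2]))
        for op, i1, i2, j1, j2 in matcher.get_opcodes()
        if op != "equal"
    ]
```

With this, "cap of 2,000,000" corrected to "cap of USD 2,000,000" reduces to a single insert of "USD", which is far easier to group by clause template than the raw strings.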

Loop stages: capture, triage, review, promote, train

Capture — instrument UI and API commit paths so every accepted edit emits an event; never rely on client-only logging.
Triage — auto-bucket by field, correction_type, and frequency; surface spikes in a daily dashboard (“governing_law wrong +220% this week”).
Review — sample high-impact buckets into a human queue; adjudicators mark whether the correction is ground truth or user mistake.
Promote — approved cases become golden tests in CI and optional policy rules in guardrails.
Train — batch curated preference pairs into offline DPO/SFT on a schedule, with holdout eval gates — see preference data curation.

Most teams stall at capture. For agent apps, the highest-ROI stage is usually promotion to golden tests: a single failing test prevents regressions on the next prompt tweak without waiting for a training job.
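The triage stage can start as a few lines of counting. The sketch below buckets events by field name and correction_type and flags week-over-week spikes. The event dictionary shape, the function names, and the thresholds are assumptions for illustration.

```python
from collections import Counter


def bucket_key(event: dict) -> tuple[str, str]:
    """Bucket by the last JSON-pointer segment plus correction type."""
    field = event["field_path"].rsplit("/", 1)[-1]
    return (field, event["correction_type"])


def find_spikes(this_week, last_week, min_count=5, min_growth=1.0):
    """Return (bucket, count, growth) where growth is relative to last week."""
    current = Counter(bucket_key(e) for e in this_week)
    previous = Counter(bucket_key(e) for e in last_week)
    spikes = []
    for key, count in current.items():
        if count < min_count:
            continue  # ignore buckets too small to act on
        baseline = previous.get(key, 0)
        growth = (count - baseline) / baseline if baseline else float("inf")
        if growth >= min_growth:
            spikes.append((key, count, growth))
    return sorted(spikes, key=lambda s: s[1], reverse=True)
```

A bucket that goes from 5 corrections to 16 has growth 2.2, which is the "+220% this week" style of alert described in the triage step.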

Golden test promotion gates

Not every correction should become a permanent test. Harbor Legal used:

Recurrence — same field_path + document_fingerprint corrected at least 3 times in 14 days.
Adjudication — senior reviewer marks ground truth; disagreements stay in queue.
Stability — corrected value does not change across reviewers for the same anonymized fixture.
PII scrub — synthetic or redacted document variant stored in repo; full doc in encrypted vault only.
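One way to encode these four gates is a single eligibility check over the corrections that share a field_path and document_fingerprint. The record type and function below are hypothetical; the thresholds mirror the list above, and stability is read as "all reviewers of one fixture recorded the same value".

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class AdjudicatedCorrection:
    field_path: str
    document_fingerprint: str
    corrected_at: datetime
    reviewer_values: tuple[str, ...]  # ground-truth value from each reviewer
    pii_scrubbed: bool


def eligible_for_promotion(bucket, now, min_recurrence=3, window_days=14) -> bool:
    """bucket: corrections sharing one (field_path, document_fingerprint)."""
    window_start = now - timedelta(days=window_days)
    recent = [c for c in bucket if c.corrected_at >= window_start]
    if len(recent) < min_recurrence:          # recurrence gate
        return False
    for c in recent:
        if not c.reviewer_values:             # adjudication gate
            return False
        if len(set(c.reviewer_values)) != 1:  # stability gate
            return False
        if not c.pii_scrubbed:                # PII scrub gate
            return False
    return True
```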

Promoted tests run on every deploy against the live prompt + tool stack, with deterministic replay for tool responses where possible. Failing golden tests block release unless an explicit waiver documents intentional behavior change.
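The release gate itself can be very small. This sketch assumes a CI step that has already produced pass/fail results per golden test ID and a waiver file mapping test IDs to written rationales; both data shapes and the function name are assumptions.

```python
def blocking_failures(results: dict[str, bool], waivers: dict[str, str]) -> list[str]:
    """Return IDs of failing golden tests that lack a documented waiver."""
    return sorted(
        test_id
        for test_id, passed in results.items()
        if not passed and not waivers.get(test_id, "").strip()
    )


def release_allowed(results: dict[str, bool], waivers: dict[str, str]) -> bool:
    """A release proceeds only when no unwaived golden test fails."""
    return not blocking_failures(results, waivers)
```

An empty or whitespace-only rationale does not count as a waiver, so a blank entry cannot silently unblock a deploy.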

Stratified eval sets from production

Holdout sets drift when production traffic shifts — new contract templates, new regulations, new tool versions. Refresh eval slices monthly:

Stratify the sample by correction_type and business line rather than drawing uniformly at random, and oversample rare, high-impact strata (rare policy errors matter more than formatting nits).
Cap per-user contribution to avoid one power user dominating the set.
Keep a frozen core of 50–100 golden cases for longitudinal comparison; append a rolling tail of recent production failures.
Track pass rate per slice, not one headline number — extraction accuracy can rise while summarization falls.
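A sketch of the refresh under stated assumptions: each stratum is a (correction_type, business_line) pair with a fixed quota, which is one way to keep rare policy errors represented, and the per-user cap applies across the whole sample. The event keys and function names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(events, per_stratum, per_user_cap, seed=0):
    """Draw up to per_stratum events per stratum, capping each user's share."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for e in events:
        strata[(e["correction_type"], e["business_line"])].append(e)

    per_user = defaultdict(int)  # cap is global across strata
    sample = []
    for key in sorted(strata):
        pool = strata[key][:]
        rng.shuffle(pool)
        taken = 0
        for e in pool:
            if taken == per_stratum:
                break
            if per_user[e["user_id"]] >= per_user_cap:
                continue
            per_user[e["user_id"]] += 1
            sample.append(e)
            taken += 1
    return sample


def pass_rate_by_slice(results):
    """results: iterable of (slice_name, passed). Returns rate per slice."""
    totals, passes = defaultdict(int), defaultdict(int)
    for slice_name, passed in results:
        totals[slice_name] += 1
        passes[slice_name] += int(passed)
    return {name: passes[name] / totals[name] for name in totals}
```

The frozen core would be stored separately and concatenated with this rolling sample at eval time, so longitudinal comparisons always run on identical cases.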

Privacy, consent, and retention

Correction payloads often contain PII, PHI, or privileged legal text. Minimum bar:

Disclose in product policy that edits may be used to improve the system.
Redact or tokenize before writes to shared analytics stores.
Separate operational retention (30–90 days for debugging) from training retention (scrubbed fixtures only).
Honor deletion requests: map user_id to correction events and purge or anonymize.

Enterprise contracts sometimes forbid training on customer data — golden tests can still use customer-derived fixtures stored only in the tenant’s environment with synthetic exports for your CI.
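To illustrate the "redact or tokenize before writes" step, the sketch below replaces emails and known party names with keyed, stable tokens so the same entity maps to the same placeholder across events. The regex, the token format, and the idea of passing a list of known parties are assumptions; real PII detection needs much broader coverage than this.

```python
import hashlib
import hmac
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def tokenize(value: str, secret: bytes, prefix: str) -> str:
    """Keyed hash: stable per secret, not reversible without the vault."""
    digest = hmac.new(secret, value.encode("utf-8"), hashlib.sha256).hexdigest()
    return f"<{prefix}:{digest[:12]}>"


def scrub(text: str, secret: bytes, known_parties: list[str]) -> str:
    """Tokenize emails first, then party names (longest name first)."""
    text = EMAIL_RE.sub(lambda m: tokenize(m.group(0), secret, "EMAIL"), text)
    for name in sorted(known_parties, key=len, reverse=True):
        text = text.replace(name, tokenize(name, secret, "PARTY"))
    return text
```

Using a per-tenant secret keeps tokens from being joined across customers, and rotating or deleting that secret is one way to support anonymization on deletion requests.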

Red flags when the loop is broken

High edit rate, flat eval pass rate — corrections are not promoted.
Thumbs without field linkage — you know users are unhappy, not what to fix.
Same ticket theme weekly — support knows the bug; engineering has no regression test.
Training on raw edits without adjudication — users sometimes fix correct outputs to match a wrong internal template.
Golden tests skipped in CI — “too flaky” usually means non-deterministic tools without cassettes.
Correction taxonomy never updated — everything buckets to other.

Harbor Legal refactor: 34% to 6% repeat errors

Harbor’s six-week rebuild:

Week 1: wired grid cell commits to correction events with full trace IDs.
Week 2: dashboard of top field_path + template pairs; indemnity cap dominated.
Week 3: promoted 28 golden fixtures after adjudication; added policy rule requiring currency code when amount present.
Week 4: retrieval tweak for indemnity sections + validator rule; repeat indemnity_cap corrections fell sharply.
Week 5–6: exported 1,200 adjudicated pairs into a quarterly DPO batch; online repeat-error rate on top five fields dropped from 34% to 6% (remaining cases were genuinely ambiguous clauses routed to human review).

Median session time fell 12% (less re-typing), the golden test count grew from 28 to 94, and golden tests caught 11 deploy-blocking regressions before production in the following quarter.

Technique decision table

Approach	Best for	Weak when
Thumbs only	Early sentiment, A/B headline metrics	Structured agent outputs needing field-level fix
Inline edit capture	Extraction, forms, codegen with accept/edit	Chat-only products with no structured surface
Golden test promotion	Repeat failures, release gates	One-off edge cases never seen again
Validator rules from patterns	Deterministic policy (currency, dates)	Subjective summarization quality
Offline DPO/SFT batches	Thousands of adjudicated pairs	<200 quality labels or high label noise
Real-time model routing on low score	Latency-tolerant fallback tier	The needed fix is in retrieval, not the model
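For the "validator rules from patterns" row, a rule like the one Harbor Legal added (require a currency code when an amount is present) can be a single function. The regexes and the currency allow-list below are illustrative assumptions, not the rule Harbor actually shipped.

```python
import re

AMOUNT_RE = re.compile(r"\d[\d,]*(?:\.\d+)?")
CURRENCY_RE = re.compile(r"\b(?:USD|EUR|GBP)\b")  # hypothetical allow-list


def validate_indemnity_cap(value: str) -> list[str]:
    """Return validation errors; an empty list means the value passes."""
    errors = []
    if AMOUNT_RE.search(value) and not CURRENCY_RE.search(value):
        errors.append("indemnity_cap: amount present without currency code")
    return errors
```

Values with no numeric amount, such as "unlimited", pass untouched, so the rule only fires on the pattern that kept recurring in corrections.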

Common pitfalls

Treating edits as success — a completed session with twelve manual fixes is a failure mode, not completion.
No link to traces — impossible to reproduce without prompt + retrieval snapshot.
Overfitting golden tests to one customer — cap fixtures per tenant.
Training on pre-correction model output as “rejected” without storing the paired accepted edit.
Ignoring positive signal — untouched high-confidence extractions are useful positive examples for contrastive eval.
Feedback UI fatigue — ask rationale only on downvote or large diffs, not every keystroke.
Loop bypass for internal users — staff edits are often the highest-quality labels.
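Two of these pitfalls (training without the paired accepted edit, and training on unadjudicated edits) can be guarded against at export time. The sketch below only emits a preference row when the event was adjudicated as ground truth and both sides of the pair exist. The record type, the verdict strings, and the prompt/chosen/rejected keys are assumptions; adapt them to whatever format your trainer expects.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AdjudicatedEvent:
    prompt: str
    model_value: str
    corrected_value: Optional[str]  # None when the accepted edit was not stored
    verdict: str                    # "ground_truth" or "user_mistake"


def to_preference_rows(events: list[AdjudicatedEvent]) -> list[dict]:
    """Keep only adjudicated pairs with a distinct chosen and rejected value."""
    rows = []
    for e in events:
        if e.verdict != "ground_truth":
            continue  # reviewer judged the user's edit to be wrong
        if not e.corrected_value or e.corrected_value == e.model_value:
            continue  # no usable pair without a differing accepted edit
        rows.append(
            {"prompt": e.prompt, "chosen": e.corrected_value, "rejected": e.model_value}
        )
    return rows
```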

Production checklist

Every structured commit path emits a correction event with trace ID.
Schema includes model_value, corrected_value, field_path, correction_type.
Daily dashboard: edit rate by field, template, and model version.
Adjudication queue for recurring high-impact buckets.
Golden test promotion rules documented and enforced in CI.
Stratified eval refresh monthly with frozen core + rolling tail.
PII scrub pipeline before shared storage; retention policy published.
Regression block on golden failure unless waived with rationale.
Quarterly export to preference datasets only after label QC.
Close the loop: ship fix, verify edit rate drops on affected slice.
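The last checklist item can be checked mechanically by comparing the edit rate on the affected field before and after a fix ships. The session dictionary shape, the function names, and the 50% relative-drop threshold are assumptions for illustration.

```python
from typing import Optional


def edit_rate(sessions: list[dict], field: str) -> Optional[float]:
    """Share of sessions showing the field in which a user edited it."""
    relevant = [s for s in sessions if field in s["fields_shown"]]
    if not relevant:
        return None
    edited = sum(1 for s in relevant if field in s["fields_edited"])
    return edited / len(relevant)


def fix_verified(before, after, field, min_relative_drop=0.5) -> bool:
    """True when the edit rate fell by at least min_relative_drop."""
    rate_before = edit_rate(before, field)
    rate_after = edit_rate(after, field)
    if not rate_before or rate_after is None:
        return False  # no baseline or no post-fix traffic to compare
    return (rate_before - rate_after) / rate_before >= min_relative_drop
```

Segment the before and after windows by template and model version as well, so a drop caused by a traffic shift is not mistaken for a fix.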

Key takeaways

User corrections are labeled data — capture them with the same rigor as offline labeling.
Promote to golden tests before training jobs — fastest path to stopping repeat errors.
Link every event to traces and context — otherwise you cannot debug or replay.
Adjudicate before DPO — not every edit is ground truth.
Measure repeat-error rate per field — Harbor Legal cut indemnity-cap rework from 34% to 6% with a closed loop.