LLM human-in-the-loop explained
Harbor Support's tier-one triage bot was fast but untrustworthy. It closed tickets with confident-sounding summaries that misread refund policy, routed enterprise outages to the self-serve FAQ, and once told a customer their data breach report was “resolved” because the model paraphrased a closing template. Legal demanded a kill switch; engineers wanted full autonomy. The compromise was human-in-the-loop (HITL): the model still drafts replies, classifies intent, and suggests macros, but anything above a risk threshold, below a confidence threshold, or touching regulated topics lands in a review queue where a human approves, edits, or rejects it before the customer sees it. Over six weeks, escalation volume fell 31% and incorrect-send incidents dropped to zero.
Human-in-the-loop is the practice of inserting people at decision points in an LLM pipeline — not as a permanent crutch, but as a safety layer, a labeling source, and a calibration signal. This guide covers approval gates and async review queues, real-time copilot vs autonomous send, capturing structured feedback for RLHF and evaluation, pairing with guardrails and LLM-as-judge scorers, the Harbor Support refactor, a technique decision table, pitfalls, and a production checklist.
What human-in-the-loop means for LLMs
HITL is not “humans write the prompts.” It is an explicit architecture choice about when model output becomes an irreversible action: sending an email, posting a trade, merging code, deleting records. Three layers appear in most production systems:
- Pre-generation gates — block or reroute requests before the model runs (policy classifiers, PII scanners, jurisdiction checks). Humans rarely touch these unless appeals exist.
- Post-generation review — the model produces a draft; a human or a second automated check must approve before publish. This is the classic review queue.
- Post-action feedback — after a human edits or overrides the model, capture the diff as training or evaluation data. Without this loop, HITL is pure cost with no compounding benefit.
The design tension is latency vs liability. Copilot-style products keep humans in the loop by default (user clicks Send). Autonomous agents need async queues, SLAs, and clear ownership when the queue backs up.
Review queues and approval gates
When to route to humans
Effective routing combines signals — no single score is enough:
- Model confidence — logprobs, calibrated classifiers, or an LLM-as-judge score below threshold (e.g. < 0.85 on policy adherence).
- Topic risk tier — refunds, medical, legal, security, minors, financial advice: fixed human review regardless of score.
- Customer tier — enterprise accounts, high LTV, or open executive escalations bypass auto-send.
- Novelty — retrieval returned no chunks above the similarity cutoff, or the request is the first occurrence of a new intent cluster.
- Tool side effects — any proposed action that mutates state (refund API, account deletion) requires approval unless on an allowlist.
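The signals above can be combined in a small routing function. The sketch below is a minimal illustration under stated assumptions, not a reference implementation: the `Draft` fields, the topic and tool names, the allowlist contents, and the 0.75 similarity cutoff are all hypothetical, and only the 0.85 judge threshold echoes the example in the list. It collects every signal that fires so the routing reason can be logged with the queue item.

```python
from dataclasses import dataclass, field
from typing import List

# All names and values below are illustrative assumptions, not a real API.
HIGH_RISK_TOPICS = {"refund", "medical", "legal", "security", "minors", "financial_advice"}
REVIEW_TIERS = {"enterprise"}
MUTATING_TOOL_ALLOWLIST = {"add_internal_note"}
JUDGE_THRESHOLD = 0.85      # example threshold from the text
SIMILARITY_CUTOFF = 0.75    # hypothetical retrieval cutoff


@dataclass
class Draft:
    topic: str
    judge_score: float        # 0..1 policy-adherence score
    customer_tier: str
    top_similarity: float     # best retrieval similarity for this request
    mutating_tools: List[str] = field(default_factory=list)


def routing_reasons(d: Draft) -> List[str]:
    """Return every signal that forces human review (empty list means auto-send)."""
    reasons = []
    if d.judge_score < JUDGE_THRESHOLD:
        reasons.append("low_confidence")
    if d.topic in HIGH_RISK_TOPICS:
        reasons.append("high_risk_topic")
    if d.customer_tier in REVIEW_TIERS:
        reasons.append("customer_tier")
    if d.top_similarity < SIMILARITY_CUTOFF:
        reasons.append("retrieval_gap")
    if any(t not in MUTATING_TOOL_ALLOWLIST for t in d.mutating_tools):
        reasons.append("mutating_tool")
    return reasons


def route(d: Draft) -> str:
    return "review_queue" if routing_reasons(d) else "auto_send"
```

With these assumed values, a confident draft on a low-risk topic is auto-sent, while the same draft tagged `refund` is queued no matter how high its score is.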
Queue UX that reviewers actually use
Reviewers need context in one screen: original user message, retrieved sources with highlights, model draft, diff-friendly editor, one-click approve/reject, mandatory reason codes on reject, and estimated customer wait time. Hide raw chain-of-thought; show structured rationale fields the model was prompted to fill. SLA timers and queue depth alerts prevent silent backlog.
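SLA timers and queue depth alerts reduce to a periodic check over recent wait times. The following is a minimal sketch: the function names, the nearest-rank percentile method, and the `max_depth` default are assumptions, and the 15-minute default is only an example SLA.

```python
from typing import List


def p95(values: List[float]) -> float:
    """Nearest-rank 95th percentile; raises on empty input."""
    if not values:
        raise ValueError("no wait times recorded")
    ordered = sorted(values)
    rank = (95 * len(ordered) + 99) // 100   # ceil(0.95 * n) in integer math
    return ordered[rank - 1]


def queue_alerts(wait_minutes: List[float], depth: int,
                 sla_minutes: float = 15.0, max_depth: int = 50) -> List[str]:
    """Return alert codes for the current queue snapshot (hypothetical codes)."""
    alerts = []
    if wait_minutes and p95(wait_minutes) > sla_minutes:
        alerts.append("p95_wait_over_sla")
    if depth > max_depth:
        alerts.append("queue_depth_high")
    return alerts
```

Running a check like this on a schedule, and paging the queue owner when it returns anything, is one way to make a backlog visible before reviewers start batch-approving.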
Partial automation patterns
- Suggest-and-edit: the human always changes something small before send; good for brand voice calibration.
- Auto-send with audit: the low-risk path sends immediately; sample 2–5% for retroactive human audit to catch drift.
- Two-key: model plus senior reviewer for irreversible actions.
Pick one pattern per product surface; mixing them confuses metrics.
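For the auto-send-with-audit pattern, one way to pick the 2–5% sample is to hash the ticket ID rather than call a random number generator, so the same ticket always gets the same decision and the sample can be reproduced later. This is a sketch under assumptions: the function name, the salt, and the 3% default are hypothetical.

```python
import hashlib


def selected_for_audit(ticket_id: str, rate: float = 0.03, salt: str = "audit-v1") -> bool:
    """Deterministically select roughly `rate` of tickets for retroactive audit."""
    digest = hashlib.sha256(f"{salt}:{ticket_id}".encode("utf-8")).digest()
    # Map the first 8 bytes of the hash to a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return bucket < rate
```

Changing the salt draws a fresh sample, which may be useful if reviewers begin to recognize which tickets get audited.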
Active learning and feedback pipelines
HITL pays off when human decisions flow back into the system. Capture these artifacts on every review:
- Preference pairs — model draft (rejected) vs human final (chosen). Feeds DPO and reward model training.
- Edit spans — character-level diffs show whether errors are factual, tonal, or structural.
- Reason codes — enum labels (wrong_policy, hallucinated_citation, unsafe_content) enable targeted eval suites.
- Retrieval misses — flag when the human added facts not present in retrieved chunks; drives corpus gaps.
Batch exports weekly into your evaluation harness. Prioritize labeling disagreements where two reviewers conflict — those examples sharpen rubrics faster than random samples. Pair with observability traces so each queue item links to prompt version, model ID, and retrieval set.
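The artifacts listed above can be derived from a single review record. The sketch below assumes a hypothetical `ReviewRecord` shape and uses Python's standard `difflib` for character-level edit spans; the `prompt`/`chosen`/`rejected` keys are one common layout for preference data, not a required schema.

```python
import difflib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReviewRecord:
    prompt: str
    model_draft: str
    human_final: str
    reason_code: Optional[str] = None   # e.g. "wrong_policy"


def preference_pair(r: ReviewRecord) -> Optional[dict]:
    """Human final is 'chosen', model draft is 'rejected'; None if unedited."""
    if r.model_draft == r.human_final:
        return None
    return {"prompt": r.prompt, "chosen": r.human_final, "rejected": r.model_draft}


def edit_spans(draft: str, final: str) -> List[Tuple[str, str, str]]:
    """Character-level diffs as (operation, draft_text, final_text) tuples."""
    matcher = difflib.SequenceMatcher(a=draft, b=final, autojunk=False)
    return [
        (tag, draft[i1:i2], final[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]
```

For example, `edit_spans("Refunds take 5 days.", "Refunds take 10 days.")` returns `[("replace", "5", "10")]`, a short factual edit that is easy to tell apart from a full tonal rewrite.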
Shrinking the queue over time
Track human touch rate (% of sessions requiring review) and override rate (% of auto-sends later corrected). A healthy program lowers touch rate while override rate stays flat or falls. If touch rate drops but overrides rise, you are auto-sending mistakes, so tighten thresholds. Monthly retraining or prompt patches should target the top three reason codes.
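The two rates and the rule for reading them can be written down directly. This is a minimal sketch: the dictionary keys and the returned labels are hypothetical, and the counts in the usage note are made up for illustration.

```python
def rate(numerator: int, denominator: int) -> float:
    """Safe ratio; an empty period counts as 0.0."""
    return numerator / denominator if denominator else 0.0


def queue_health(prev: dict, curr: dict) -> str:
    """Compare two periods, each with sessions, reviewed, auto_sent, corrected."""
    touch_prev = rate(prev["reviewed"], prev["sessions"])
    touch_curr = rate(curr["reviewed"], curr["sessions"])
    over_prev = rate(prev["corrected"], prev["auto_sent"])
    over_curr = rate(curr["corrected"], curr["auto_sent"])

    if touch_curr < touch_prev and over_curr > over_prev:
        return "tighten_thresholds"   # fewer reviews, but more auto-sent mistakes
    if touch_curr < touch_prev:
        return "healthy"              # fewer reviews, overrides flat or falling
    return "no_progress"              # touch rate did not fall
```

With made-up counts, a period that moves from 590 to 420 reviewed sessions out of 1,000 while corrections go from 4 of 410 auto-sends to 5 of 580 reads as `"healthy"`, because the override rate fell slightly.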
Real-time copilot vs async autonomous agents
| Mode | Human role | Latency budget | Typical products |
|---|---|---|---|
| Copilot / assist | User is the reviewer; model suggests inline | Seconds; user waits on screen | IDE completion, email draft, CRM note writer |
| Async queue | Dedicated reviewer team; customer waits minutes–hours | Minutes to SLA (e.g. 15 min P95) | Support replies, content moderation, underwriting drafts |
| Supervised autonomy | Human monitors dashboard; intervenes on alerts | Seconds for agent; human is off the critical path | Deployment bots, data pipeline agents with rollback |
| Appeals-only HITL | Human sees cases only when user disputes | Hours | Community moderation, ad review at scale |
Autonomous tool-using agents should default to propose → approve → execute for mutating tools. Read-only tools (search, fetch ticket) can run freely. The approval step can be human or a hardened rules engine — but someone must own the allowlist.
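A propose → approve → execute gate can be as small as one wrapper around tool dispatch. The sketch below is illustrative only: the tool names, the `APPROVAL_EXEMPT` allowlist, and the `approved_by` parameter are assumptions, and a real system would also persist the proposal and the approver's identity.

```python
from typing import Callable, Dict, Optional

READ_ONLY_TOOLS = {"search", "fetch_ticket"}   # run freely
APPROVAL_EXEMPT = set()                        # allowlisted mutating tools, if any


class ApprovalRequired(Exception):
    """Raised when a mutating tool is called without an approver."""


def execute(name: str, args: dict, tools: Dict[str, Callable],
            approved_by: Optional[str] = None):
    """Run a tool; mutating tools need an approver unless allowlisted."""
    if name not in tools:
        raise KeyError(f"unknown tool: {name}")
    needs_approval = name not in READ_ONLY_TOOLS and name not in APPROVAL_EXEMPT
    if needs_approval and approved_by is None:
        raise ApprovalRequired(name)
    return tools[name](**args)
```

Treating every unknown tool as mutating by default keeps the failure mode conservative: a newly added tool is gated until someone deliberately places it on a list.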
Harbor Support triage refactor
Harbor's support stack: intent classifier, RAG over policy docs, reply generator, and Zendesk API writer. The refactor added HITL without killing throughput.
Routing rules shipped
- Risk matrix — 14 topic tags (refund, GDPR, outage, etc.) crossed with product SKU tier; any high×high cell forces the ticket into the queue.
- Dual scores — citation coverage (% of claims with chunk IDs) AND judge score on tone/policy; both must pass for auto-send.
- Customer-visible delay — queued tickets get an immediate holding reply with honest ETA; no silent queue.
- Reviewer macros — one-click insert of approved legal paragraphs; edits still logged as preference data.
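The risk-matrix and dual-score rules can be illustrated as follows. This is not Harbor's code: the claim representation, the function names, and both thresholds (0.9 coverage, 0.85 judge score) are assumptions chosen for the example.

```python
from typing import Dict, List


def citation_coverage(claims: List[Dict]) -> float:
    """Share of claims that carry at least one chunk ID."""
    if not claims:
        return 0.0
    cited = sum(1 for c in claims if c.get("chunk_ids"))
    return cited / len(claims)


def can_auto_send(topic_risk: str, sku_tier: str, claims: List[Dict],
                  judge_score: float, min_coverage: float = 0.9,
                  min_judge: float = 0.85) -> bool:
    """High x high cells always queue; otherwise both scores must pass."""
    if topic_risk == "high" and sku_tier == "high":
        return False
    return citation_coverage(claims) >= min_coverage and judge_score >= min_judge
```

Requiring both scores means a well-cited reply with poor tone is queued, and so is a polished reply with uncited claims.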
Metrics after six weeks
Auto-send share rose from 41% to 58% as thresholds were tuned on labeled data. Median queue wait was 4.2 minutes (target < 15). CSAT on bot-handled threads rose 0.3 points once wrong sends stopped. The labeling pipeline produced 4,200 preference pairs for a quarterly DPO refresh of the reply model.
Technique decision table
| Approach | Best when | Skip when |
|---|---|---|
| Human-in-the-loop review queue | High-stakes or regulated outputs; need audit trail; moderate volume | Sub-second latency; millions of trivial classifications per hour |
| Guardrails only (no human) | Low stakes, reversible actions, strong automated tests | Customer-facing policy advice; irreversible mutations |
| LLM-as-judge auto-reject | Scale pre-filter before human queue; reduce reviewer load | Judge correlates poorly with human rubric (measure first) |
| Full automation | Internal summaries, draft-only artifacts, user always edits | Any send-without-preview path on external users |
| RLHF / DPO offline | Amortize human labels across all future traffic | Need immediate fix before next training cycle |
| Chain-of-verification | Factual Q&A with verifiable sources | Subjective tone, negotiation, creative writing |
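The "measure first" advice for LLM-as-judge can be made concrete by comparing judge labels with human labels on the same items. The sketch below uses Cohen's kappa as one possible agreement statistic. The choice of kappa, the function name, and the pass/fail labels are assumptions, and what counts as good enough agreement is a judgment call for each rubric.

```python
from collections import Counter
from typing import List


def cohens_kappa(human: List[str], judge: List[str]) -> float:
    """Chance-corrected agreement between two label lists of equal length."""
    if not human or len(human) != len(judge):
        raise ValueError("label lists must be non-empty and the same length")
    n = len(human)
    observed = sum(h == j for h, j in zip(human, judge)) / n
    h_counts, j_counts = Counter(human), Counter(judge)
    # Agreement expected by chance, from each rater's label frequencies.
    expected = sum(h_counts[label] * j_counts[label] for label in h_counts) / (n * n)
    if expected == 1.0:
        return 1.0   # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 means the judge tracks reviewers closely, while a value near 0 means agreement is no better than chance and the judge should not be allowed to auto-reject yet.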
Common pitfalls
- Queue as landfill — routing everything “just in case” trains reviewers to rubber-stamp. Tune thresholds from data.
- No feedback capture — humans edit but labels never reach training or eval; HITL cost never amortizes.
- Hidden automation — customers think a human wrote the reply when it was auto-sent; regulatory and trust risk.
- Context-poor review UI — reviewers cannot see retrieval sources; they guess instead of verifying.
- SLA neglect — queue depth grows until reviewers batch-approve without reading.
- Single global threshold — enterprise and free-tier need different risk tolerance.
- Confusing copilot with autonomy — user did not review because the UI implied the draft was “ready to send.”
Production checklist
- Map irreversible actions and assign minimum review tier per action type.
- Define risk taxonomy (topics, customer tiers, tool side effects).
- Implement dual signals: confidence plus citation or judge score.
- Build reviewer UI with sources, draft, editor, reason codes, and SLA timer.
- Log prompt version, model ID, retrieval set, and routing reason per item.
- Export preference pairs and edit spans on a fixed schedule.
- Track human touch rate, override rate, queue wait P95, and CSAT.
- Sample auto-sent traffic for retroactive audit (2–5%).
- Document customer-facing disclosure when AI drafts are involved.
- Run quarterly rubric calibration sessions with reviewers and legal.
Key takeaways
- HITL inserts humans before irreversible actions — not as prompt writers, but as approvers, editors, and labelers.
- Route on combined signals: confidence, topic risk, customer tier, retrieval gaps, and mutating tool calls.
- Capture edits and reason codes so human effort compounds into eval suites and preference training.
- Harbor Support reached 58% auto-send with zero incorrect-send incidents by pairing dual scores with a 15-minute queue SLA.
- Shrink the queue by fixing top reason codes — not by lowering standards until override rate spikes.
Related reading
- LLM guardrails explained — automated input/output filters before and after generation
- LLM-as-judge explained — rubric scoring to triage what reaches human review
- RLHF explained — turning human preferences into weight updates
- LLM evaluation and benchmarking explained — regression tests fed by reviewer labels