LLM agent loop termination explained

Harbor Analytics' on-call triage agent was supposed to find the root cause of a payment-gateway timeout in under two minutes. Instead it entered a loop: call search_logs with the same time window and filter, receive 4,200 rows, compress them, think “I need more context,” and call search_logs again. Twelve iterations later the ticket still had no summary — but the run had consumed 380,000 tokens and $47 in API spend. The model never refused to continue; the runtime never told it to stop.

Agent loop termination is the set of rules and detectors that decide when a multi-step ReAct or plan-and-execute agent should exit: with a final answer, a partial result, or a handoff to a human. Without explicit termination, agents inherit the model's bias toward “one more tool call,” which is fine in demos but expensive and risky in production. After Harbor layered iteration caps, stagnation fingerprints, and a token budget governor on top of goal predicates, runaway loops dropped 91% and median ticket resolution time fell from 4.1 to 1.6 minutes. This guide covers the stopping-criteria taxonomy, budget envelopes, success and failure predicates, oscillation detection, graceful degradation, the Harbor Analytics refactor, a decision table comparing termination techniques with open-ended loops, common pitfalls, and a production checklist.

Why termination is a first-class subsystem

Tool-using agents are state machines wrapped around an LLM. Each turn the model chooses: answer now, call a tool, or (if allowed) ask a clarifying question. Left unconstrained, frontier models often prefer action over closure — especially when observations are noisy or incomplete. Termination is not a single max_steps=10 constant; it is a policy layer that sits between the orchestrator and the model, evaluating every loop edge before the next inference call.

Production teams care about termination for four reasons:

  • Cost — each extra turn multiplies prompt tokens (growing scratchpad) plus tool latency; runaway loops dominate LLM spend on agent workloads.
  • Latency — users waiting on a support chat or internal copilot expect sub-minute responses; unbounded loops blow SLAs.
  • Safety — agents that never stop may keep calling write tools, re-fetching PII, or hammering rate-limited APIs.
  • Quality — paradoxically, extra iterations after a sufficient answer often degrade accuracy as the scratchpad fills with contradictory observations.

Stopping-criteria taxonomy

Mature runtimes combine several independent stop signals. Treat success paths as an OR (any one validated success signal ends the run) and guardrails as an AND (the loop continues only while every guardrail still passes):

1. Hard budget caps

Non-negotiable ceilings evaluated before every turn:

  • Max iterations — simplest guard; typical support agents use 5–15, research agents 20–40. Count both tool calls and pure reasoning turns if your parser emits them separately.
  • Token budget — cumulative input+output tokens per session; pair with context compression so the cap reflects useful work, not padding.
  • Cost budget — dollar or credit ceiling per user, ticket, or org; essential for multi-tenant SaaS.
  • Wall-clock timeout — absolute elapsed time including tool I/O; prevents a single slow SQL query from blocking the loop indefinitely.
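The four caps above can be held in one envelope object that the orchestrator consults before every turn. The following is a minimal sketch, not a reference implementation: the class name `BudgetEnvelope`, its field names, and the limits used in the demo are all assumptions made for illustration.

```python
import time
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class BudgetEnvelope:
    """Hard ceilings for one agent session (names and limits are illustrative)."""
    max_iterations: int
    max_tokens: int
    max_cost_usd: float
    max_seconds: float
    iterations: int = 0
    tokens: int = 0
    cost_usd: float = 0.0
    started_at: float = field(default_factory=time.monotonic)

    def record_turn(self, tokens: int, cost_usd: float) -> None:
        """Charge one completed turn against the envelope."""
        self.iterations += 1
        self.tokens += tokens
        self.cost_usd += cost_usd

    def exceeded(self) -> Optional[str]:
        """Return the first cap that has been hit, or None if the loop may continue."""
        if self.iterations >= self.max_iterations:
            return "max_iterations"
        if self.tokens >= self.max_tokens:
            return "token_budget"
        if self.cost_usd >= self.max_cost_usd:
            return "cost_budget"
        if time.monotonic() - self.started_at >= self.max_seconds:
            return "wall_clock"
        return None


envelope = BudgetEnvelope(max_iterations=5, max_tokens=10_000,
                          max_cost_usd=1.00, max_seconds=60.0)

# Check the envelope before every turn; a real loop would call the model and
# tools inside the body instead of charging a fixed amount.
while (reason := envelope.exceeded()) is None:
    envelope.record_turn(tokens=4_000, cost_usd=0.10)

print(reason, envelope.iterations)  # token_budget 3
```

Note that this check only runs between turns. A wall-clock cap enforced this way cannot interrupt a tool call that is already in flight, so slow tools still need their own per-call timeout.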

2. Goal and success predicates

Soft stops when the task is objectively complete:

  • Structured finish tool — require the model to call submit_answer or finish with a JSON schema; the runtime validates fields before exiting.
  • Parser-detected final answer — regex or grammar for Final Answer: blocks in ReAct transcripts; fragile alone but fine as a secondary signal.
  • External verifier — lightweight classifier or rules engine that scores whether required slots (order ID, error code, citation URLs) are populated.
  • Coverage threshold — for RAG agents, stop when retrieved chunks cover all sub-questions in a decomposed query plan.
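A structured finish tool and a slot-based verifier can share one validation step: the runtime accepts the finish call only if every required slot is present, correctly typed, and non-empty. The sketch below assumes a hypothetical slot list (`order_id`, `error_code`, `citation_urls`) taken from the examples above; a production system would more likely validate against a full JSON schema.

```python
from typing import Any

# Hypothetical required slots for a support-style finish tool.
REQUIRED_SLOTS = {"order_id": str, "error_code": str, "citation_urls": list}


def validate_finish(payload: dict[str, Any]) -> list[str]:
    """Return a list of problems; an empty list means the finish call is accepted."""
    problems = []
    for name, expected_type in REQUIRED_SLOTS.items():
        value = payload.get(name)
        if value is None:
            problems.append(f"missing slot: {name}")
        elif not isinstance(value, expected_type):
            problems.append(f"wrong type for {name}: expected {expected_type.__name__}")
        elif len(value) == 0:
            problems.append(f"empty slot: {name}")
    return problems


incomplete = {"order_id": "A-1001", "error_code": ""}
complete = {
    "order_id": "A-1001",
    "error_code": "GATEWAY_TIMEOUT",
    "citation_urls": ["https://example.com/runbook"],
}

print(validate_finish(incomplete))  # ['empty slot: error_code', 'missing slot: citation_urls']
print(validate_finish(complete))    # []
```

When validation fails, returning the problem list to the model as a tool error gives it a concrete reason to keep working rather than a silent rejection.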

3. Stagnation and oscillation detection

The Harbor incident was stagnation: identical actions with identical arguments producing observations the model failed to integrate. Detectors include:

  • Action fingerprint — hash of (tool_name, canonical_args); halt after N repeats (usually 2).
  • Observation similarity: compare embeddings of consecutive tool results; if cosine similarity stays above 0.98 for three turns, inject a system nudge or stop.
  • State oscillation — A → B → A tool alternation (e.g. get_user then list_orders then get_user again) without new facts.
  • Novelty budget — track unique facts extracted per turn; stop when the marginal information gain drops below a threshold.
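The action-fingerprint and state-oscillation detectors can be sketched together. This is an illustrative sketch under assumptions: the class name `StagnationDetector` is hypothetical, "N repeats" is interpreted as N repeats after the first call, and the oscillation check ignores whether new facts arrived between calls, which a real detector would also need to consider.

```python
import hashlib
import json
from collections import Counter
from typing import Optional


def fingerprint(tool_name: str, args: dict) -> str:
    """Hash the tool name plus a key-order-independent JSON encoding of its arguments."""
    canonical = json.dumps(args, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tool_name}:{canonical}".encode()).hexdigest()


class StagnationDetector:
    def __init__(self, max_repeats: int = 2) -> None:
        # Assumed meaning: allow the first call plus max_repeats repeats.
        self.max_repeats = max_repeats
        self.counts: Counter = Counter()
        self.history: list[str] = []  # fingerprints in call order

    def observe(self, tool_name: str, args: dict) -> Optional[str]:
        """Record one action; return a reason string if it looks stagnant, else None."""
        fp = fingerprint(tool_name, args)
        self.counts[fp] += 1
        self.history.append(fp)
        if self.counts[fp] > self.max_repeats:
            return "repeated_action"
        # A -> B -> A: the current action equals the one two steps back.
        if (len(self.history) >= 3
                and self.history[-1] == self.history[-3]
                and self.history[-2] != self.history[-1]):
            return "oscillation"
        return None


det = StagnationDetector(max_repeats=2)
window = {"start": "10:00", "end": "10:05", "filter": "gateway"}
r1 = det.observe("search_logs", window)
# Same arguments in a different key order still produce the same fingerprint.
r2 = det.observe("search_logs", {"filter": "gateway", "end": "10:05", "start": "10:00"})
r3 = det.observe("search_logs", window)
print(r1, r2, r3)  # None None repeated_action

osc = StagnationDetector()
steps = [("get_user", {"id": 7}), ("list_orders", {"user_id": 7}), ("get_user", {"id": 7})]
results = [osc.observe(tool, args) for tool, args in steps]
print(results)  # [None, None, 'oscillation']
```

Flagging every A → B → A pattern is deliberately aggressive. A softer policy would inject a nudge on the first hit and stop only on the second.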

4. Failure and handoff paths

Not every stop is success. Define explicit failure exits:

  • Tool error budget — after K consecutive tool failures, stop and surface errors to the user (see tool error handling).
  • Confidence floor — if the model's self-reported confidence or a calibrated scorer stays below a threshold for two turns, escalate.
  • Human-in-the-loop handoff — package scratchpad, tool trace, and partial answer for an operator queue.
  • Degraded single-shot fallback — one final non-tool completion using compressed context when budgets exhaust.
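A tool error budget and a handoff package might look like the following. Treat it as a sketch only: `ToolErrorBudget`, `build_handoff`, and the payload keys are hypothetical names, and K is set to 3 purely for the demo.

```python
from dataclasses import dataclass, field


@dataclass
class ToolErrorBudget:
    """Stop after K consecutive tool failures; any success resets the streak."""
    max_consecutive: int = 3
    consecutive: int = 0
    recent_errors: list = field(default_factory=list)

    def record(self, ok: bool, error: str = "") -> bool:
        """Record one tool outcome; return True when the loop should stop and escalate."""
        if ok:
            self.consecutive = 0
            self.recent_errors.clear()
            return False
        self.consecutive += 1
        self.recent_errors.append(error)
        return self.consecutive >= self.max_consecutive


def build_handoff(scratchpad: str, tool_trace: list, partial_answer: str, reason: str) -> dict:
    """Bundle what an operator needs to pick up the session (assumed payload shape)."""
    return {
        "reason": reason,
        "partial_answer": partial_answer,
        "scratchpad": scratchpad,
        "tool_trace": tool_trace,
    }


budget = ToolErrorBudget(max_consecutive=3)
outcomes = [(False, "503"), (True, ""), (False, "timeout"), (False, "timeout"), (False, "503")]

stopped_at = None
for turn, (ok, err) in enumerate(outcomes, start=1):
    if budget.record(ok, err):
        stopped_at = turn
        break

handoff = build_handoff(
    scratchpad="gateway timeouts started after deploy",
    tool_trace=list(budget.recent_errors),
    partial_answer="Root cause not confirmed.",
    reason="tool_error_budget",
)
print(stopped_at, handoff["tool_trace"])  # 5 ['timeout', 'timeout', '503']
```

The single success on turn 2 resets the streak, which is why the stop fires on turn 5 rather than turn 3.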

Designing the termination state machine

Implement termination as explicit states, not scattered if checks:

  1. RUNNING — model may act; budgets decrement after each turn.
  2. VERIFYING — optional; run output schema validation or a cheap chain-of-verification pass before accepting a finish signal.
  3. STOPPED_SUCCESS — return validated answer to caller.
  4. STOPPED_BUDGET — return partial answer + reason code.
  5. STOPPED_STAGNATION — return best-effort summary + suggest human review.
  6. ESCALATED — enqueue for human with full trace.

Log the first triggering condition for every session. Teams that only log “max iterations exceeded” miss opportunities to tune stagnation thresholds or improve tool schemas.
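The six states can be encoded as an enum with an explicit transition table, so that an illegal move fails loudly and the first stop reason is recorded exactly once. The state names come from the list above; the allowed edges, the `Session` class, and the reason strings are assumptions for this sketch.

```python
from enum import Enum
from typing import Optional


class State(Enum):
    RUNNING = "running"
    VERIFYING = "verifying"
    STOPPED_SUCCESS = "stopped_success"
    STOPPED_BUDGET = "stopped_budget"
    STOPPED_STAGNATION = "stopped_stagnation"
    ESCALATED = "escalated"


# Assumed edges. States that are not keys here are terminal.
ALLOWED = {
    State.RUNNING: {State.VERIFYING, State.STOPPED_BUDGET,
                    State.STOPPED_STAGNATION, State.ESCALATED},
    State.VERIFYING: {State.RUNNING, State.STOPPED_SUCCESS, State.ESCALATED},
}


class Session:
    def __init__(self) -> None:
        self.state = State.RUNNING
        self.stop_reason: Optional[str] = None

    def transition(self, new_state: State, reason: str) -> None:
        """Move to new_state, rejecting edges that the table does not allow."""
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"illegal transition {self.state.name} -> {new_state.name}")
        self.state = new_state
        if new_state not in ALLOWED:
            # Terminal states have no outgoing edges, so this is set only once:
            # it is the first condition that stopped the session.
            self.stop_reason = reason


session = Session()
session.transition(State.VERIFYING, "finish_tool_called")
session.transition(State.RUNNING, "schema_invalid")  # verification failed, keep working
session.transition(State.STOPPED_STAGNATION, "repeated_action:search_logs")

try:
    session.transition(State.STOPPED_BUDGET, "max_iterations")
    rejected = False
except ValueError:
    rejected = True  # a later trigger cannot overwrite the first stop reason

print(session.state.name, session.stop_reason, rejected)
# STOPPED_STAGNATION repeated_action:search_logs True
```

Because terminal states reject further transitions, a budget cap that would also have fired on the next turn cannot mask the stagnation signal in the logs.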

Harbor Analytics refactor (worked example)

Harbor's log-triage agent used LangGraph-style nodes: plan, search, compress, diagnose, answer. Before termination work, the graph allowed unlimited revisits to the search node. The refactor added:

  • A session envelope — 12 iterations, 120k tokens, $3.00 cost, 90s wall clock (whichever hits first).
  • A search fingerprint set — canonicalized JSON args; duplicate fingerprint aborts with a forced transition to diagnose using cached observations.
  • A required-evidence checklist — error code, first failing span ID, and deployment version must appear in scratchpad facts before submit_incident_summary is callable.
  • A pre-stop compression pass — when 80% of token budget is consumed, run observation summarization (see tool result compression) and inject “budget warning” into the system prompt.

Runaway loops fell 91%. Median resolution time improved because agents stopped sooner after sufficient evidence rather than chasing redundant log pages.
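Two pieces of the refactor, the required-evidence gate and the 80% budget warning, are easy to illustrate. The code below is a sketch and not Harbor's actual implementation: the fact keys, the helper names, and the base tool list are hypothetical, while the 120k token budget, the 80% threshold, and the `submit_incident_summary` tool name come from the description above.

```python
REQUIRED_EVIDENCE = ("error_code", "first_failing_span_id", "deployment_version")
TOKEN_BUDGET = 120_000
WARNING_PERCENT = 80


def missing_evidence(facts: dict) -> list:
    """List the required facts that are absent or empty in the scratchpad."""
    return [key for key in REQUIRED_EVIDENCE if not facts.get(key)]


def callable_tools(facts: dict, base_tools: list) -> list:
    """Expose submit_incident_summary only once every required fact is present."""
    tools = list(base_tools)
    if not missing_evidence(facts):
        tools.append("submit_incident_summary")
    return tools


def needs_budget_warning(tokens_used: int) -> bool:
    """True once the warning threshold is reached (integer math avoids float rounding)."""
    return tokens_used * 100 >= WARNING_PERCENT * TOKEN_BUDGET


facts = {"error_code": "GATEWAY_TIMEOUT", "first_failing_span_id": "span-42"}
print(missing_evidence(facts))                # ['deployment_version']
print(callable_tools(facts, ["search_logs"]))  # ['search_logs']

facts["deployment_version"] = "2024.06.1"
print(callable_tools(facts, ["search_logs"]))  # ['search_logs', 'submit_incident_summary']
print(needs_budget_warning(90_000), needs_budget_warning(96_000))  # False True
```

Withholding the finish tool from the tool list, rather than rejecting the call afterward, means the model never sees an option it is not yet allowed to take.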

Technique decision table

| Approach | Best for | Weak when |
| --- | --- | --- |
| Max iterations only | Prototypes, cheap models, narrow tool sets | Long-horizon research; burns budget on stagnation |
| Token/cost governor | Multi-tenant SaaS, variable task depth | Needs per-tenant accounting and compression |
| Finish-tool + schema validation | Support bots, form-filling agents | Model forgets to call finish; needs nudges |
| Stagnation fingerprinting | Search-heavy agents, analytics copilots | Legitimate pagination looks like repeats |
| External verifier / CoVe | High-stakes answers, compliance workflows | Extra latency and cost per session |
| Harbor-style layered policy | Production agents with diverse tools | More orchestrator code to maintain |
| Open-ended loop (no termination) | Local dev only | Never ship to production |

Common pitfalls

  • Stopping only on max iterations — agents burn the full budget every time instead of exiting early when done.
  • Counting iterations without tool latency — five slow SQL calls can exceed SLA while under the step cap.
  • Ignoring partial success — returning empty output on budget exhaustion instead of the best draft so far.
  • Fingerprinting without canonicalization: {"limit": 100} vs {"limit":"100"} bypasses duplicate detection.
  • Stopping before error recovery — stagnation halt should not trigger on legitimate retry after transient 503s; reset fingerprints on successful recovery.
  • No observability on stop reason — you cannot tune thresholds you do not measure.
  • Termination only in the prompt — “Stop when you have enough information” is unreliable; enforce in code.
  • Same limits for all task classes — refund lookups need fewer steps than multi-doc legal review.
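The canonicalization pitfall is worth seeing concretely, because sorting keys alone does not make `100` and `"100"` hash the same. The sketch below adds a normalization pass before hashing. The coercion rule (integer-looking strings become integers) is an assumption that is only safe for fields the tool schema declares numeric; applied blindly it would also rewrite identifiers such as "007".

```python
import hashlib
import json
import re

_INT_PATTERN = re.compile(r"-?[0-9]+")


def normalize(value):
    """Recursively coerce values so equivalent arguments compare equal (assumed rules)."""
    if isinstance(value, dict):
        return {str(k): normalize(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [normalize(v) for v in value]
    if isinstance(value, str):
        text = value.strip()
        # Assumption: an integer-looking string was meant as an integer.
        if _INT_PATTERN.fullmatch(text):
            return int(text)
        return text
    return value


def naive_fingerprint(tool: str, args: dict) -> str:
    """Hash the raw JSON encoding; sensitive to key order, types, and whitespace."""
    return hashlib.sha256(f"{tool}:{json.dumps(args)}".encode()).hexdigest()


def canonical_fingerprint(tool: str, args: dict) -> str:
    """Hash a normalized, key-sorted, compact JSON encoding."""
    payload = json.dumps(normalize(args), sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{tool}:{payload}".encode()).hexdigest()


a = {"limit": 100, "filter": "gateway"}
b = {"filter": " gateway ", "limit": "100"}

print(naive_fingerprint("search_logs", a) == naive_fingerprint("search_logs", b))          # False
print(canonical_fingerprint("search_logs", a) == canonical_fingerprint("search_logs", b))  # True
```

A schema-driven variant, which coerces each argument to the type its tool definition declares, avoids the guesswork in the string rule.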

Production checklist

  • Define per-task-class budget envelopes (iterations, tokens, cost, wall clock).
  • Require a structured finish tool or validated final-answer schema for success exits.
  • Implement action fingerprinting with canonical JSON argument hashing.
  • Detect observation similarity stagnation across consecutive tool results.
  • Wire tool error budgets to escalation, not blind retry loops.
  • Inject budget-warning system messages at 70–80% consumption.
  • Return partial answers with explicit reason codes on non-success stops.
  • Log first triggering stop condition per session for tuning.
  • Pair termination with observation compression before context overflows.
  • Test runaway scenarios in CI (duplicate-tool mocks, empty result sets).
  • Document handoff payload format for human-in-the-loop queues.
  • Review stop-reason dashboards weekly during agent rollout.
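The runaway-scenario item in the checklist can be covered with a small deterministic test that needs no model call. The following sketch uses a toy orchestrator and mock callbacks; `run_loop`, `stuck_model`, `paging_model`, and `empty_tool` are hypothetical stand-ins, not part of any real framework.

```python
import json


def run_loop(next_action, run_tool, max_iterations=10, max_repeats=2):
    """Toy orchestrator: returns (stop_reason, turns_used). Not a real agent runtime."""
    seen = {}
    for turn in range(1, max_iterations + 1):
        tool, args = next_action(turn)
        key = (tool, json.dumps(args, sort_keys=True))
        seen[key] = seen.get(key, 0) + 1
        if seen[key] > max_repeats:
            return "stagnation", turn
        run_tool(tool, args)
    return "max_iterations", max_iterations


def stuck_model(turn):
    """Mock model that always requests the same log window."""
    return "search_logs", {"window": "10:00-10:05", "filter": "gateway"}


def paging_model(turn):
    """Mock model that legitimately pages through results."""
    return "search_logs", {"window": "10:00-10:05", "page": turn}


def empty_tool(tool, args):
    """Mock tool that always returns an empty result set."""
    return []


def test_duplicate_calls_stop_early():
    assert run_loop(stuck_model, empty_tool) == ("stagnation", 3)


def test_pagination_is_not_flagged():
    assert run_loop(paging_model, empty_tool, max_iterations=5) == ("max_iterations", 5)


test_duplicate_calls_stop_early()
test_pagination_is_not_flagged()
```

The second test guards against the opposite failure: because the page number is part of the arguments, pagination is not mistaken for stagnation and the run ends on the iteration cap instead.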

Key takeaways

  • Agent loop termination is a runtime policy layer — not a prompt suggestion — that prevents cost blowups, SLA breaches, and unsafe tool spam.
  • Combine hard budgets, goal predicates, stagnation detection, and graceful handoff paths; no single signal is sufficient alone.
  • Harbor Analytics cut runaway loops 91% by fingerprinting duplicate searches, enforcing evidence checklists, and governing token spend.
  • Always log which stop condition fired first; that metric drives tuning more than raw iteration counts.
  • Pair termination with ReAct loop design, tool error handling, and result compression — stopping criteria are only as good as the observations they evaluate.
