Guide

LLM agent cancellation, timeout and lifecycle management explained

Harbor Platform's schema-migration agent ran inside a chat UI with a prominent Stop button. An engineer clicked it when the agent started bulk-inserting rows into a staging table that looked wrong. The streaming tokens stopped immediately — but the background worker kept calling execute_sql for another three minutes. By the time the process exited, 840,000 rows sat in a half-migrated state; rollback scripts took four hours. Cancel had only torn down the Server-Sent Events connection, not the orchestration loop or in-flight tool calls.

Cancellation and lifecycle management define how an agent run is born, paused, stopped, or allowed to finish — and what happens to side effects when it does not complete normally. This is distinct from loop termination (when the model decides it is done) and from checkpointing (how state survives restarts). Harbor replaced UI-only stream abort with a typed run finite-state machine, cooperative cancel tokens propagated into tools, and compensating workflows on mutating steps. Orphaned partial migrations fell from 19% of cancelled runs to 1.2%, and mean cleanup time dropped from 2.4 hours to eleven minutes. This guide covers cancel taxonomy, timeout layers, run FSM design, tool and subagent cleanup, integration with durability and observability, the Harbor Platform refactor, a technique decision table, pitfalls, and a production checklist.

Cancel and timeout taxonomy

Production agents receive stop signals from many sources. Treating them all as “user clicked Stop” hides different cleanup requirements.

Who initiates the stop?

User cancel — explicit UI abort, ESC key, or API DELETE /runs/{id}
Policy timeout — wall-clock SLA (e.g. 120 s support reply), per-step stall detector, or idle-since-last-tool threshold
Budget exhaustion — token, cost, or tool-call count cap hit (overlaps termination but triggers hard external kill)
Parent cascade — supervisor run cancelled; children must inherit the signal within one scheduler tick
Operational drain — deploy, scale-in, or circuit-breaker open; finish current mutating step then refuse new work

Cooperative vs forced shutdown

Cooperative cancellation sets a shared AbortSignal (or equivalent) that the agent loop and each tool wrapper poll between LLM calls and between chunks of I/O. The runtime waits up to a grace window for in-flight HTTP requests to complete or abort cleanly. Forced cancellation kills the worker process or thread once the grace window expires. It is fast, but it risks leaving external systems inconsistent unless every mutating tool is idempotent or has a compensating action registered before execution.

Rule of thumb: cooperative for read-only and idempotent tools; forced only after compensating hooks are wired or the blast radius is provably zero.
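A minimal sketch of the two modes, assuming an asyncio-based runtime where asyncio.Event stands in for the AbortSignal equivalent. The names chunked_tool, stubborn_tool, and stop_run are hypothetical and exist only for illustration. The runtime requests a cooperative stop first, waits out the grace window, and escalates to a forced cancel only if the tool has not yielded.

```python
import asyncio


async def chunked_tool(cancel: asyncio.Event, chunks: int, chunk_seconds: float) -> str:
    """Hypothetical cooperative tool: polls the cancel token between chunks."""
    for _ in range(chunks):
        if cancel.is_set():
            return "stopped_cooperatively"
        await asyncio.sleep(chunk_seconds)  # stands in for one chunk of I/O
    return "finished"


async def stubborn_tool(cancel: asyncio.Event) -> str:
    """Hypothetical tool that never looks at the token."""
    await asyncio.sleep(3600)
    return "finished"


async def stop_run(task: asyncio.Task, cancel: asyncio.Event, grace_seconds: float) -> str:
    """Request a cooperative stop, then force-cancel after the grace window."""
    cancel.set()  # cooperative request
    done, _pending = await asyncio.wait({task}, timeout=grace_seconds)
    if task in done:
        return task.result()
    task.cancel()  # forced path: the tool ignored the token
    try:
        await task
    except asyncio.CancelledError:
        pass
    return "forced"
```

In a real deployment the forced path would typically be a process or container kill rather than task.cancel(), which is why compensations need to be registered before it is ever taken.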

Run lifecycle as an explicit FSM

Implicit status strings ("running", "error") sprawl in logs. A small finite-state machine makes transitions auditable and prevents illegal jumps (e.g. resuming a cancelled run without an explicit retry).

Recommended states

queued — accepted, not yet scheduled
running — LLM or tool actively executing
awaiting_approval — blocked on human gate; cancel still allowed
cancelling — stop requested; draining in-flight tools
cancelled — terminal; partial artifacts retained with reason
completed / failed — other terminals

Transition guards

Every edge should log who triggered it and why. From running to cancelling, set a monotonic cancel_requested_at timestamp and increment a generation counter so stale async callbacks cannot advance state after a newer run supersedes the session. Persist FSM state in the same store as checkpoints so a crashed worker during cancelling can resume cleanup on restart.
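The guards above can be sketched as a small in-memory state machine. This is an illustration under assumptions, not a reference design: the exact edge set in LEGAL is one plausible reading of the recommended states, and the Run class, its transition method, and the audit tuple layout are hypothetical. Persistence to the checkpoint store is left out.

```python
import time
from dataclasses import dataclass, field
from typing import List, Optional

# Assumed legal edges; adjust to your own policy.
LEGAL = {
    "queued": {"running", "cancelling"},
    "running": {"awaiting_approval", "cancelling", "completed", "failed"},
    "awaiting_approval": {"running", "cancelling"},
    "cancelling": {"cancelled", "failed"},
    "cancelled": set(),
    "completed": set(),
    "failed": set(),
}


class IllegalTransition(Exception):
    pass


@dataclass
class Run:
    run_id: str
    state: str = "queued"
    generation: int = 0
    cancel_requested_at: Optional[float] = None
    audit: List[tuple] = field(default_factory=list)

    def transition(self, new_state: str, actor: str, reason: str,
                   generation: Optional[int] = None) -> bool:
        # Callbacks pass the generation they started under; stale ones are dropped.
        if generation is not None and generation != self.generation:
            return False
        if new_state not in LEGAL[self.state]:
            raise IllegalTransition(f"{self.state} -> {new_state}")
        if new_state == "cancelling":
            self.cancel_requested_at = time.monotonic()
            self.generation += 1
        self.audit.append((self.state, new_state, actor, reason))  # who and why
        self.state = new_state
        return True
```

A tool callback that captured generation 0 before Stop was clicked gets False back instead of moving the run out of cancelling, and an attempt to go from cancelled back to running raises IllegalTransition.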

Timeout layers that stack, not compete

Teams often set one global timeout and wonder why agents still run away. Layer timeouts so each catches a different failure mode.

Wall-clock envelope

Hard cap on total run duration. Start the clock at one fixed point, either enqueue or first token, and document which. When the cap fires, transition to cancelling even if the model is still issuing tool calls. Pair it with a user-visible countdown in long-running UIs.

Per-step and per-tool budgets

Limit a single search_logs or execute_sql call to 30 s. Prevents one hung JDBC connection from consuming the entire wall-clock budget silently. Tool wrappers should map provider timeouts to structured errors the planner can replan around — unless cancel already fired, in which case skip replan and drain.
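One way to express a per-tool budget, sketched with asyncio.wait_for. ToolResult and call_with_budget are hypothetical names, and the replan flag is an assumed convention for telling the planner whether it may try another approach. The flag is cleared when cancel has already fired, so the loop drains instead of replanning.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: str = ""
    replan: bool = False  # may the planner try a different approach?


async def call_with_budget(tool: Callable[[], Awaitable[Any]],
                           budget_seconds: float,
                           cancel: asyncio.Event) -> ToolResult:
    """Run one tool call under its own time budget and return a structured result."""
    try:
        value = await asyncio.wait_for(tool(), timeout=budget_seconds)
        return ToolResult(ok=True, value=value)
    except asyncio.TimeoutError:
        # Structured error instead of a raw exception; no replan once cancel fired.
        return ToolResult(ok=False, error="tool_timeout", replan=not cancel.is_set())
```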

LLM inference timeout

Separate from tool time: streaming stalls when the model stops emitting tokens but the HTTP connection stays open. Use idle-token watchdogs (no chunk for N seconds) in addition to overall request timeout. On fire, cancel the upstream stream before the agent loop schedules another tool.
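An idle-token watchdog can be sketched as a wrapper around any async chunk iterator. The wrapper below is an assumption-laden illustration: watch_idle and StreamStalled are hypothetical names, and a real provider SDK may need its own close call in place of aclose. It raises if no chunk arrives within idle_seconds, after closing the upstream iterator.

```python
import asyncio
from typing import AsyncIterator


class StreamStalled(Exception):
    """Raised when no chunk arrives within the idle window."""


async def watch_idle(stream: AsyncIterator[str], idle_seconds: float) -> AsyncIterator[str]:
    iterator = stream.__aiter__()
    while True:
        try:
            # The timer restarts for every chunk, so this bounds idle time, not total time.
            chunk = await asyncio.wait_for(iterator.__anext__(), timeout=idle_seconds)
        except StopAsyncIteration:
            return
        except asyncio.TimeoutError:
            aclose = getattr(iterator, "aclose", None)
            if aclose is not None:
                await aclose()  # release the upstream stream before reporting the stall
            raise StreamStalled(f"no chunk for {idle_seconds}s") from None
        yield chunk
```

The overall request timeout still applies on top of this; the watchdog only catches the case where the connection is open but silent.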

Relationship to termination policy

Loop termination answers “should the agent choose to stop?” Lifecycle management answers “must the runtime stop it anyway?” Termination predicates should run first; timeouts are the backstop when the model never yields.

Tool cleanup and compensating actions

The expensive part of cancel is not stopping token generation — it is unwinding side effects already committed.

Register before execute

For mutating tools, record a compensation descriptor in the checkpoint before the call is issued: inverse SQL, S3 version id to restore, ticket id to reopen. If cancel lands mid-call, use the outbox pattern: mark the intent pending_commit and only finalize it after idempotent confirmation.
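A minimal sketch of the ordering, assuming an in-memory checkpoint. Checkpoint, run_mutating_tool, and the descriptor layout are hypothetical, and the SQL strings in the usage below are placeholders. The point is only that the descriptor is written before the side effect, so a cancel or crash mid-call still leaves something to compensate.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Checkpoint:
    compensations: List[Dict[str, Any]] = field(default_factory=list)


def run_mutating_tool(checkpoint: Checkpoint, step: int,
                      execute: Callable[[], Any],
                      compensation: Dict[str, Any]) -> Any:
    entry = {"step": step, "status": "pending_commit", "compensation": compensation}
    checkpoint.compensations.append(entry)  # recorded before any side effect
    result = execute()                      # may raise if cancelled mid-call
    entry["status"] = "committed"           # finalize only after confirmation
    return result


def needs_cleanup(checkpoint: Checkpoint) -> List[Dict[str, Any]]:
    """Every recorded entry is a cleanup candidate, newest first."""
    return list(reversed(checkpoint.compensations))
```

For example, a first call that succeeds and a second that is interrupted leave one committed and one pending_commit entry, and both still carry their descriptors:

```python
cp = Checkpoint()
run_mutating_tool(cp, 0, lambda: "ok", {"sql": "DELETE FROM staging WHERE batch_id = 'b0'"})
try:
    run_mutating_tool(cp, 1, lambda: 1 / 0, {"sql": "DELETE FROM staging WHERE batch_id = 'b1'"})
except ZeroDivisionError:
    pass
print([e["status"] for e in cp.compensations])
```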

Cancellable tool contracts

Document which tools honor AbortSignal. Long polls and subprocess runs should accept cancel and return a typed CancelledError rather than generic 500s. Read-only tools can ignore cancel mid-flight; mutating tools must not.

Idempotency keys

Retries after ambiguous cancel (client disconnected, server unsure if tool finished) must not double-charge or double-insert. Scope keys per run id + step index; store completion receipts in the checkpoint ledger.
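A sketch of run-scoped keys with a receipt ledger. The key format and the ReceiptLedger class are assumptions for illustration, and the ledger here is in-memory where a production one would live in the checkpoint store. A retry with the same run id and step index returns the stored receipt instead of executing again.

```python
import hashlib
from typing import Any, Callable, Dict


def idempotency_key(run_id: str, step_index: int) -> str:
    """Key scoped per run id + step index."""
    return hashlib.sha256(f"{run_id}:{step_index}".encode()).hexdigest()


class ReceiptLedger:
    def __init__(self) -> None:
        self._receipts: Dict[str, Any] = {}

    def execute_once(self, run_id: str, step_index: int, action: Callable[[], Any]) -> Any:
        key = idempotency_key(run_id, step_index)
        if key in self._receipts:
            return self._receipts[key]  # replay the receipt, do not re-execute
        result = action()
        self._receipts[key] = result    # completion receipt
        return result
```

A local ledger does not cover a crash between the action and the receipt write. Closing that gap usually means passing the same key to the downstream system so it can deduplicate on its side.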

Subagent cascade and parallel branches

When a parent spawns subagents, cancel must fan out in one transaction: parent enters cancelling, publishes cancel event on a bus keyed by root_run_id, children transition in parallel. Do not await each child sequentially — the last child may be the one holding the database lock.
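The parallel fan-out can be sketched with asyncio.gather. Here cancel_child is a hypothetical stand-in for whatever delivers the cancel event to one child and waits for it to drain, and the per-child timeout is an assumed policy for deciding when to escalate.

```python
import asyncio
from typing import Dict


async def cancel_child(child_id: str, drain_seconds: float) -> str:
    """Hypothetical child teardown; the sleep stands in for draining in-flight tools."""
    await asyncio.sleep(drain_seconds)
    return "cancelled"


async def cascade_cancel(children: Dict[str, float], per_child_timeout: float) -> Dict[str, str]:
    async def _one(child_id: str, drain_seconds: float):
        try:
            status = await asyncio.wait_for(
                cancel_child(child_id, drain_seconds), per_child_timeout
            )
        except asyncio.TimeoutError:
            status = "needs_forced_kill"
        return child_id, status

    # All children are signalled concurrently; a slow child cannot block the others.
    pairs = await asyncio.gather(*(_one(c, d) for c, d in children.items()))
    return dict(pairs)
```

With this shape, total cancel latency is bounded by the slowest child or the timeout, not by the sum of all drain times.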

For parallel plan DAG branches, cancel marks all non-terminal nodes skipped_cancelled and runs compensations in reverse topological order (dependents before prerequisites) when rollbacks have ordering constraints.
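Reverse topological ordering can be computed with the standard library's graphlib. In this sketch, depends_on maps each node to its prerequisites and executed is the set of nodes that actually ran; both names, and the node names in the usage below, are hypothetical.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def compensation_order(depends_on: Dict[str, Set[str]], executed: Set[str]) -> List[str]:
    """Return executed nodes with dependents before their prerequisites."""
    forward = list(TopologicalSorter(depends_on).static_order())  # prerequisites first
    return [node for node in reversed(forward) if node in executed]


plan = {
    "create_table": set(),
    "load_batch": {"create_table"},
    "build_index": {"load_batch"},
}
# build_index never ran (skipped_cancelled), so only two nodes need compensating.
order = compensation_order(plan, executed={"create_table", "load_batch"})
print(order)  # ['load_batch', 'create_table']
```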

Harbor Platform refactor walkthrough

Harbor's failure mode was split brain: the API gateway stopped SSE to the browser, but the Celery worker executing the agent loop had no cancel channel. Fix sequence:

Introduced a run_id-scoped Redis pub/sub cancel channel; the UI publishes to it on Stop
Agent loop checks cancel flag before every LLM call and after every tool return
SQL tool wrapped with a statement timeout, plus an AbortSignal that triggers JDBC statement cancel
Checkpoint store records migration_batch_id per insert batch for targeted delete
On cancelled terminal, auto-run compensation DAG (delete staging batches, release advisory lock)
Traces tag cancel_reason and cleanup_duration_ms for SLO review
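The loop-level check in the sequence above might look like the following sketch. It is not Harbor's code: CancelFlag is an in-memory stand-in for the run_id-scoped cancel channel, and agent_loop, call_llm, and run_tool are hypothetical names. The flag is read before every LLM call and after every tool return, so a Stop is noticed at the next step boundary at the latest.

```python
import threading
from typing import Any, Callable, List, Optional, Tuple


class CancelFlag:
    """In-memory stand-in for a run-scoped cancel channel."""

    def __init__(self) -> None:
        self._event = threading.Event()
        self.reason: Optional[str] = None

    def publish(self, reason: str) -> None:
        self.reason = reason
        self._event.set()

    def is_set(self) -> bool:
        return self._event.is_set()


def agent_loop(plan: List[str],
               call_llm: Callable[[str], str],
               run_tool: Callable[[str], Any],
               cancel: CancelFlag) -> Tuple[str, List[tuple]]:
    trace: List[tuple] = []
    for step in plan:
        if cancel.is_set():          # check before every LLM call
            return "cancelling", trace
        tool_name = call_llm(step)
        result = run_tool(tool_name)
        trace.append((tool_name, result))
        if cancel.is_set():          # check after every tool return
            return "cancelling", trace
    return "completed", trace
```

Returning "cancelling" instead of "cancelled" leaves room for the compensation pass to run before the run reaches its terminal state.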

Partial-migration incidents on cancel dropped from 19% to 1.2% of cancelled runs. Mean cleanup time fell from 2.4 hours to eleven minutes because operators no longer ran manual scripts; the runtime owned teardown.

Technique decision table

Approach	Best for	Weak when
UI stream abort only	Read-only demos, zero side effects	Any mutating tool; misleads users into thinking work stopped
Cooperative cancel + FSM	Production agents with mixed read/write tools	Runaway native code ignoring signals unless paired with kill timeout
Forced process kill	Stuck workers, memory leaks, unresponsive sandboxes	Used without a compensation registry; a kill mid-mutation leaves external state inconsistent
Let run finish after user cancel	Irreversible single-step jobs already 95% done	User trust matters, cost is running away, or the migration is heading in the wrong direction
Deploy drain (finish step, no new steps)	Graceful rollouts, Kubernetes preStop	Steps longer than drain budget — need hard cap anyway

Common pitfalls

Cancel stops UI but not worker — the Harbor failure mode; always propagate to orchestration.
No cancelling intermediate state — users see instant “cancelled” while tools still run; breaks trust and audits.
Ignoring cancel during LLM streaming — buffer tokens after stop and bill for them; close upstream on first cancel bit.
Subagents outlive parent — orphan workers become zombie cost centers and lock holders.
Compensation runs without idempotency — double-cancel retries delete production data twice.
Timeout without observability — cannot tune SLAs; emit structured cancel_reason on every terminal.
Confusing terminate with cancel — successful completion should not run rollback DAGs.

Production checklist

Define versioned run FSM with legal transitions and terminal states.
Propagate AbortSignal (or equivalent) from API through loop into tools.
Separate wall-clock, per-tool, and LLM idle-stream timeouts.
Publish cancel on durable channel keyed by run_id, not just SSE disconnect.
Register compensation descriptors before mutating tool execution.
Cascade cancel to all subagents within one scheduler tick.
Persist cancelling state in checkpoint store for crash recovery.
Run compensations in safe order; idempotency keys on cleanup steps.
Tag traces with cancel_reason, cleanup_duration_ms, orphan_side_effects.
Integration test: user cancel mid-mutation leaves system consistent.
Alert on cancel rate spikes and on runs stuck in cancelling > 60 s.
Document user-visible behavior: what Stop guarantees vs best-effort.

Key takeaways

Cancel is a runtime obligation, not a UI detail — workers and tools must receive the signal.
Harbor cut orphaned migrations from 19% to 1.2% of cancelled runs with FSM + compensation DAGs.
Layer timeouts (wall-clock, per-tool, LLM idle) instead of one opaque cap.
Cooperative shutdown needs a forced backstop and registered compensations.
Bind lifecycle to checkpointing and tracing so cancel is recoverable and measurable.