Guide
LLM agent cancellation, timeout and lifecycle management explained
Harbor Platform's schema-migration agent ran inside a chat UI with a prominent
Stop button. An engineer clicked it when the agent started bulk-inserting
rows into a staging table that looked wrong. The streaming tokens stopped immediately
— but the background worker kept calling execute_sql for another
three minutes. By the time the process exited, 840,000 rows sat in a
half-migrated state; rollback scripts took four hours. Cancel had only torn down the
Server-Sent Events connection, not the orchestration loop or in-flight tool calls.
Cancellation and lifecycle management define how an agent run is born, paused, stopped, or allowed to finish — and what happens to side effects when it does not complete normally. This is distinct from loop termination (when the model decides it is done) and from checkpointing (how state survives restarts). Harbor replaced UI-only stream abort with a typed run finite-state machine, cooperative cancel tokens propagated into tools, and compensating workflows on mutating steps. Orphaned partial migrations fell from 19% of cancelled runs to 1.2%, and mean cleanup time dropped from 2.4 hours to eleven minutes. This guide covers cancel taxonomy, timeout layers, run FSM design, tool and subagent cleanup, integration with durability and observability, the Harbor Platform refactor, a technique decision table, pitfalls, and a production checklist.
Cancel and timeout taxonomy
Production agents receive stop signals from many sources. Treating them all as “user clicked Stop” hides different cleanup requirements.
Who initiates the stop?
- User cancel: explicit UI abort, ESC key, or API DELETE /runs/{id}
- Policy timeout: wall-clock SLA (e.g. 120 s support reply), per-step stall detector, or idle-since-last-tool threshold
- Budget exhaustion: token, cost, or tool-call count cap hit (overlaps termination but triggers hard external kill)
- Parent cascade: supervisor run cancelled; children must inherit the signal within one scheduler tick
- Operational drain: deploy, scale-in, or circuit-breaker open; finish current mutating step then refuse new work
Cooperative vs forced shutdown
Cooperative cancellation sets a shared AbortSignal (or
equivalent) that the agent loop and each tool wrapper polls between LLM calls and
between chunked I/O. The runtime waits up to a grace window for in-flight HTTP requests
to complete or abort cleanly. Forced cancellation kills the worker
process or thread after the grace window — fast, but risks leaving external
systems inconsistent unless every mutating tool is idempotent or has a compensating
action registered before execution.
Rule of thumb: cooperative for read-only and idempotent tools; forced only after compensating hooks are wired or the blast radius is provably zero.
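The cooperative-then-forced sequence can be sketched as follows. This is a minimal illustration, not Harbor's implementation: CancelToken, agent_loop, stop_with_grace, and demo are hypothetical names, asyncio.sleep stands in for an LLM or tool call, and a threading.Event plays the role of AbortSignal.

```python
import asyncio
import threading


class CancelToken:
    """Shared cooperative-cancel flag; a stand-in for AbortSignal."""

    def __init__(self) -> None:
        self._event = threading.Event()

    def cancel(self) -> None:
        self._event.set()

    @property
    def cancelled(self) -> bool:
        return self._event.is_set()


async def agent_loop(token: CancelToken, steps: int, step_seconds: float) -> str:
    for _ in range(steps):
        if token.cancelled:  # polled between steps, never mid-call
            return "drained"
        await asyncio.sleep(step_seconds)  # stands in for an LLM or tool call
    return "completed"


async def stop_with_grace(task: asyncio.Task, token: CancelToken, grace_seconds: float) -> str:
    token.cancel()  # cooperative request first
    done, _pending = await asyncio.wait({task}, timeout=grace_seconds)
    if done:
        return task.result()
    task.cancel()  # forced backstop once the grace window has elapsed
    try:
        await task
    except asyncio.CancelledError:
        pass
    return "forced"


async def demo(step_seconds: float, grace_seconds: float) -> str:
    token = CancelToken()
    task = asyncio.create_task(agent_loop(token, steps=100, step_seconds=step_seconds))
    await asyncio.sleep(0)  # let the loop start its first step
    return await stop_with_grace(task, token, grace_seconds)
```

With short steps and a generous grace window the loop drains on its own; with a step longer than the grace window the runtime falls back to the forced path, which is exactly where registered compensations become necessary.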
Run lifecycle as an explicit FSM
Implicit status strings ("running", "error") sprawl in
logs. A small finite-state machine makes transitions auditable and prevents illegal
jumps (e.g. resuming a cancelled run without an explicit retry).
Recommended states
- queued: accepted, not yet scheduled
- running: LLM or tool actively executing
- awaiting_approval: blocked on human gate; cancel still allowed
- cancelling: stop requested; draining in-flight tools
- cancelled: terminal; partial artifacts retained with reason
- completed / failed: other terminals
Transition guards
Every edge should log who triggered it and why. From
running to cancelling, set a monotonic
cancel_requested_at timestamp and increment a generation counter so
stale async callbacks cannot advance state after a newer run supersedes the session.
Persist FSM state in the same store as checkpoints so a crashed worker
during cancelling can resume cleanup on restart.
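One way to encode the states and guards is a transition table plus a generation check. The sketch below is an assumption-laden illustration: the exact set of legal edges (for example queued going straight to cancelled) is a guess rather than something the guide specifies, and Run, transition, IllegalTransition, and StaleCallback are hypothetical names.

```python
import time
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional, Tuple


class RunState(str, Enum):
    QUEUED = "queued"
    RUNNING = "running"
    AWAITING_APPROVAL = "awaiting_approval"
    CANCELLING = "cancelling"
    CANCELLED = "cancelled"
    COMPLETED = "completed"
    FAILED = "failed"


S = RunState
# Assumed edge set; terminal states have no outgoing edges.
LEGAL = {
    S.QUEUED: {S.RUNNING, S.CANCELLED},
    S.RUNNING: {S.AWAITING_APPROVAL, S.CANCELLING, S.COMPLETED, S.FAILED},
    S.AWAITING_APPROVAL: {S.RUNNING, S.CANCELLING},
    S.CANCELLING: {S.CANCELLED, S.FAILED},
    S.CANCELLED: set(),
    S.COMPLETED: set(),
    S.FAILED: set(),
}


class IllegalTransition(Exception):
    pass


class StaleCallback(Exception):
    pass


@dataclass
class Run:
    state: RunState = RunState.QUEUED
    generation: int = 0
    cancel_requested_at: Optional[float] = None
    audit: List[Tuple[str, str, str, str]] = field(default_factory=list)


def transition(run: Run, to: RunState, actor: str, reason: str, generation: int) -> None:
    # Callbacks carry the generation they were issued under; older ones are rejected.
    if generation != run.generation:
        raise StaleCallback(f"generation {generation} != {run.generation}")
    if to not in LEGAL[run.state]:
        raise IllegalTransition(f"{run.state.value} -> {to.value}")
    if to is RunState.CANCELLING:
        run.cancel_requested_at = time.monotonic()
        run.generation += 1
    run.audit.append((run.state.value, to.value, actor, reason))  # who and why
    run.state = to
```

A callback issued before the cancel request still holds the old generation, so it cannot move the run to completed after the user has clicked Stop.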
Timeout layers that stack, not compete
Teams often set one global timeout and wonder why agents still run away. Layer timeouts so each catches a different failure mode.
Wall-clock envelope
Hard cap on total run duration. Starts at enqueue or first token. When fired, transition
to cancelling regardless of model enthusiasm. Pair with user-visible
countdown in long-running UIs.
Per-step and per-tool budgets
Limit a single search_logs or execute_sql call to 30 s.
Prevents one hung JDBC connection from consuming the entire wall-clock budget silently.
Tool wrappers should map provider timeouts to structured errors the planner can replan
around — unless cancel already fired, in which case skip replan and drain.
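A per-tool budget can be sketched with asyncio.wait_for. In this illustration ToolResult, call_tool, and the error strings "tool_timeout" and "cancelled" are hypothetical; hung_query and fast_query are stand-ins for real tools, and cancel_requested is assumed to be any callable that reports whether cancel has already fired.

```python
import asyncio
from dataclasses import dataclass
from typing import Any, Awaitable, Callable, Optional


@dataclass
class ToolResult:
    ok: bool
    value: Any = None
    error: Optional[str] = None  # "tool_timeout" or "cancelled"


async def call_tool(
    tool: Callable[[], Awaitable[Any]],
    timeout_s: float,
    cancel_requested: Callable[[], bool],
) -> ToolResult:
    try:
        value = await asyncio.wait_for(tool(), timeout=timeout_s)
    except asyncio.TimeoutError:
        if cancel_requested():
            # Cancel already fired: report it so the loop drains instead of replanning.
            return ToolResult(ok=False, error="cancelled")
        # Structured error the planner can replan around.
        return ToolResult(ok=False, error="tool_timeout")
    return ToolResult(ok=True, value=value)


async def hung_query() -> str:
    await asyncio.sleep(5)  # simulates a hung database connection
    return "rows"


async def fast_query() -> str:
    return "rows"
```

Because the timeout surfaces as data rather than an exception, the planner can distinguish "try another approach" from "stop and clean up" without inspecting stack traces.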
LLM inference timeout
Separate from tool time: streaming stalls when the model stops emitting tokens but the HTTP connection stays open. Use idle-token watchdogs (no chunk for N seconds) in addition to overall request timeout. On fire, cancel the upstream stream before the agent loop schedules another tool.
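An idle-token watchdog can be sketched by putting a timeout on each chunk rather than on the whole request. The names below (StreamStalled, watch_stream, fake_model_stream, collect) are hypothetical, and the fake stream merely simulates a provider that goes quiet; a real client would also need to close the underlying HTTP stream.

```python
import asyncio
from typing import AsyncIterator, List, Tuple


class StreamStalled(Exception):
    pass


async def watch_stream(stream: AsyncIterator[str], idle_s: float) -> AsyncIterator[str]:
    it = stream.__aiter__()
    try:
        while True:
            try:
                # The timeout applies to the gap before the next chunk only.
                chunk = await asyncio.wait_for(it.__anext__(), timeout=idle_s)
            except StopAsyncIteration:
                return
            except asyncio.TimeoutError:
                raise StreamStalled(f"no chunk for {idle_s} s") from None
            yield chunk
    finally:
        aclose = getattr(it, "aclose", None)
        if aclose is not None:
            await aclose()  # release the upstream stream before the loop continues


async def fake_model_stream(gaps: List[float]) -> AsyncIterator[str]:
    for i, gap in enumerate(gaps):
        await asyncio.sleep(gap)  # simulated delay before each token
        yield f"tok{i}"


async def collect(gaps: List[float], idle_s: float) -> Tuple[List[str], str]:
    tokens: List[str] = []
    try:
        async for tok in watch_stream(fake_model_stream(gaps), idle_s):
            tokens.append(tok)
    except StreamStalled:
        return tokens, "stalled"
    return tokens, "finished"
```

The overall request timeout still applies on top of this; the watchdog only catches the case where the connection stays open but nothing arrives.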
Relationship to termination policy
Loop termination answers “should the agent choose to stop?” Lifecycle management answers “must the runtime stop it anyway?” Termination predicates should run first; timeouts are the backstop when the model never yields.
Tool cleanup and compensating actions
The expensive part of cancel is not stopping token generation — it is unwinding side effects already committed.
Register before execute
For mutating tools, record a compensation descriptor in the checkpoint
before the call executes: inverse SQL, S3 version id to restore, ticket id to
reopen. To survive a cancel that lands mid-call, use the outbox pattern: mark the intent
pending_commit and finalize only after idempotent confirmation.
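The ordering matters more than the storage, so a sketch with an in-memory ledger is enough to show it. Checkpoint, run_mutating_step, and compensations_to_run are hypothetical names, the "committed" status and the staging table in the example SQL are assumptions, and a real checkpoint store would persist each entry durably before the side effect runs.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Checkpoint:
    ledger: List[Dict[str, str]] = field(default_factory=list)


def run_mutating_step(
    checkpoint: Checkpoint,
    step_id: str,
    compensation: str,
    action: Callable[[], None],
) -> None:
    entry = {"step": step_id, "compensation": compensation, "status": "pending_commit"}
    checkpoint.ledger.append(entry)  # recorded before any side effect happens
    action()  # may raise if cancel lands mid-call; entry then stays pending_commit
    entry["status"] = "committed"  # finalized only after the call confirmed


def compensations_to_run(checkpoint: Checkpoint) -> List[str]:
    # pending_commit entries have an ambiguous outcome, so they are compensated too.
    # Newest first, so later steps are undone before earlier ones.
    return [e["compensation"] for e in reversed(checkpoint.ledger)]
```

If the worker dies between the append and the status update, the ledger still names the inverse operation, which is what lets cleanup proceed without a human reading logs.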
Cancellable tool contracts
Document which tools honor AbortSignal. Long polls and subprocess runs
should accept cancel and return a typed CancelledError rather than generic
500s. Read-only tools can ignore cancel mid-flight; mutating tools must not.
Idempotency keys
Retries after ambiguous cancel (client disconnected, server unsure if tool finished) must not double-charge or double-insert. Scope keys per run id + step index; store completion receipts in the checkpoint ledger.
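A key scoped to run id and step index, checked against stored receipts, might look like the sketch below. ReceiptLedger, idempotency_key, and execute_once are hypothetical names, and the dictionary stands in for the checkpoint ledger.

```python
from typing import Any, Callable, Dict


def idempotency_key(run_id: str, step_index: int) -> str:
    return f"{run_id}:{step_index}"


class ReceiptLedger:
    """In-memory stand-in for completion receipts kept in the checkpoint store."""

    def __init__(self) -> None:
        self._receipts: Dict[str, Any] = {}

    def execute_once(self, run_id: str, step_index: int, action: Callable[[], Any]) -> Any:
        key = idempotency_key(run_id, step_index)
        if key in self._receipts:
            return self._receipts[key]  # retry after ambiguous cancel: replay the receipt
        result = action()
        self._receipts[key] = result
        return result
```

For tools that call external systems, the same key would typically also be sent to the downstream service so that it can deduplicate on its side; this sketch covers only the local check.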
Subagent cascade and parallel branches
When a parent spawns subagents, cancel must fan out in one transaction:
parent enters cancelling, publishes cancel event on a bus keyed by
root_run_id, children transition in parallel.
Do not await each child sequentially — the last child may be the one holding the
database lock.
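The parallel fan-out can be sketched with asyncio.gather. Here cancel_child and cascade_cancel are hypothetical names, the drain durations are made up, and asyncio.sleep stands in for each child draining its in-flight tools.

```python
import asyncio
from typing import Dict, List


async def cancel_child(name: str, drain_seconds: float, finished: List[str]) -> None:
    await asyncio.sleep(drain_seconds)  # stands in for draining in-flight tools
    finished.append(name)


async def cascade_cancel(children: Dict[str, float]) -> List[str]:
    finished: List[str] = []
    # All children receive the signal at once; one failing child does not block the rest.
    await asyncio.gather(
        *(cancel_child(name, secs, finished) for name, secs in children.items()),
        return_exceptions=True,
    )
    return finished
```

In a call such as cascade_cancel({"slow_reader": 0.2, "lock_holder": 0.01}), the lock holder is listed last but finishes first, because it is not queued behind the slower sibling.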
For parallel plan DAG branches, cancel marks all non-terminal nodes
skipped_cancelled and runs compensations in reverse topological order
(dependents before prerequisites) when rollbacks have ordering constraints.
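The ordering rule can be sketched with the standard-library graphlib module (Python 3.9+). The function name cancel_dag, the status strings other than skipped_cancelled, and the node names in the usage note are assumptions made for illustration.

```python
from graphlib import TopologicalSorter
from typing import Dict, List, Set


def cancel_dag(prereqs: Dict[str, Set[str]], status: Dict[str, str]) -> List[str]:
    """Mark unfinished nodes skipped_cancelled and return the compensation order.

    prereqs maps each node to the set of nodes it depends on.
    """
    terminal = {"completed", "failed"}
    for node in prereqs:
        if status.get(node) not in terminal:
            status[node] = "skipped_cancelled"
    # static_order() yields prerequisites first; reversing it puts dependents first.
    forward = list(TopologicalSorter(prereqs).static_order())
    return [n for n in reversed(forward) if status.get(n) == "completed"]
```

For a chain where insert_batches depends on create_staging and build_index depends on insert_batches, cancelling while build_index is still running would compensate insert_batches before create_staging, so rows are removed before the table they live in.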
Harbor Platform refactor walkthrough
Harbor's failure mode was split brain: the API gateway stopped SSE to the browser, but the Celery worker executing the agent loop had no cancel channel. Fix sequence:
- Introduced a run_id-scoped Redis pub/sub cancel channel; the UI publishes on Stop
- Agent loop checks cancel flag before every LLM call and after every tool return
- SQL tool wrapped with statement timeout + AbortSignal on JDBC cancel
- Checkpoint store records migration_batch_id per insert batch for targeted delete
- On cancelled terminal, auto-run compensation DAG (delete staging batches, release advisory lock)
- Traces tag cancel_reason and cleanup_duration_ms for SLO review
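The two check points in the agent loop can be sketched as below. This is not Harbor's code: InMemoryCancelChannel is a hypothetical stand-in for the run_id-scoped Redis channel, and run_agent, call_llm, and call_tool are illustrative names for the loop and its injected dependencies.

```python
from typing import Callable, List, Set, Tuple


class InMemoryCancelChannel:
    """Hypothetical stand-in for a run_id-scoped cancel channel."""

    def __init__(self) -> None:
        self._cancelled: Set[str] = set()

    def publish_cancel(self, run_id: str) -> None:
        self._cancelled.add(run_id)

    def is_cancelled(self, run_id: str) -> bool:
        return run_id in self._cancelled


def run_agent(
    run_id: str,
    channel: InMemoryCancelChannel,
    plan: List[str],
    call_llm: Callable[[str], str],
    call_tool: Callable[[str], None],
) -> Tuple[str, List[str]]:
    executed: List[str] = []
    for step in plan:
        if channel.is_cancelled(run_id):  # check before every LLM call
            return "cancelling", executed
        tool_name = call_llm(step)
        call_tool(tool_name)
        executed.append(tool_name)
        if channel.is_cancelled(run_id):  # check again after every tool return
            return "cancelling", executed
    return "completed", executed
```

The list of executed tools returned alongside the status is what a compensation step would consume; the loop itself only stops scheduling new work.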
Partial-migration incidents on cancel dropped from 19% to 1.2%. Mean cleanup time fell from 2.4 hours to eleven minutes because operators no longer ran manual scripts; the runtime owned teardown.
Technique decision table
| Approach | Best for | Weak when |
|---|---|---|
| UI stream abort only | Read-only demos, zero side effects | Any mutating tool; misleads users into thinking work stopped |
| Cooperative cancel + FSM | Production agents with mixed read/write tools | Runaway native code ignoring signals unless paired with kill timeout |
| Forced process kill | Stuck workers, memory leaks, unresponsive sandboxes | Without compensation registry — guarantees inconsistency |
| Let run finish after user cancel | Irreversible single-step jobs already 95% done | User trust; runaway cost; wrong-direction migrations |
| Deploy drain (finish step, no new steps) | Graceful rollouts, Kubernetes preStop | Steps longer than drain budget — need hard cap anyway |
Common pitfalls
- Cancel stops UI but not worker: the Harbor failure mode; always propagate to orchestration.
- No cancelling intermediate state: users see instant “cancelled” while tools still run; breaks trust and audits.
- Ignoring cancel during LLM streaming: the upstream model keeps generating tokens after Stop and they are still billed; close the upstream stream on the first cancel signal.
- Subagents outlive parent: orphan workers become zombie cost centers and lock holders.
- Compensation runs without idempotency: double-cancel retries delete production data twice.
- Timeout without observability: cannot tune SLAs; emit structured cancel_reason on every terminal.
- Confusing terminate with cancel: successful completion should not run rollback DAGs.
Production checklist
- Define versioned run FSM with legal transitions and terminal states.
- Propagate AbortSignal (or equivalent) from API through loop into tools.
- Separate wall-clock, per-tool, and LLM idle-stream timeouts.
- Publish cancel on durable channel keyed by run_id, not just SSE disconnect.
- Register compensation descriptors before mutating tool execution.
- Cascade cancel to all subagents within one scheduler generation.
- Persist cancelling state in checkpoint store for crash recovery.
- Run compensations in safe order; idempotency keys on cleanup steps.
- Tag traces with cancel_reason, cleanup_duration_ms, orphan_side_effects.
- Integration test: user cancel mid-mutation leaves system consistent.
- Alert on cancel rate spikes and on runs stuck in cancelling > 60 s.
- Document user-visible behavior: what Stop guarantees vs best-effort.
Key takeaways
- Cancel is a runtime obligation, not a UI detail — workers and tools must receive the signal.
- Harbor cut orphaned migrations 19% → 1.2% with FSM + compensation DAGs.
- Layer timeouts (wall-clock, per-tool, LLM idle) instead of one opaque cap.
- Cooperative shutdown needs a forced backstop and registered compensations.
- Bind lifecycle to checkpointing and tracing so cancel is recoverable and measurable.
Related reading
- Durable agent execution — persisting FSM state and compensation receipts
- Agent loop termination — when the model should stop vs when the runtime must
- Retry and fallback resilience — after ambiguous cancel or timeout errors
- Agent observability — spans for cancel and cleanup SLOs