Guide

LLM durable agent execution and checkpointing explained

Harbor Logistics’ vendor onboarding agent was designed for a six-to-forty-eight-hour lifecycle: collect W-9s, run sanctions screening, open a NetSuite vendor record, route a purchase order through finance, and provision API credentials. It worked beautifully in staging where nothing restarted. In production, a routine deploy at hour thirty-six wiped the in-memory ReAct loop. The resumed process had no memory of the PO it had already filed. It called create_purchase_order again. Finance approved a duplicate $84,000 commitment before anyone noticed. Over three months, 23% of long runs produced at least one duplicate side effect after crash or deploy.

Durable agent execution treats each agent run as a persisted workflow: checkpoint state after every tool call, record idempotency keys for mutating actions, pause cleanly at human approval gates, and resume from the last committed step instead of replaying from scratch. This guide covers run state machines, checkpoint schemas, side-effect ledgers, pause/resume semantics, integration with observability and tool error recovery, the Harbor Logistics refactor, a technique decision table versus in-memory loops and generic job queues, pitfalls, and a production checklist.

Why in-memory agents fail in production

A typical prototype keeps conversation history, tool results, and plan state in a Python dict or Redis session keyed by session_id. That works until reality intervenes:

  • Process restarts — deploys, OOM kills, autoscaling scale-in, spot instance preemption.
  • Long human waits — legal review over a weekend; the worker must not hold memory or time out.
  • Partial tool success — payment API returns 200 but the connection drops before the agent records the receipt.
  • At-least-once delivery — queue consumers retry messages; the same step may execute twice without guards.
  • Multi-worker routing — step 4 runs on pod A; after resume, step 5 lands on pod B with no shared RAM.

Naive “restart the whole prompt from the beginning” wastes tokens, duplicates expensive tool calls, and can violate compliance if a mutating action runs twice. Durable execution separates orchestration state (what step are we on, what is committed) from ephemeral inference (the next model call).

Run state machine

Model each agent task as a finite state machine with explicit transitions. A minimal production set:

  • pending — accepted, not yet started.
  • running — actively executing model/tool steps.
  • waiting_tool — external async tool in flight (webhook expected).
  • waiting_human — blocked on approval or input from a review queue.
  • paused — operator or budget policy halted the run.
  • completed — terminal success; no further steps.
  • failed — terminal error; may allow manual retry from last checkpoint.
  • cancelled — user or policy aborted; side effects already committed stand.

Transitions must be atomic in your persistence layer: only one worker may move running → waiting_tool for a given run_id. Use optimistic locking (version column) or row-level locks. Emit state changes to your trace backend so operators see “stuck in waiting_human 52 hours” without reading logs.

Checkpoint schema

Persist a checkpoint after every committed step — not after every token streamed. A practical JSON document:

{
  "run_id": "run_8f3a…",
  "trace_id": "trace_91bc…",
  "step_index": 7,
  "agent_version": "vendor-onboard/v14",
  "prompt_template_hash": "a4e2c1…",
  "messages": [ /* full LLM context at this step */ ],
  "plan": { "current_goal": "provision_api_keys", "subtasks_done": ["kyc", "po"] },
  "tool_ledger": [
    { "step": 3, "tool": "create_vendor", "idempotency_key": "run_8f3a:step_3",
      "status": "committed", "external_id": "VND-4421", "observation_digest": "sha256:…" }
  ],
  "budget": { "tokens_in": 42000, "tokens_out": 8100, "usd_estimate": 1.24 },
  "state": "waiting_human",
  "waiting_for": { "type": "approval", "queue_id": "finance_po", "since": "2026-06-10T14:22:00Z" }
}

Store large observations out-of-band (S3, blob store) and reference by digest in the checkpoint. Keep the hot checkpoint row under a few hundred kilobytes so resume stays fast. Version the schema; migrations should read checkpoint_schema_version and upgrade on load.

What to checkpoint vs recompute

  • Always persist — messages, tool arguments, tool receipts, external IDs, idempotency keys, step index.
  • Safe to recompute — derived summaries if you can reproduce them deterministically from persisted messages.
  • Never rely on recompute — mutating tool outcomes; the ledger is the source of truth for what happened in the real world.

Idempotent tools and side-effect ledgers

Durable execution without idempotency duplicates damage. Every mutating tool should accept an idempotency_key scoped to run_id + step_index (or a UUID generated before the call and stored in the checkpoint). The tool server:

  1. Looks up prior result by key in a durable store.
  2. If found, returns the cached receipt without re-executing.
  3. If not found, executes, stores result, returns.

The agent’s tool ledger mirrors this: before invoking create_purchase_order, write status: pending with the key; after success, flip to committed with external_id. On resume, skip any step whose ledger entry is already committed. Pair with the patterns in tool error handling: distinguish retryable transport errors (safe to retry same key) from business-logic failures (mark failed, do not auto-retry).

Read-only tools (search, fetch document) need weaker guarantees but still benefit from short TTL caches keyed by arguments to avoid thundering herds on resume.

Pause, resume and human approval gates

Long-running agents spend most wall-clock time waiting, not inferring. When a step requires human sign-off:

  1. Checkpoint with state: waiting_human and a pointer to the review item.
  2. Release the worker process; do not block a GPU or hold a queue lease for days.
  3. On approval webhook, enqueue a resume job with run_id.
  4. Load checkpoint, append the human decision as a structured message, set state: running, continue from step_index + 1.

Rejections should branch explicitly: either terminal failed with reason, or a repair loop where the agent drafts a revised action and re-enters the queue. Tie into HITL confidence routing so only high-risk mutations pause; low-risk steps stream through. Enforce budget caps across pause boundaries — total tokens and wall-clock SLA include wait time.

Workflow engines and async tool callbacks

You can build durable execution on Postgres plus a job queue, but complex graphs (parallel sub-agents, timers, compensating transactions) map cleanly to workflow engines (Temporal, Restate, Inngest, LangGraph with a persistent checkpointer). Patterns that matter for LLM agents:

  • Activity timeouts — model calls get minutes; human gates get days with heartbeats.
  • Signals — inject “user cancelled” or “approval granted” without polling.
  • Child workflows — isolate a research sub-agent; parent waits on a result handle.
  • Saga compensations — if step 8 fails after step 5 committed, run defined rollback tools.

Whether you use a framework or roll your own, the invariant is the same: the workflow code must be deterministic given the checkpoint. Non-determinism belongs inside activities (model calls, HTTP) whose outputs are recorded in the ledger.

Harbor Logistics refactor

Harbor’s vendor onboarding agent previously ran as a single long-lived Celery task with in-process message history. After the duplicate PO incident, the team shipped:

  • Postgres agent_runs table with optimistic locking and the checkpoint schema above.
  • Idempotency middleware on all mutating internal APIs (create_vendor, create_po, provision_credentials).
  • Finance and legal approvals as waiting_human with Slack deep links; workers released within 200 ms of entering wait.
  • Resume jobs triggered by approval webhooks and by a nightly sweeper for stuck runs.
  • Dashboards: median wall time, time-in-waiting_human p95, duplicate-tool-call rate (target zero).

Outcomes over eight weeks: duplicate side effects on long runs 23% → 0%; successful resume after deploy or crash 94% (remaining 6% required manual intervention on corrupted checkpoints); median token spend per completed onboarding −18% because resume no longer replayed early research steps. Mean time to onboard a standard vendor 41 → 38 hours — modest wall-clock gain, but finance trust recovered enough to expand automated onboarding from 40% to 72% of new vendors.

Technique decision table

Need Prefer Avoid
Sub-minute chat agent In-memory session; optional trace export Full workflow engine per message
Multi-hour task with approvals Checkpoint + waiting_human + resume queue Blocking worker holding RAM for days
Mutating enterprise APIs Idempotency keys + tool ledger Blind retry on resume
Crash during tool call Ledger pending → reconcile or retry same key Assume failure and re-run committed step
Parallel research subtasks Child workflows with merged checkpoint One giant message thread
Debug stuck run State machine + trace_id in observability UI Grep worker stdout
Regulated audit trail Immutable ledger entries per committed tool Reconstructed chat-only history

Common pitfalls

  • Checkpointing only on success — crashes mid-tool leave ambiguous state; write pending before call.
  • Idempotency keys that change on resume — duplicates every side effect; key must be stable per logical step.
  • Storing megabyte observations in the DB row — slow resume; use digest + object storage.
  • No version field on checkpoints — two workers resume concurrently and diverge.
  • Replaying the model from step zero — burns budget and may choose different tools; restore messages from checkpoint.
  • Human wait without SLA alerts — runs rot in waiting_human for weeks.
  • Missing cancellation path — operators cannot stop a run that already committed partial work.
  • Ledger not visible in traces — on-call cannot tell committed vs attempted actions.

Production checklist

  • Assign stable run_id and propagate trace_id across resume attempts.
  • Define explicit run states and atomic transition rules with optimistic locking.
  • Checkpoint after every committed tool step; persist pending before mutating calls.
  • Require idempotency_key on all mutating tools; cache receipts server-side.
  • Maintain a tool ledger with pending | committed | failed per step.
  • Store large observations by digest; keep checkpoint rows lean.
  • Implement waiting_human with webhook resume and release workers immediately.
  • Run a sweeper for stuck states (waiting_tool, waiting_human past SLA).
  • On resume, skip ledger entries already committed; never double-execute.
  • Enforce token and cost budgets cumulatively across pauses.
  • Export metrics: resume success rate, duplicate-tool rate, time-in-wait p95.
  • Document manual recovery for corrupted checkpoints and partial compensations.

Key takeaways

  • Production agents are workflows, not single HTTP requests — state must survive restarts.
  • Checkpoints + idempotency keys prevent duplicate POs and double charges on resume.
  • Harbor Logistics cut duplicate side effects 23% → 0% with ledgers and approval pauses.
  • Release workers during human waits — durability is about stored state, not running processes.
  • Pair with tracing and HITL so operators see where a run is stuck and why.

Related reading