Guide

LLM agent deterministic replay and run reproducibility explained

Harbor DevOps shipped a Kubernetes migration agent that reads cluster state, proposes a rollout plan, opens a change ticket, and applies Terraform through a gated pipeline. On a Tuesday night it approved a destructive node-pool replacement on a staging cluster that should have been excluded by policy. On-call pulled the OpenTelemetry trace: twelve ReAct steps, three parallel tool batches, and a final apply_terraform call with wrong arguments. Engineers tried to replay the run locally with the same user prompt. They failed 27% of the time on the first attempt and could not match the production trajectory in 73% of sessions even after ten tries. Temperature was 0.7 in prod but 0 in dev. Live describe_cluster returned different pod counts. The model version had rolled forward overnight. Nobody had stored the exact tool observations from the incident window.

Deterministic replay means re-executing an agent run so that every LLM completion and every tool observation matches a recorded baseline unless you explicitly swap in a new model or policy under test. Run reproducibility is the broader discipline: capturing enough of the environment — prompts, model IDs, decoding params, tool responses, clocks, feature flags — that another engineer can reproduce the failure or verify a fix. This complements agent tracing (what happened) and audit trails (provable history) with a rerunnable artifact. Harbor built run manifests, LLM cassettes, and tool replay stubs into their E2E harness. Incident reproducibility rose from 73% to 94%; mean time to root cause on P1 agent failures fell from 6.8 hours to 1.4 hours. This guide covers replay architecture, seeding and nondeterminism sources, record-replay modes, the Harbor refactor, a technique decision table, pitfalls, and a production checklist.

Why traces alone are not replay

Traces answer “what did the agent do?” They rarely answer “can I make it do that again?” Agent runs diverge for predictable reasons:

  • LLM sampling — temperature, top-p, and provider-side batching change token sequences even with identical prompts.
  • Model and prompt drift — silent router upgrades, prompt template changes, or prompt registry tags not pinned on the incident run.
  • Live tool variance — APIs return different JSON; pagination cursors move; rate-limit headers differ; clock skew shifts “now” in relative date tools.
  • Parallel orderingparallel tool batches may complete in different order; observations land in different sequence unless the runtime sorts them.
  • External state — saga ledgers, checkpoints, and idempotency keys from half-finished runs change the next turn.
  • Missing context — traces redact PII or truncate large tool payloads; replay without the full observation body misleads the model.

Reproducibility requires treating a run as a versioned bundle, not a screenshot of spans.

Run manifest: the minimum reproducibility contract

Harbor stores a run manifest at the start of every production agent session and updates it on each step:

  • run_id, parent_run_id, tenant, and policy version
  • Initial user message and system prompt hash (not always the raw text if secrets embed)
  • Model route: provider, model ID, API version, decoding params
  • Tool catalog snapshot hash (names, schemas, compensator pairs)
  • Feature flags, permission scope, and approval gate outcomes
  • Injected clock (2026-06-12T03:14:00Z) when tools use relative dates
  • Random seed or seed_mode: record for any local RNG in tools

On incident export, the manifest plus a step-indexed event log becomes the replay input. The event log append-only records: LLM_REQUEST, LLM_RESPONSE, TOOL_CALL, TOOL_RESULT, CHECKPOINT, POLICY_DECISION. Each entry carries content hashes so tampering is detectable, aligned with compliance audit patterns.

Full replay vs partial replay

  • Full deterministic replay — LLM responses served from cassettes; tools return recorded JSON. Re-executes orchestration logic only. Use for regression tests and exact incident reproduction.
  • Hybrid replay — cassettes for tools, live LLM with temperature 0 and pinned model. Use when testing prompt fixes against real model behavior while holding external APIs constant.
  • Live re-run — same manifest constraints but fresh LLM and tools. Use for flake detection, not single-incident RCA.

LLM cassettes and tool stubs

An LLM cassette maps a normalized request fingerprint to the exact completion bytes returned in production:

fingerprint = hash(model_id, messages[], tools[], decoding_params)
cassette[fingerprint] = { completion, logprobs?, finish_reason, usage }

Normalization matters: strip nondeterministic metadata, canonicalize JSON key order in tool definitions, and hash system prompts by registry version ID rather than raw string when templates include dynamic timestamps.

Tool stubs replay TOOL_RESULT events by (tool_name, arguments_hash, step_index). Step index disambiguates identical calls in one run (e.g. two get_pod calls with the same name but different namespaces on different turns). For mutating tools, stubs return recorded responses without side effects; compensators are not invoked during replay unless you run a saga simulation mode.

Harbor's harness fails CI if a replayed run produces a fingerprint miss — signaling orchestration drift or cassette staleness before merge.

Controlling nondeterminism at the source

Replay is cheaper when production is already replay-friendly:

  • Pin decoding in prod for high-risk agents — temperature 0 for migration, finance, and provisioning agents; document exceptions.
  • Stable observation ordering — sort parallel tool results by step_index before appending to context.
  • Inject clocks — pass now() from the runtime into tools instead of calling wall clock inside tool implementations.
  • Version everything — model route, prompt tag, tool schema bundle, policy engine ruleset.
  • Record, don't rely on re-fetch — store full tool payloads in the event log (with PII scrubbing at write time), not just span summaries.
  • Separate replay sandboxes — replay never hits production APIs even if cassettes miss; default to stub mode with explicit opt-in to live.

Harbor DevOps refactor walkthrough

Harbor's incident agent had failed because step 7's describe_cluster observation listed zero staging taints on a cluster that did have an exclusion taint in prod — a transient cache bug in their wrapper. Re-running with live tools never reproduced step 7 exactly; the model sometimes skipped the taint check entirely.

After refactor:

  1. Every prod run wrote manifest + event log to object storage within 200 ms of step completion.
  2. On-call clicked “export replay bundle” from the trace UI; bundle included manifest, cassettes for all LLM calls, and tool results.
  3. Local CLI harbor-agent replay --bundle id=run_8f2a --mode full re-executed orchestration; step 7 matched prod byte-for-byte.
  4. Engineers swapped only the taint-parser fix into the tool stub layer and re-ran hybrid mode (live LLM, stubbed tools) to verify the model now blocked apply.
  5. Regression test added: full replay of run_8f2a must end in POLICY_DENY after fix merge.

Metrics: reproducibility 73% → 94% (remaining gap = runs before manifest rollout and PII-redacted payloads too short to replay). P1 MTTR 6.8 h → 1.4 h. CI replay suite grew to 840 cassettes covering top agent workflows.

Technique decision table

Approach Best for Weak when Harbor-style signal
Trace-only debug Quick latency and error taxonomy Exact trajectory reproduction 10 replays, 7 different outcomes
Full record-replay Incident RCA, regression locks Testing novel model behavior Need byte-identical repro
Hybrid (stub tools, live LLM) Prompt and policy fixes Tool-wrapper bugs Model chose wrong action given fixed facts
Live re-run with pinned manifest Flake rate measurement Single incident debugging Probabilistic pass/fail gates
Counterfactual replay “What if step 5 returned X?&rdash; Large search space Policy what-if analysis

Common pitfalls

  • Cassette keyed only on user message — ignores tool observations that changed the next LLM request; false cache hits.
  • Storing truncated tool results in traces — replay context differs from prod; model takes a new path immediately.
  • Replaying against live mutating APIs — doubles charges, sends duplicate emails; always stub writes during full replay.
  • Ignoring parallel step order — cassettes miss when batch completion order flips.
  • No manifest on subagent runs — parent replay succeeds; delegated branch diverges silently.
  • PII redaction without replay tier — compliance scrubs bodies engineers need; use role-gated full-fidelity export.
  • Cassette rot after prompt change — mass false failures; version cassettes per prompt tag and prune on registry bump.
  • Assuming temperature 0 is enough — provider updates, batching, and tool variance still break live re-runs.

Engineer checklist

  • Emit a run manifest at session start with model, prompt, tool, and policy versions.
  • Append-only event log: LLM requests/responses, tool calls/results, checkpoints.
  • Normalize LLM request fingerprints; store completion bytes in cassettes.
  • Key tool stubs by tool name, arguments hash, and step index.
  • Sort parallel tool observations before context injection.
  • Inject runtime clock into date-sensitive tools.
  • One-click export replay bundle from trace or audit UI.
  • CLI/API: full, hybrid, and live re-run modes with sandbox defaults.
  • Fail CI on cassette miss when orchestration code changes.
  • Add regression replay for every P1/P2 agent incident after fix.
  • Role-gate full-fidelity exports; redact only in general audit streams.
  • Include subagent manifests in parent bundle exports.

Key takeaways

  • Traces show history; replay bundles let you rerun it.
  • Run manifests pin every versioned input — model, prompt, tools, policy, clock.
  • LLM cassettes plus tool stubs enable deterministic full replay without prod side effects.
  • Hybrid replay separates model bugs from tool and environment bugs.
  • Harbor cut P1 MTTR 6.8 h → 1.4 h by making incidents exportable artifacts, not archaeology exercises.

Related reading