LLM agent testing, mocking and E2E harness explained

Harbor DevOps merged a pull request that added a “rollback failed deploy” agent workflow. The nightly regression suite passed, but only because it called the real OpenAI API against a live Postgres staging database. Three days later, a dependency bump changed token sampling enough that the agent skipped the verify_healthcheck tool call, and production rolled back the wrong Kubernetes namespace. The post-mortem found that 38% of CI runs on agent PRs had been failing intermittently; engineers had been retrying until green.

Agent testing harnesses exercise orchestration code, tool wiring, and policy gates without depending on non-deterministic model output or shared staging data on every commit. This is distinct from offline evaluation (scoring trajectories on golden tasks) and from sandbox isolation (containing blast radius). Harbor split tests into three layers: pure unit tests with stubbed LLM responses, integration tests with record-replay cassettes, and a thin nightly E2E slice against a disposable environment. PR failure rate on agent changes fell from 38% to 4%, and the namespace regression would have been caught by a deterministic trajectory assertion. This guide covers the agent test pyramid, mocking strategies, cassette design, harness APIs, CI placement, the Harbor DevOps refactor, a technique decision table, pitfalls, and a production checklist.

The agent test pyramid

Single-turn prompt tests do not cover multi-step agents. Map coverage to layers with different cost and flakiness profiles.

Unit layer (milliseconds, no network)

  • Tool schema validation — argument parsers reject malformed JSON before any HTTP call
  • Policy gates — permission tiers block mutating tools when approval state is missing
  • State machine transitions — run FSM legal edges without invoking the model
  • Prompt assembly — snapshot system prompts and tool manifests; catch accidental token bloat
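
The unit-layer checks listed above need no model and no network. A minimal sketch, assuming a hypothetical five-state run FSM, an "approval tier 2 or higher" rule for mutating tools, and a rollback tool that takes a namespace argument (none of these names or rules come from Harbor's code):

```typescript
// Hypothetical run states; a real agent will have its own set.
type RunState = "planning" | "awaiting_approval" | "executing" | "succeeded" | "failed";

// Legal FSM edges: any transition not listed here is rejected.
const LEGAL_EDGES: Record<RunState, RunState[]> = {
  planning: ["awaiting_approval", "executing", "failed"],
  awaiting_approval: ["executing", "failed"],
  executing: ["succeeded", "failed"],
  succeeded: [],
  failed: [],
};

function canTransition(from: RunState, to: RunState): boolean {
  return LEGAL_EDGES[from].includes(to);
}

// Assumed policy: mutating tools require approval tier 2 or higher.
const MUTATING_TOOLS = new Set(["rollback", "execute_sql"]);

function isToolAllowed(tool: string, approvalTier: number | undefined): boolean {
  if (!MUTATING_TOOLS.has(tool)) return true; // read-only tools always pass
  return approvalTier !== undefined && approvalTier >= 2;
}

// Parse raw tool arguments and reject malformed input before any HTTP call.
function parseRollbackArgs(raw: string): { namespace: string } {
  let parsed: unknown;
  try {
    parsed = JSON.parse(raw);
  } catch {
    throw new Error("tool arguments are not valid JSON");
  }
  const ns = (parsed as { namespace?: unknown } | null)?.namespace;
  if (typeof ns !== "string" || ns.length === 0) {
    throw new Error("rollback requires a non-empty 'namespace' string");
  }
  return { namespace: ns };
}
```

Each function is pure, so a test for it runs in milliseconds and cannot flake on model output.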

Integration layer (seconds, mocked LLM + mocked tools)

Feed canned assistant messages that mirror a recorded trajectory. Assert the harness invokes the correct next tool with expected arguments, handles tool errors, and reaches a terminal state. This is where most agent regressions belong.
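
To make the integration layer concrete, here is a sketch of a test driving an orchestration loop with canned assistant messages and stubbed tools. The `runAgent` loop is a simplified stand-in written for this example; real tests should call the production orchestrator entrypoint instead. All type and tool names are assumptions.

```typescript
interface ToolCall { name: string; args: Record<string, string>; }
interface AssistantMessage { content: string; toolCall?: ToolCall; }
type Tool = (args: Record<string, string>) => string;

interface RunResult {
  toolsCalled: ToolCall[];
  terminalState: "succeeded" | "failed";
}

// Stand-in orchestrator: ask the (mocked) model for the next message,
// dispatch any tool call, and stop on a plain reply or on an error.
function runAgent(
  nextMessage: () => AssistantMessage,
  tools: Record<string, Tool>,
  maxSteps = 10,
): RunResult {
  const toolsCalled: ToolCall[] = [];
  for (let step = 0; step < maxSteps; step++) {
    const msg = nextMessage();
    if (msg.toolCall === undefined) return { toolsCalled, terminalState: "succeeded" };
    const tool = tools[msg.toolCall.name];
    if (tool === undefined) return { toolsCalled, terminalState: "failed" }; // unknown tool
    toolsCalled.push(msg.toolCall);
    try {
      tool(msg.toolCall.args);
    } catch {
      return { toolsCalled, terminalState: "failed" }; // a tool error ends the run
    }
  }
  return { toolsCalled, terminalState: "failed" }; // step budget exhausted
}

// Canned trajectory mirroring a recorded rollback run (hypothetical data).
function cannedRollback(): () => AssistantMessage {
  const script: AssistantMessage[] = [
    { content: "", toolCall: { name: "get_deploy", args: { version: "v2.4.1" } } },
    { content: "", toolCall: { name: "verify_healthcheck", args: { namespace: "canary" } } },
    { content: "", toolCall: { name: "rollback", args: { namespace: "canary" } } },
    { content: "Rolled back v2.4.1 in canary." },
  ];
  return () => script.shift() ?? { content: "done" };
}
```

A test then asserts on `toolsCalled` and `terminalState`, and swaps in a throwing stub for one tool to cover the error path.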

E2E layer (minutes, real stack slice)

A small, curated set of tasks runs against ephemeral infrastructure with either live models (temperature 0, fixed seed where supported) or approved cassettes refreshed on a schedule. Reserve this layer for cross-service contracts and deploy smoke tests, not every PR.

Mocking the LLM: fixtures, stubs and cassettes

Non-determinism is the root cause of most flakes. Choose a strategy based on what you need to prove.

Hard-coded response queues

Push an ordered list of assistant messages into a fake LLM client. Each chat.completions.create call pops the next item. Fastest for unit tests; brittle when prompts change wording but tool calls stay the same.
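
A response queue can be as small as the sketch below. The `QueueLlm` class, its `complete` method, and the message shape are assumptions made for this example; a real fake would implement whatever client interface the orchestrator depends on.

```typescript
// Hypothetical message types, simplified from a chat-completion response.
interface ToolCall { name: string; args: Record<string, unknown>; }
interface AssistantMessage { content: string; toolCalls: ToolCall[]; }

class QueueLlm {
  // Every request the agent sent, kept so tests can inspect prompts.
  public readonly requests: unknown[] = [];
  private readonly queue: AssistantMessage[];

  constructor(responses: AssistantMessage[]) {
    this.queue = [...responses]; // copy so the fixture can be reused
  }

  // One request in, the next canned message out.
  async complete(request: unknown): Promise<AssistantMessage> {
    this.requests.push(request);
    const next = this.queue.shift();
    if (next === undefined) {
      // Fail loudly if the agent makes more calls than the test scripted.
      throw new Error(`QueueLlm exhausted after ${this.requests.length - 1} responses`);
    }
    return next;
  }

  // Unconsumed responses; a non-zero value at test end means the agent stopped early.
  get remaining(): number {
    return this.queue.length;
  }
}
```

Asserting `remaining === 0` at the end of a test catches runs that terminate before the scripted trajectory is complete.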

Record-replay cassettes

On first run (or a dedicated --record mode), call the real API and serialize request hashes plus responses to versioned YAML. CI replays from disk. Redact API keys and PII at write time. Re-record cassettes when you intentionally change prompts or model versions — treat cassette diffs like snapshot review.
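
The record-replay flow can be sketched as follows. Several simplifications are assumptions of this example: the cassette is a plain in-memory object rather than a YAML file, the client is synchronous, the hash is a 32-bit FNV-1a chosen to keep the code dependency-free (a real cassette would more likely use SHA-256), and the redaction pattern only covers keys shaped like `sk-...`.

```typescript
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// Canonical JSON with sorted object keys, so equal requests hash identically.
function canonical(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return `[${value.map(canonical).join(",")}]`;
  const keys = Object.keys(value).sort();
  return `{${keys.map((k) => `${JSON.stringify(k)}:${canonical(value[k])}`).join(",")}}`;
}

// FNV-1a over UTF-16 code units; illustrative only, not collision-resistant.
function hashRequest(request: Json): string {
  const text = canonical(request);
  let h = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    h ^= text.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return ("0000000" + h.toString(16)).slice(-8);
}

// Replace key-like strings before anything is written to the cassette.
function redact(value: Json): Json {
  if (typeof value === "string") return value.replace(/sk-[A-Za-z0-9_-]{8,}/g, "[REDACTED]");
  if (value === null || typeof value !== "object") return value;
  if (Array.isArray(value)) return value.map(redact);
  const out: { [key: string]: Json } = {};
  for (const k of Object.keys(value)) out[k] = redact(value[k]);
  return out;
}

interface Cassette { model: string; entries: { [requestHash: string]: Json }; }

class CassetteLlm {
  private readonly cassette: Cassette;
  private readonly mode: "record" | "replay";
  private readonly live: ((request: Json) => Json) | undefined;

  constructor(cassette: Cassette, mode: "record" | "replay", live?: (request: Json) => Json) {
    this.cassette = cassette;
    this.mode = mode;
    this.live = live;
  }

  complete(request: Json): Json {
    const key = hashRequest(request);
    const hit = this.cassette.entries[key];
    if (hit !== undefined) return hit;
    if (this.mode === "replay") {
      // In CI a miss means the prompt or tool manifest changed: fail, never call out.
      throw new Error(`cassette miss for request ${key}; re-record deliberately`);
    }
    if (this.live === undefined) throw new Error("record mode needs a live client");
    const response = redact(this.live(request));
    this.cassette.entries[key] = response; // only the hash and redacted response are stored
    return response;
  }
}
```

Because only the request hash is stored, any prompt change produces a miss in replay mode, which is what forces the deliberate re-record and diff review.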

Tool-choice matchers

Instead of matching full natural-language replies, assert structural properties: “next message must include a tool_calls entry for list_pods with namespace label.” Pair with minimal stub text so tests survive harmless phrasing drift.
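
A structural matcher of this kind might look like the sketch below. The message shape and the helper names `matchesSubset` and `hasToolCall` are hypothetical; the point is that only the tool name and a subset of its arguments are compared, never the prose.

```typescript
interface ToolCall { name: string; args: Record<string, unknown>; }
interface AssistantMessage { content: string; toolCalls: ToolCall[]; }

// True when every key in `subset` exists in `actual` with a matching value.
// Nested objects are compared recursively; arrays are compared index by index,
// so a shorter expected array matches as a prefix.
function matchesSubset(actual: unknown, subset: unknown): boolean {
  if (subset === null || typeof subset !== "object") return actual === subset;
  if (actual === null || typeof actual !== "object") return false;
  const a = actual as Record<string, unknown>;
  const s = subset as Record<string, unknown>;
  return Object.keys(s).every((k) => k in a && matchesSubset(a[k], s[k]));
}

// Does the message contain a call to `name` whose args include `argsSubset`?
function hasToolCall(
  msg: AssistantMessage,
  name: string,
  argsSubset: Record<string, unknown> = {},
): boolean {
  return msg.toolCalls.some((c) => c.name === name && matchesSubset(c.args, argsSubset));
}
```

With this helper, `hasToolCall(msg, "list_pods", { namespace: "canary" })` holds regardless of what the assistant wrote in `content` or which extra arguments it passed.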

Seeded live calls (sparingly)

Some providers expose a seed parameter for best-effort repeatability, and token logprobs can help quantify drift between runs. Use live calls only in nightly jobs with budget caps; never as the default PR gate.

Mocking tools and external systems

Agents fail in the glue between model decisions and side effects. Fake the boundaries, not the orchestration loop.

  • In-memory fakes — FakeK8sClient with preset pod lists; assert rollback targets the canary namespace
  • HTTP wire mocks — Mountebank, WireMock, or msw for REST tools; match on method, path, and body subset
  • Fault injection — Return 503 on third execute_sql call; verify retry policy and fallback resilience
  • Latency simulation — Delay tool responses to exercise timeout and cancel paths without wall-clock waits in unit tests (injectable clocks)
  • Idempotency keys — Fake stores dedupe by key; replay same tool call twice and assert single side effect
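
Fault injection, idempotency, and injectable waits from the list above can be combined in one small fake. This is a sketch under stated assumptions: `FakeSqlTool`, its `failOnCall` option, and the `withRetry` helper with its backoff schedule are invented for the example, and the calls are synchronous to keep it short.

```typescript
// Tool fake that fails on a scripted call number and dedupes by idempotency key.
class FakeSqlTool {
  calls = 0;                        // total invocations, including failed ones
  readonly applied: string[] = [];  // statements that actually took effect
  private readonly seenKeys = new Set<string>();
  private readonly failOnCall: number;

  constructor(failOnCall: number) {
    this.failOnCall = failOnCall;
  }

  executeSql(statement: string, idempotencyKey: string): string {
    this.calls += 1;
    if (this.calls === this.failOnCall) throw new Error("503 Service Unavailable");
    if (this.seenKeys.has(idempotencyKey)) return "duplicate"; // replay: no second side effect
    this.seenKeys.add(idempotencyKey);
    this.applied.push(statement);
    return "applied";
  }
}

// Retry policy under test. `sleep` is injected so unit tests record the
// requested delays instead of waiting on wall-clock time.
function withRetry<T>(fn: () => T, maxAttempts: number, sleep: (ms: number) => void): T {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return fn();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) sleep(100 * 2 ** (attempt - 1)); // 100 ms, 200 ms, ...
    }
  }
  throw lastError;
}
```

A test constructs `new FakeSqlTool(3)`, issues three statements through `withRetry`, and asserts that the third succeeded on its second attempt, that the recorded delays match the policy, and that replaying a key leaves `applied` unchanged.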

Register mocks in a dependency-injection container keyed by test name so parallel CI workers do not share mutable global state.
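
One way to key fakes by test name is a small registry like the sketch below. `MockRegistry` and its method names are hypothetical; a project that already uses a dependency-injection library would use that library's scoping mechanism instead.

```typescript
// Each test name gets its own scope, so fakes are never shared across tests.
class MockRegistry {
  private readonly scopes = new Map<string, Map<string, unknown>>();

  register(testName: string, port: string, fake: unknown): void {
    let scope = this.scopes.get(testName);
    if (scope === undefined) {
      scope = new Map<string, unknown>();
      this.scopes.set(testName, scope);
    }
    scope.set(port, fake);
  }

  // Look up the fake bound to `port` for this test; missing bindings fail fast.
  resolve<T>(testName: string, port: string): T {
    const fake = this.scopes.get(testName)?.get(port);
    if (fake === undefined) {
      throw new Error(`no fake registered for '${port}' in test '${testName}'`);
    }
    return fake as T;
  }

  // Call from test teardown so state cannot leak into a later test.
  dispose(testName: string): void {
    this.scopes.delete(testName);
  }
}
```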

Harness API design

A good harness looks like a test SDK, not a pile of copy-pasted setup blocks.

Scenario builder

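// Keep a handle on the fake so assertions can inspect it after the run.
const fakeK8s = new FakeK8s({ namespaces: ['canary', 'prod'] });

const run = await harness
  .withLlmCassette('rollback-happy-path.yaml') // replay recorded model responses
  .withToolFake('k8s', fakeK8s)
  .withInitialState({ run_id: 'test-1', approval_tier: 2 })
  .executeAgent({ user: 'Rollback deploy v2.4.1' });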

Trajectory assertions

  • expect(run.toolsCalled).toEqualSequence(['get_deploy', 'verify_healthcheck', 'rollback'])
  • expect(run.terminalState).toBe('succeeded')
  • expect(fakeK8s.rollbackTarget).toBe('canary')
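
Matchers such as toEqualSequence are not built into common test frameworks, so a harness has to supply them. A plain-function sketch that a custom matcher could wrap, with hypothetical names and error messages:

```typescript
interface ToolInvocation { name: string; args: Record<string, unknown>; }

// Exact-sequence check: same tools, same order, same count.
function assertToolSequence(calls: ToolInvocation[], expected: string[]): void {
  const actual = calls.map((c) => c.name);
  const firstDiff = expected.findIndex((name, i) => actual[i] !== name);
  if (firstDiff !== -1 || actual.length !== expected.length) {
    const at = firstDiff === -1 ? expected.length : firstDiff;
    throw new Error(
      `tool sequence diverged at step ${at}: ` +
        `expected [${expected.join(", ")}], got [${actual.join(", ")}]`,
    );
  }
}

// Ordering invariant: `before` must appear earlier than the first `after` call.
// Looser than an exact sequence, so it survives harmless extra tool calls.
function assertCalledBefore(calls: ToolInvocation[], before: string, after: string): void {
  const names = calls.map((c) => c.name);
  const afterIdx = names.indexOf(after);
  if (afterIdx === -1) return; // `after` never ran, so there is nothing to order
  const beforeIdx = names.indexOf(before);
  if (beforeIdx === -1 || beforeIdx > afterIdx) {
    throw new Error(`${after} ran without a preceding ${before}`);
  }
}
```

The ordering check expresses the invariant behind the namespace incident directly: `assertCalledBefore(calls, "verify_healthcheck", "rollback")`.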

Trace export for debugging

On failure, dump OpenTelemetry-style spans (or a JSON trace) to the CI artifact store. Engineers should see which stub returned unexpected data without reproducing locally with live keys.
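
A JSON trace does not need a full tracing SDK. The sketch below records spans with an injected clock and serializes them; the `Span` fields and `TraceRecorder` API are assumptions loosely modeled on span concepts, not the OpenTelemetry API, and uploading the string to the artifact store is left to the CI job.

```typescript
interface Span {
  name: string;
  startMs: number;
  endMs: number;
  attributes: Record<string, string>;
  error?: string;
}

class TraceRecorder {
  private readonly spans: Span[] = [];
  private readonly now: () => number;

  // The clock is injected so traces are deterministic in tests.
  constructor(now: () => number) {
    this.now = now;
  }

  // Run `fn` inside a span, capturing timing and any thrown error message.
  span<T>(name: string, attributes: Record<string, string>, fn: () => T): T {
    const startMs = this.now();
    try {
      const result = fn();
      this.spans.push({ name, startMs, endMs: this.now(), attributes });
      return result;
    } catch (err) {
      const message = err instanceof Error ? err.message : String(err);
      this.spans.push({ name, startMs, endMs: this.now(), attributes, error: message });
      throw err; // the trace observes failures; it does not swallow them
    }
  }

  // Serialized form to attach to a failed CI run.
  toJson(runId: string): string {
    return JSON.stringify({ runId, spans: this.spans }, null, 2);
  }
}
```

Putting the stub name in `attributes` (for example `{ stub: "FakeK8s.listPods" }`) is what lets an engineer see which fake returned the unexpected data.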

Golden task files

Store scenario definitions as data: user message, expected tool sequence, optional negative cases. Share the same files between CI harness and offline eval — eval scores quality; harness asserts invariants.
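
The split between harness invariants and eval scores can be shown against a single task definition. In this sketch the field names (`expectedTools`, `mustPrecede`), the inlined JSON, and the scoring formula are all assumptions; a real project would load the task from a shared data file and use its own eval metric.

```typescript
interface GoldenTask {
  id: string;
  userMessage: string;
  expectedTools: string[];
  mustPrecede?: [string, string][]; // [before, after] ordering pairs
}

// Inlined as JSON text here; in a repository this would be a file on disk.
const TASK_FILE = `{
  "id": "rollback-happy-path",
  "userMessage": "Rollback deploy v2.4.1",
  "expectedTools": ["get_deploy", "verify_healthcheck", "rollback"],
  "mustPrecede": [["verify_healthcheck", "rollback"]]
}`;

const task = JSON.parse(TASK_FILE) as GoldenTask;

// Harness view: hard invariants. Any violation fails the CI test.
function invariantViolations(t: GoldenTask, toolsCalled: string[]): string[] {
  const violations: string[] = [];
  for (const [before, after] of t.mustPrecede ?? []) {
    const afterIdx = toolsCalled.indexOf(after);
    const beforeIdx = toolsCalled.indexOf(before);
    if (afterIdx !== -1 && (beforeIdx === -1 || beforeIdx > afterIdx)) {
      violations.push(`${after} ran without a preceding ${before}`);
    }
  }
  return violations;
}

// Eval view: a soft score in [0, 1], here the fraction of expected tools called.
function trajectoryScore(t: GoldenTask, toolsCalled: string[]): number {
  if (t.expectedTools.length === 0) return 1;
  const hits = t.expectedTools.filter((name) => toolsCalled.includes(name)).length;
  return hits / t.expectedTools.length;
}
```

A run that skips verify_healthcheck scores 2/3 in eval, which might pass a quality threshold, yet it produces a violation that fails the harness test outright.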

CI placement and flake control

  • PR gate — unit + integration with cassettes; target < 90 s total on agent packages
  • Merge queue — optional replay against fresh cassettes if multiple agents touched shared prompts
  • Nightly — live-model E2E on disposable env; refresh cassettes when drift > threshold
  • Quarantine — flaky tests get an issue owner; do not set retry: 3 without fixing the root cause
  • Parallelism — isolate DB/schema per worker; agents that write shared staging tables will race

Tag tests with @agent and @cassette-v3 so model-version upgrades trigger selective re-recording instead of full-suite red.

Harbor DevOps refactor

Before the harness, Harbor’s agent package had twelve integration tests that each called OpenAI and a shared staging cluster. Flakes clustered around rate limits, stale deploy fixtures, and non-reproducible tool ordering when the model paraphrased success messages.

The team introduced AgentTestHarness with injectable LLM and tool ports, migrated eight scenarios to YAML cassettes, and added three pure unit tests for the rollback FSM. A new negative case asserted verify_healthcheck must run before rollback — the exact gap that caused the namespace incident. PR agent-test failure rate dropped from 38% to 4%; median CI time on agent PRs fell from 14 minutes to 3.2 minutes because live API calls moved to nightly only.

Harbor wired harness trace dumps into the same backend as production observability, so failing CI runs were inspectable with the same UI engineers used for on-call.

Technique decision table

Approach | Best for | Weak when
Live API on every PR | Early prototypes, zero harness investment | Flakes, cost, non-reproducible failures at scale
Hard-coded LLM stubs | FSM, policy, and parser unit tests | Prompt wording changes break tests with no functional regression
Record-replay cassettes | Stable integration regression on tool trajectories | Model upgrades require deliberate re-record review
Eval-only gate (no harness) | Quality benchmarking, leaderboard tracking | Misses wiring bugs eval scenarios do not cover
Nightly live E2E | Cross-service contracts, provider drift detection | Too slow and expensive for per-commit feedback

Common pitfalls

  • Testing the model, not the agent — asserting exact prose instead of tool calls and terminal state.
  • Shared mutable staging — parallel CI workers corrupt each other’s data; use per-test fakes or isolated schemas.
  • Cassettes with secrets — API keys in committed YAML; redact at record time and scan in pre-commit.
  • No negative paths — only happy-path cassettes; production fails on tool 503s and missing approvals.
  • Harness bypasses production code — duplicate loop logic in tests; always execute the same orchestrator entrypoint.
  • Ignoring checkpoint resume — crash mid-run and resume must be tested when using durable execution.
  • Retry until green — masks flake rate; track test_flake_rate as an engineering metric.
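
The guide does not define test_flake_rate, so the sketch below assumes one common definition: a (test, commit) pair is flaky when it both failed and passed on the same commit, and the rate is flaky pairs divided by all pairs. The `TestAttempt` shape is hypothetical.

```typescript
interface TestAttempt { testId: string; commit: string; passed: boolean; }

// Fraction of (test, commit) pairs that saw both a failure and a pass.
// Code did not change between those attempts, so the difference is flake.
function flakeRate(attempts: TestAttempt[]): number {
  const outcomes = new Map<string, { passed: boolean; failed: boolean }>();
  for (const a of attempts) {
    const key = `${a.testId}@${a.commit}`;
    const seen = outcomes.get(key) ?? { passed: false, failed: false };
    if (a.passed) seen.passed = true;
    else seen.failed = true;
    outcomes.set(key, seen);
  }
  if (outcomes.size === 0) return 0;
  let flaky = 0;
  outcomes.forEach((o) => {
    if (o.passed && o.failed) flaky += 1;
  });
  return flaky / outcomes.size;
}
```

Feeding this function every attempt, including retried ones, keeps retries visible in the metric instead of letting them hide behind a green final status.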

Production checklist

  • Define agent test pyramid: unit, integration (cassettes), nightly E2E.
  • Injectable LLM port with stub, cassette, and live implementations.
  • Tool fakes with fault injection and idempotency assertions.
  • Scenario builder API shared across packages.
  • Trajectory assertions on tool name, args, and order — not full text.
  • Version cassettes with model ID and prompt hash metadata.
  • Redact secrets and PII when recording cassettes.
  • Export trace JSON on CI failure to artifact store.
  • Align golden task files with offline eval datasets.
  • Test cancel, timeout, and resume paths with injectable clocks.
  • PR gate under 90 s for agent integration suite.
  • Track and alert on test flake rate; quarantine chronic offenders.

Key takeaways

  • Separate harness from eval — harness proves wiring; eval scores quality on golden tasks.
  • Harbor cut the PR failure rate on agent changes from 38% to 4% with cassettes and tool fakes.
  • Assert tool trajectories, not verbatim model prose.
  • Record-replay balances realism and determinism for CI.
  • Nightly live E2E catches provider drift without blocking every commit.
