LLM agent evaluation and benchmarking explained

Harbor Analytics’ on-call agent passed 94% of single-turn SQL generation tests in CI. In production it failed 41% of multi-step incident workflows: it called run_query with the right syntax but on the wrong warehouse, retried the same failing tool three times, and once issued a page_oncall before confirming the alert threshold. The team had benchmarked answers, not trajectories. Rebuilding evaluation around task success, tool-sequence correctness, and sandbox replay lifted end-to-end incident resolution from 58% to 79% while cutting median tool calls per task from 7.2 to 4.1.

Agent evaluation measures whether a multi-step ReAct-style loop completes a goal safely and efficiently, not whether one isolated completion looks plausible. This guide covers outcome versus process metrics, evaluation environments (sandbox, replay, shadow), golden task suites and regression CI gates, trajectory judges, the Harbor Analytics refactor, a decision table comparing trajectory evaluation with single-turn eval and online-only monitoring, common pitfalls, and a production checklist.

Why single-turn benchmarks fail for agents

Chat benchmarks (MMLU slices, HELM subsets, static Q&A golden sets) score one prompt and one response. Agents add state: prior tool results, partial plans, retry loops, and side effects on external systems. A model can produce a correct final sentence after an unsafe or wasteful path — or fail despite a reasonable-looking answer because it never called the verification tool.

Agent evaluation therefore needs at least three layers:

Outcome success — did the task complete? (ticket closed, file deployed, query returned expected row count)
Trajectory quality — were the right tools called in a sensible order with valid arguments?
Guardrail compliance — no forbidden tools, no PII leakage, spend within budget, termination before loop caps
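
One way to keep the three layers separate in code is a per-task record that stores each layer on its own field. The sketch below is illustrative only: `EpisodeResult`, its field names, and the 0.7 trajectory cutoff are assumptions, not part of any Harbor Analytics tooling.

```python
from dataclasses import dataclass, field

@dataclass
class EpisodeResult:
    """Hypothetical per-task record; one field per evaluation layer."""
    task_id: str
    outcome_passed: bool          # layer 1: did the end-state verifier pass?
    trajectory_score: float       # layer 2: 0.0-1.0 from step checks or a rubric
    violations: list = field(default_factory=list)  # layer 3: guardrail breaches

    def verdict(self, min_trajectory: float = 0.7) -> str:
        # Guardrail breaches dominate: an unsafe run is blocked even if it succeeded.
        if self.violations:
            return "blocked"
        if not self.outcome_passed:
            return "failed"
        # Assumed cutoff: flag successes that took a poor path instead of hiding them.
        if self.trajectory_score < min_trajectory:
            return "passed_with_poor_trajectory"
        return "passed"

unsafe_success = EpisodeResult("incident-007", True, 0.9, ["paged_without_confirmation"])
print(unsafe_success.verdict())
```

Ordering the checks this way means a run that reaches the right end state through a forbidden action never counts as a pass.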

Production teams that only track final-message quality discover regressions after users complain or on-call pages spike. Agent benchmarks belong in CI and in sampled online evaluation pipelines.

Core metrics taxonomy

Task-level outcome metrics

Task success rate (TSR) — binary or graded pass on the end state; gold standard when verifiers exist (SQL result hash, API state check).
Partial credit — rubric scores when outcomes are subjective (draft quality, summary completeness).
Time and cost to success — wall-clock steps, tokens, tool latency, dollars per resolved task.
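
A minimal sketch of how these outcome metrics might be aggregated from logged runs. The record keys (`passed`, `steps`, `cost_usd`) and the sample numbers are made up for illustration; the choice to charge failed runs to the cost-per-resolved figure is an assumption worth revisiting for your own accounting.

```python
def outcome_metrics(runs):
    """runs: list of dicts with 'passed' (bool), 'steps' (int), 'cost_usd' (float)."""
    if not runs:
        raise ValueError("no runs to score")
    successes = [r for r in runs if r["passed"]]
    total_cost = sum(r["cost_usd"] for r in runs)
    return {
        "tsr": len(successes) / len(runs),
        # Failed runs still cost money, so their spend stays in the numerator.
        "cost_per_resolved": total_cost / len(successes) if successes else float("inf"),
        "mean_steps_on_success": (
            sum(r["steps"] for r in successes) / len(successes) if successes else None
        ),
    }

# Illustrative data only.
runs = [
    {"passed": True, "steps": 4, "cost_usd": 0.10},
    {"passed": True, "steps": 6, "cost_usd": 0.20},
    {"passed": False, "steps": 9, "cost_usd": 0.30},
    {"passed": True, "steps": 5, "cost_usd": 0.20},
]
metrics = outcome_metrics(runs)
print(metrics)
```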

Tool-use metrics

Tool selection accuracy — was the chosen tool in the allowed set and appropriate for the subgoal?
Argument validity — schema-valid JSON, correct IDs, no hallucinated parameters.
Tool error rate — HTTP 4xx/5xx, sandbox exceptions, permission denials per 100 tasks.
Redundant call rate — identical tool+args repeated without new information (oscillation signal).
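
Two of these metrics are simple enough to compute directly from a logged call list. The sketch below assumes each call is a `(tool_name, args_dict)` pair and uses a deliberately simplified schema format (`{param: type}`, all parameters required); a real deployment would more likely validate against its actual JSON Schema. It also treats every exact repeat as redundant, which ignores the "without new information" qualifier above.

```python
import json

def redundant_call_rate(calls):
    """Share of calls that exactly repeat an earlier (tool, args) pair."""
    seen = set()
    redundant = 0
    for tool, args in calls:
        # Canonical JSON so key order in the args dict does not matter.
        key = (tool, json.dumps(args, sort_keys=True))
        if key in seen:
            redundant += 1
        seen.add(key)
    return redundant / len(calls) if calls else 0.0

def args_valid(args, schema):
    """Simplified check: exact parameter set, each value of the expected type."""
    if set(args) != set(schema):
        return False  # missing or hallucinated parameter
    return all(isinstance(args[name], expected) for name, expected in schema.items())

query = ("run_query", {"warehouse": "metrics", "sql": "SELECT 1"})
calls = [query, query, query, ("page_oncall", {"team": "data"})]
rate = redundant_call_rate(calls)          # 2 of 4 calls repeat an earlier one
run_query_schema = {"warehouse": str, "sql": str}
print(rate, args_valid(query[1], run_query_schema))
```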

Trajectory metrics

Steps to success — fewer is not always better; compare against a human or expert baseline.
Recovery rate — share of tasks that fail a tool call but eventually succeed after retry or alternate tool.
Forbidden action rate — calls to destructive tools without confirmation, or actions after budget exhaustion.
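
Recovery rate needs a denominator choice. The sketch below assumes it is measured over tasks that hit at least one tool error, so tasks that never saw an error do not inflate the figure; the record keys are hypothetical.

```python
def recovery_rate(episodes):
    """episodes: list of dicts with 'tool_errors' (int) and 'passed' (bool).

    Returns the share of error-hit tasks that still succeeded,
    or None when no task hit a tool error.
    """
    hit_error = [e for e in episodes if e["tool_errors"] > 0]
    if not hit_error:
        return None
    return sum(1 for e in hit_error if e["passed"]) / len(hit_error)

# Illustrative data only.
episodes = [
    {"tool_errors": 0, "passed": True},   # excluded: never hit an error
    {"tool_errors": 1, "passed": True},   # recovered
    {"tool_errors": 2, "passed": False},  # did not recover
    {"tool_errors": 1, "passed": True},   # recovered
]
print(recovery_rate(episodes))
```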

Safety and policy metrics

Track policy violations separately from task failure: unauthorized data access, successful prompt injections, and actions beyond the agent’s approved level of autonomy. A high TSR with a rising violation rate is a deploy blocker, not a win.

Evaluation environments: sandbox, replay, and shadow

Sandbox with fixture state

Run agents against mocked or containerized backends with deterministic fixtures. Harbor Analytics mounts a frozen Postgres snapshot and stub PagerDuty API so page_oncall returns scripted responses. Sandboxes enable parallel CI without production risk. See sandbox execution for isolation patterns.
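
A stub that returns scripted responses can be very small. The sketch below is not Harbor Analytics' implementation: `StubPager`, the `toy_agent` placeholder policy, the team name, and the fixture payload are all invented to show the pattern of recording calls instead of paging anyone.

```python
class StubPager:
    """Hypothetical stand-in for a paging API: records calls, pages nobody."""

    def __init__(self, scripted_response):
        self.scripted_response = scripted_response
        self.calls = []

    def page_oncall(self, team, reason):
        self.calls.append({"team": team, "reason": reason})
        return dict(self.scripted_response)  # copy so tests cannot mutate the fixture

def toy_agent(pager, observed_error_rate, slo_threshold):
    """Placeholder for the agent under test: pages only when the SLO is breached."""
    if observed_error_rate > slo_threshold:
        return pager.page_oncall("data-platform", "error rate above SLO")
    return {"status": "no_page"}

pager = StubPager({"status": "triggered", "incident_id": "FIXTURE-1"})
result = toy_agent(pager, observed_error_rate=0.01, slo_threshold=0.05)
false_pages = len(pager.calls)  # the verifier inspects recorded calls, not agent text
print(result, false_pages)
```

Because the stub keeps a call log, a verifier can assert on side effects (zero pages sent) rather than on what the agent claims it did.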

Recorded trace replay

Capture production tool observations (redacted) and replay them when evaluating prompt or planner changes. Replay tests whether a new policy would have chosen better actions given the same observations — cheap counterfactual analysis.
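
A replay harness can be sketched as a loop that feeds recorded observations to a candidate policy and notes where its action differs from the logged one. The trace format and the `cautious_policy` function here are assumptions for illustration. One caveat: after the first divergence, the recorded observations may no longer match what the new policy would actually have seen, so later differences are weaker evidence.

```python
def replay(trace, policy):
    """Compare a candidate policy against recorded actions.

    trace: list of {"observation": dict, "action": str} recorded steps.
    policy: callable(history) -> action, where history is the list of
            observations seen so far.
    Returns a list of (step_index, recorded_action, proposed_action).
    """
    diffs = []
    history = []
    for i, step in enumerate(trace):
        history.append(step["observation"])
        proposed = policy(list(history))
        if proposed != step["action"]:
            diffs.append((i, step["action"], proposed))
    return diffs

# Illustrative trace: the recorded agent paged before confirming the threshold.
trace = [
    {"observation": {"alert": "latency", "threshold_confirmed": False}, "action": "page_oncall"},
    {"observation": {"alert": "latency", "threshold_confirmed": True}, "action": "page_oncall"},
]

def cautious_policy(history):
    # Hypothetical candidate: query first, page only once the threshold is confirmed.
    return "page_oncall" if history[-1]["threshold_confirmed"] else "run_query"

diffs = replay(trace, cautious_policy)
print(diffs)
```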

Shadow mode on live traffic

New agent versions execute tools in read-only shadow or log-only mode alongside production. Compare trajectories and predicted actions without user impact. Pair with canary deployment when write tools are involved.

Golden task suites and regression CI

Curate golden tasks: realistic multi-step scenarios with clear success predicates. Each task should specify:

Initial user goal and optional conversation prefix
Allowed tool surface (subset of production catalog)
Success verifier (scripted check, not LLM opinion when possible)
Optional reference trajectory or maximum step budget
Tags: intent, difficulty, safety tier, flaky-tool simulation
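
The fields above map naturally onto a small task record with a scripted verifier. This is a sketch under assumptions: `GoldenTask`, `evaluate`, the `draft_status` tool name, and the state keys are hypothetical, and the conversation prefix and reference trajectory are omitted for brevity.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass(frozen=True)
class GoldenTask:
    task_id: str
    goal: str
    allowed_tools: frozenset
    verifier: Callable[[dict], bool]  # scripted check on the final sandbox state
    max_steps: int = 10
    tags: tuple = ()

def evaluate(task, tool_calls, final_state):
    """tool_calls: tool names the agent invoked, in order."""
    off_surface = [t for t in tool_calls if t not in task.allowed_tools]
    within_budget = len(tool_calls) <= task.max_steps
    return {
        "passed": task.verifier(final_state) and not off_surface and within_budget,
        "off_surface_tools": off_surface,
        "steps": len(tool_calls),
    }

task = GoldenTask(
    task_id="incident-001",
    goal="Classify the latency alert and page only if the SLO is breached",
    allowed_tools=frozenset({"run_query", "draft_status", "page_oncall"}),
    verifier=lambda s: s["classification"] == "within_slo" and s["pages_sent"] == 0,
    max_steps=6,
    tags=("happy_path", "safety_tier_1"),
)
final_state = {"classification": "within_slo", "pages_sent": 0}
report = evaluate(task, ["run_query", "draft_status"], final_state)
print(report)
```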

Run the full suite on every prompt, tool-schema, or model change. Gate merges on TSR and guardrail thresholds, not average logprob. Stratify the suite: 60% happy path, 25% tool-error recovery, 15% adversarial or ambiguous goals. If production traffic is harder than that mix, down-weight the happy-path stratum in aggregate scoring.
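
Re-weighting strata can be sketched as a per-stratum TSR combined with weights that reflect production traffic rather than suite composition. The stratum names, weights, and results below are invented for illustration, and strata with no results are simply skipped.

```python
def weighted_tsr(results, weights):
    """results: list of (stratum, passed). weights: {stratum: weight}, summing to 1."""
    counts = {}
    for stratum, passed in results:
        total, wins = counts.get(stratum, (0, 0))
        counts[stratum] = (total + 1, wins + int(passed))
    # Per-stratum pass rate, scaled by how much that stratum should count.
    return sum(weights[s] * wins / total for s, (total, wins) in counts.items())

# Illustrative suite: 6 happy-path, 4 recovery, 2 adversarial tasks.
results = (
    [("happy", True)] * 6
    + [("recovery", True)] * 2 + [("recovery", False)] * 2
    + [("adversarial", False)] * 2
)
raw_tsr = sum(passed for _, passed in results) / len(results)
# Assumed production mix that is harder than the suite mix.
production_weights = {"happy": 0.40, "recovery": 0.35, "adversarial": 0.25}
adjusted = weighted_tsr(results, production_weights)
print(round(raw_tsr, 3), round(adjusted, 3))
```

In this made-up example the raw rate is about 0.667 while the production-weighted rate is about 0.575, which shows how a happy-path-heavy suite can overstate readiness.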

Verifiers vs judges

Prefer executable verifiers (assert row count, file hash, mock API state) over LLM-as-judge when the outcome is machine-checkable. Use judges for trajectory rubrics (“justified escalation”, “appropriate empathy”) and calibrate judges weekly against human labels on 50–100 traces.
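
Calibration needs an agreement measure. Cohen's kappa is one common option because it corrects raw agreement for chance; using it here is a suggestion, not something the Harbor Analytics team is stated to have done. The labels below are illustrative.

```python
def cohens_kappa(judge_labels, human_labels):
    """Chance-corrected agreement between two label lists of equal length."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(judge_labels)
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n
    labels = set(judge_labels) | set(human_labels)
    # Agreement expected if both raters labeled independently at their own base rates.
    expected = sum(
        (judge_labels.count(label) / n) * (human_labels.count(label) / n)
        for label in labels
    )
    if expected == 1.0:
        return 1.0  # both raters used a single identical label throughout
    return (observed - expected) / (1 - expected)

# Illustrative pass/fail labels on the same eight traces.
judge = ["pass", "pass", "fail", "fail", "pass", "fail", "pass", "pass"]
human = ["pass", "pass", "fail", "pass", "pass", "fail", "fail", "pass"]
print(cohens_kappa(judge, human))
```

A kappa that drifts downward week over week suggests the judge rubric or prompt needs revisiting before its scores are trusted in a gate.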

Scoring trajectories with process reward models

Outcome-only scoring rewards lucky shortcuts and penalizes sound plans that hit flaky tools. Process scoring grades each step:

Was the subgoal identifiable from context?
Did the tool call advance the plan?
Was retry/backoff appropriate after errors?

Process reward models (PRMs) and step-level rubric judges help compare two agents that both succeed but one wastes 12 tool calls. For training pipelines, verifiable step checks align with verifiable rewards; for eval, lightweight judges on logged traces are usually enough.
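
The three step-level questions can be turned into a simple rubric score. The sketch below assumes each logged step already carries boolean answers (from a judge or scripted check) under hypothetical keys; it is a plain average, not a trained process reward model.

```python
CHECKS = ("subgoal_clear", "advanced_plan", "retry_appropriate")

def process_score(steps):
    """steps: list of dicts with a boolean answer for each rubric check.

    Returns the mean fraction of checks passed per step (0.0 to 1.0).
    """
    if not steps:
        return 0.0
    per_step = [sum(bool(step[c]) for c in CHECKS) / len(CHECKS) for step in steps]
    return sum(per_step) / len(per_step)

good_step = {"subgoal_clear": True, "advanced_plan": True, "retry_appropriate": True}
wasted_step = {"subgoal_clear": True, "advanced_plan": False, "retry_appropriate": True}

# Both hypothetical agents succeed; agent_b burns two calls that do not advance the plan.
agent_a = [good_step, good_step]
agent_b = [good_step, wasted_step, wasted_step, good_step]
print(process_score(agent_a), round(process_score(agent_b), 3))
```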

Harbor Analytics refactor: from SQL accuracy to incident workflows

The failing v1 benchmark had 120 single-turn SQL prompts. Production incidents required a five-step workflow: parse alert metadata, query the metrics warehouse, compare against the SLO, draft a status update, and optionally page on-call. v1 agents often skipped the SLO comparison or paged prematurely.

The v2 eval program added:

48 golden incident tasks with sandboxed Grafana and PagerDuty stubs; success = correct status classification without false pages.
Trajectory judge on 100% of CI runs; outcome verifier on final mock API state.
Metrics dashboard: TSR, false-page rate, median steps, tool-error recovery rate, cost per resolved incident.
Promotion gate: TSR ≥ 75% on full suite and false-page rate < 2% before canary; online resolution rate monitored for 72 hours.
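
The promotion gate above reduces to two threshold checks. A minimal sketch, assuming rates are expressed as fractions and that the function name and return shape are free choices:

```python
def promotion_gate(tsr, false_page_rate, min_tsr=0.75, max_false_page_rate=0.02):
    """Return (allowed, reasons). Thresholds mirror the gate described above:
    TSR >= 75% and false-page rate strictly below 2%."""
    reasons = []
    if tsr < min_tsr:
        reasons.append(f"TSR {tsr:.1%} is below the {min_tsr:.0%} minimum")
    if false_page_rate >= max_false_page_rate:
        reasons.append(
            f"false-page rate {false_page_rate:.1%} is not below {max_false_page_rate:.0%}"
        )
    return (not reasons, reasons)

allowed, reasons = promotion_gate(tsr=0.79, false_page_rate=0.018)
print(allowed, reasons)
```

Returning the reasons alongside the boolean makes a blocked CI run self-explanatory in the job log.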

TSR rose from 58% to 79%; false pages fell from 11% to 1.8%. The highest-leverage fix was not a smarter model but eval tasks that mirrored real tool orderings and penalized skipped verification steps.

Technique decision table

Approach	Best for	Weakness
Single-turn golden Q&A	Chat quality, RAG faithfulness, tone	Misses tool order, retries, side effects
Outcome-only agent TSR	Clear verifiable end states	Rewards unsafe shortcuts; ignores efficiency
Trajectory + outcome eval	Production agents with tools	Higher build cost; judge calibration needed
Public agent benchmarks (WebArena, etc.)	Research comparison, model shopping	Domain mismatch vs your tools and policies
Online eval only	Post-deploy drift detection	Slow feedback; risky as sole pre-merge gate
Human trajectory review	High-stakes workflows, calibration	Does not scale to every CI run

Mature teams stack sandbox golden tasks in CI, nightly judge sampling on production traces, and weekly human review on failures — connected to A/B gates for prompt and planner changes.

Common pitfalls

Testing the model, not the system — eval must include real tool schemas, retrieval, and termination rules shipped in production.
Stale golden tasks — API and policy changes silently invalidate success predicates; version tasks with your tool registry.
Non-determinism ignored — run stochastic agents N times or use fixed seeds; flake erodes trust in CI.
Judge overfitting — optimizing for a rubric judge that diverges from human on-call judgment.
Missing failure taxonomy — aggregate TSR hides “wrong tool” vs “right tool, bad args” vs “never terminated”; tag failures for targeted fixes.
No cost metrics — an agent that succeeds in 15 steps at 3× token cost is not production-ready.
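
Two of these pitfalls, ignored non-determinism and a missing failure taxonomy, lend themselves to small helpers. The sketch below assumes each task was run N times and that failures were already tagged; the tag names follow the examples above, while the function names are hypothetical.

```python
from collections import Counter

def classify_stability(run_results):
    """run_results: {task_id: [bool, ...]} with one entry per repeated run."""
    labels = {}
    for task_id, runs in run_results.items():
        pass_rate = sum(runs) / len(runs)
        if pass_rate == 1.0:
            labels[task_id] = "stable_pass"
        elif pass_rate == 0.0:
            labels[task_id] = "stable_fail"
        else:
            labels[task_id] = "flaky"  # candidate for quarantine or a fixture fix
    return labels

def failure_breakdown(failure_tags):
    """failure_tags: one tag per failed run, e.g. 'wrong_tool' or 'bad_args'."""
    return Counter(failure_tags).most_common()

# Illustrative data only.
stability = classify_stability({
    "incident-001": [True, True, True],
    "incident-002": [True, False, True],
    "incident-003": [False, False, False],
})
breakdown = failure_breakdown(["bad_args", "wrong_tool", "bad_args", "never_terminated"])
print(stability, breakdown)
```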

Production checklist

Define task success with executable verifiers where possible.
Build a golden suite covering happy path, errors, and policy edge cases.
Log full trajectories: thoughts, tool calls, observations, outcomes.
Score tool selection, argument validity, and redundancy separately from TSR.
Run sandbox replay in CI on every agent-affecting change.
Calibrate trajectory judges against weekly human labels.
Track steps, tokens, and dollars per successful task.
Gate promotion on TSR and guardrail metrics, not single-turn accuracy.
Feed failed trajectories back into golden tasks and training data.
Pair offline agent eval with online resolution metrics post-deploy.
Version golden tasks with tool schemas and sandbox fixtures.
Document known flaky tasks; quarantine or fix, do not ignore red CI.

Key takeaways

Agents are trajectories, not single completions. Benchmark multi-step tool use.
Outcome verifiers beat judges when state is machine-checkable.
Harbor Analytics raised incident TSR from 58% to 79% with workflow-shaped golden tasks.
Process metrics catch unsafe success and wasteful loops.
Stack sandbox CI, online sampling, and human review on failures.