Guide
LLM agent observability and tracing explained
Harbor Analytics’ on-call agent cleared 79% of golden incident tasks in CI, yet production escalations still burned 47 minutes of engineer time per failure, spent not on fixing the outage but on reconstructing what the agent did. Logs showed a final answer and a wall of unstructured stdout; nobody could see that step 4 called run_query against the staging warehouse, step 6 retried the same failing tool twice, and step 8 skipped the verification span entirely. After shipping structured traces with per-step spans, token attribution, and one-click replay into the sandbox, mean time to debug dropped to 11 minutes and repeat-incident rate fell 34%.
Agent observability is the practice of recording multi-step ReAct-style runs as queryable traces — not just chat transcripts. This guide covers span hierarchies, what to instrument on each turn, latency and cost attribution, PII redaction and sampling, connecting traces to eval suites and online evaluation, the Harbor Analytics refactor, a technique decision table versus plain chat logs and generic APM, pitfalls, and a production checklist.
Why chat logs are not enough
A typical agent run spans dozens of internal events: model prefill, tool dispatch, external API round trips, observation compression, policy gate checks, and loop termination. Storing only the user message and final assistant reply hides the failure mode. You cannot answer:
- Which tool added 18 seconds of latency?
- Did the model see a truncated observation on step 5?
- Was the retry loop caused by a 429 or a schema validation error?
- How many input tokens did tool results consume before the budget trim?
- Did two parallel tool calls race and return conflicting facts?
Observability treats each agent run as a trace: a tree of spans with timestamps, attributes, and parent-child links. That model maps cleanly onto OpenTelemetry conventions and most vendor dashboards (Datadog, Honeycomb, Langfuse, Phoenix, LangSmith, etc.) without locking you to one stack.
Span hierarchy for agent runs
Standardize one root span per user-facing task (or per background job), then nest children for each logical unit of work:
trace: incident_resolve_8f3a
  span: agent.run (root: task_id, user_id hash, model_id)
    span: llm.generate (step 1: prompt_tokens, completion_tokens, latency_ms)
    span: tool.run_query (step 1 action: args hash, warehouse, row_count)
    span: llm.generate (step 2)
    span: tool.get_alert_threshold (step 2 action)
    span: policy.gate (refund_cap check: allow/deny)
    span: llm.generate (step 3, finish_reason: stop)
    span: agent.finalize (summary emitted to user)
Key attributes on the root span: agent_version, prompt_template_hash, session_id, and termination_reason (goal_met, max_steps, budget_exceeded, user_cancel).
On each llm.generate span: model name, temperature, input/output token counts, time-to-first-token, and whether context budget trimming fired. On each tool.* span: tool name, argument schema version, HTTP status or error code, payload byte size before compression, and idempotency key if applicable.
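As a rough illustration of the hierarchy above, the sketch below records nested spans with parent-child links using only the Python standard library. The TraceRecorder class, its span() helper, and every attribute value are hypothetical stand-ins; a production system would more likely use an OpenTelemetry SDK or a vendor client.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Any, Dict, Iterator, List, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    span_id: str
    parent_id: Optional[str]
    attributes: Dict[str, Any] = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0


class TraceRecorder:
    """Collects spans for one agent task (hypothetical helper, in-memory only)."""

    def __init__(self, trace_id: str) -> None:
        self.trace_id = trace_id
        self.spans: List[Span] = []
        self._stack: List[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **attributes: Any) -> Iterator[Span]:
        # The innermost open span becomes the parent of the new one.
        parent_id = self._stack[-1].span_id if self._stack else None
        s = Span(name, self.trace_id, uuid.uuid4().hex[:16], parent_id, dict(attributes))
        s.start_ns = time.monotonic_ns()
        self.spans.append(s)
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end_ns = time.monotonic_ns()
            self._stack.pop()


recorder = TraceRecorder(trace_id="incident_resolve_8f3a")
with recorder.span("agent.run", agent_version="v1") as root:
    with recorder.span("llm.generate", step=1, prompt_tokens=900, completion_tokens=60):
        pass  # the model call would go here
    with recorder.span("tool.run_query", step=1, warehouse="prod", row_count=12):
        pass  # the tool call would go here
    root.attributes["termination_reason"] = "goal_met"
```

Because every child carries the root's span_id as its parent_id, the run can be rebuilt as a tree from a flat list of spans, which is the shape most trace backends expect.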
ReAct step envelope
Even if you do not expose chain-of-thought to users, record a structured step record server-side (thought_ref is optional and points to an internal reasoning span):
{
  "step": 4,
  "thought_ref": "span_id_abc",
  "action": { "tool": "run_query", "args_digest": "sha256:..." },
  "observation_digest": "sha256:...",
  "observation_bytes": 4821,
  "observation_truncated": true,
  "latency_ms": 1240,
  "policy_flags": []
}
Store full observations in object storage keyed by digest; traces carry digests and byte counts so engineers can fetch payloads on demand without bloating the trace backend.
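One way to build such an envelope is sketched below. The OBJECT_STORE dict stands in for a real bucket, and record_step, MAX_OBSERVATION_BYTES, and the rule that an observation counts as truncated once it exceeds that budget are all assumptions made for illustration.

```python
import hashlib
import json
from typing import Any, Dict

# Stand-in for an object storage bucket; a real system would use S3 or similar.
OBJECT_STORE: Dict[str, bytes] = {}

MAX_OBSERVATION_BYTES = 4096  # hypothetical per-step context budget


def digest(payload: bytes) -> str:
    return "sha256:" + hashlib.sha256(payload).hexdigest()


def record_step(step: int, tool: str, args: Dict[str, Any],
                observation: str, latency_ms: int) -> Dict[str, Any]:
    raw = observation.encode("utf-8")
    obs_digest = digest(raw)
    OBJECT_STORE[obs_digest] = raw  # the full payload lives outside the trace
    # Sort keys so identical arguments always produce the same digest.
    args_bytes = json.dumps(args, sort_keys=True).encode("utf-8")
    return {
        "step": step,
        "action": {"tool": tool, "args_digest": digest(args_bytes)},
        "observation_digest": obs_digest,
        "observation_bytes": len(raw),
        "observation_truncated": len(raw) > MAX_OBSERVATION_BYTES,
        "latency_ms": latency_ms,
        "policy_flags": [],
    }


envelope = record_step(4, "run_query", {"sql": "SELECT 1"}, "x" * 4821, 1240)
# An engineer fetches the payload on demand using the digest from the trace.
full_observation = OBJECT_STORE[envelope["observation_digest"]]
```

Content-addressed keys also deduplicate storage: two steps that return the same observation share one stored object.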
Metrics that matter in production
Traces answer “what happened on this run?” Metrics answer “is the fleet healthy?” Export both from the same instrumentation:
- Task success rate — goal met within step and cost caps (align with eval definitions).
- Steps per task — p50/p95; spikes often precede retry storms.
- Tool error rate by tool — split 4xx schema errors from 5xx upstream failures.
- Tool latency p95 — per tool and per dependency region.
- Tokens per task — input vs output; attribute tool-result inflation.
- Cost per successful task — model + tool API charges where measurable.
- Time to first useful action — steps before first non-noop tool call.
- Human escalation rate — tasks handed to operators despite agent completion.
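Several of these metrics fall out of the same per-task summary. The sketch below assumes each root span has already been reduced to a small record; the field names and sample values are hypothetical, and cost per successful task is taken here to mean total spend divided by the number of successes, so that failed runs still count against the budget.

```python
from statistics import median, quantiles
from typing import Any, Dict, List

# Hypothetical per-task summaries derived from root spans.
tasks: List[Dict[str, Any]] = [
    {"success": True, "steps": 4, "cost_usd": 0.02, "escalated": False},
    {"success": True, "steps": 6, "cost_usd": 0.03, "escalated": True},
    {"success": False, "steps": 12, "cost_usd": 0.09, "escalated": True},
    {"success": True, "steps": 5, "cost_usd": 0.02, "escalated": False},
]


def fleet_metrics(tasks: List[Dict[str, Any]]) -> Dict[str, float]:
    total = len(tasks)
    successes = [t for t in tasks if t["success"]]
    steps = [t["steps"] for t in tasks]
    total_cost = sum(t["cost_usd"] for t in tasks)
    return {
        "task_success_rate": len(successes) / total,
        "steps_p50": median(steps),
        # 19 cut points for n=20; the last one is the 95th percentile.
        "steps_p95": quantiles(steps, n=20, method="inclusive")[18],
        "cost_per_successful_task": (
            total_cost / len(successes) if successes else float("inf")
        ),
        "human_escalation_rate": sum(t["escalated"] for t in tasks) / total,
    }


metrics = fleet_metrics(tasks)
```

In practice these values would be computed per time window and per agent_version so that a deploy shows up as a step change rather than being averaged away.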
Alert on rate of change, not static thresholds: a jump in run_query latency p95 or a doubling of median steps per task after a prompt deploy is more actionable than a fixed error count.
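A rate-of-change alert can be as simple as comparing the p95 of a current window against a baseline window. The sketch below is one possible shape; the 50% threshold, the function names, and the sample latencies are assumptions to tune against your own traffic.

```python
from statistics import quantiles
from typing import Sequence


def p95(values: Sequence[float]) -> float:
    # Inclusive method interpolates within the observed range.
    return quantiles(values, n=20, method="inclusive")[18]


def relative_change(baseline: float, current: float) -> float:
    if baseline <= 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


def should_alert(baseline_window: Sequence[float],
                 current_window: Sequence[float],
                 max_increase: float = 0.5) -> bool:
    """Fire when p95 grew by more than max_increase relative to the baseline."""
    return relative_change(p95(baseline_window), p95(current_window)) > max_increase


# Hypothetical run_query latencies in milliseconds, before and after a deploy.
before_deploy = [800, 900, 950, 1000, 1100]
after_deploy = [900, 1500, 1800, 2100, 2400]
alert = should_alert(before_deploy, after_deploy)
```

The same comparison works for median steps per task: swap p95 for a median and set max_increase to 1.0 to catch a doubling.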
PII, secrets and sampling
Agent traces tempt over-collection. Production defaults should be conservative:
- Redact at ingest: regex and NER passes on observations before persistence; never log raw API keys, tokens, or full customer records.
- Hash stable identifiers: store user_id_hash and account_id_hash for correlation without reversible PII in cold storage.
- Tiered retention: hot traces 7–14 days with full step metadata; cold archive keeps digests and metrics only.
- Sampling: keep 100% of errors and policy denials, which requires a tail-based decision because the outcome is only known once the trace completes; probabilistically sample successes (e.g. 5–10%). Always sample canary and shadow traffic at 100%.
- Break-glass access: audited replay of full observations for on-call roles only, with justification logged.
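The redaction and hashing defaults above might look like the following minimal sketch. The two regular expressions are illustrative only and far from exhaustive, the NER pass is omitted, and HASH_KEY is a hypothetical secret; a keyed HMAC is used here as one way to keep hashes stable for correlation while making them hard to reverse by brute force.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; a production pipeline would add many more
# and pair them with an NER pass.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER_RE = re.compile(r"(?i)bearer\s+[A-Za-z0-9._~+/=-]+")

HASH_KEY = b"rotate-me"  # hypothetical; load from a secrets manager in practice


def redact(text: str) -> str:
    """Replace obvious secrets and emails before the observation is persisted."""
    text = BEARER_RE.sub("[REDACTED_TOKEN]", text)
    return EMAIL_RE.sub("[REDACTED_EMAIL]", text)


def stable_hash(identifier: str) -> str:
    """Keyed hash: the same input always maps to the same opaque value."""
    return hmac.new(HASH_KEY, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


clean = redact("Authorization: Bearer abc.def-123 sent by ana@example.com")
user_id_hash = stable_hash("user-42")
```

Running redaction before the digest is computed means the object store never holds the raw secret, even under break-glass access.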
Sampling must be trace-coherent: if you keep a trace, keep all its spans. A partial trace misleads a debugging session more than a missing one.
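One way to keep sampling trace-coherent is to derive the decision from the trace id alone, so every service that sees the same id reaches the same verdict. The sketch below assumes the decision is made once the trace outcome is known; keep_trace and its parameters are hypothetical names.

```python
import hashlib

SUCCESS_SAMPLE_RATE = 0.10  # upper end of the 5–10% range for successes


def keep_trace(trace_id: str, has_error: bool,
               policy_denied: bool, is_canary: bool) -> bool:
    """Decide once per trace; apply the result to every span in it."""
    if has_error or policy_denied or is_canary:
        return True
    # Map the trace id to a stable value in [0, 1) so the decision is
    # deterministic and identical across services.
    prefix = hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8]
    bucket = int(prefix, 16) / 0x100000000
    return bucket < SUCCESS_SAMPLE_RATE


# Hypothetical successful traces: roughly 10% should be retained.
kept = sum(keep_trace(f"trace-{i}", False, False, False) for i in range(10_000))
```

Spans are then filtered by looking up their trace's decision, never by sampling each span independently.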
Trace replay and eval bridge
The highest-leverage observability feature is replay: feed a captured trace (or a single failed step) into your sandbox with mocked tools, then diff the new trajectory against production. Workflow:
- Filter traces where task_success=false or human_escalation=true.
- Cluster by first_failing_tool and error_class.
- Promote top clusters into golden tasks in your regression suite.
- On prompt or tool schema change, replay N recent production traces in CI shadow mode.
This closes the loop between online evaluation and offline benchmarks: production discovers edge cases; eval prevents regressions; traces prove which fix addressed which cluster.
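The filter-and-cluster steps of that workflow can be sketched in a few lines. The trace records and the top_failure_clusters helper below are hypothetical; the field names follow the ones used in the workflow.

```python
from collections import Counter
from typing import Any, Dict, List, Tuple

# Hypothetical trace summaries exported from the trace backend.
traces: List[Dict[str, Any]] = [
    {"trace_id": "t1", "task_success": False, "human_escalation": False,
     "first_failing_tool": "run_query", "error_class": "schema_validation"},
    {"trace_id": "t2", "task_success": False, "human_escalation": True,
     "first_failing_tool": "run_query", "error_class": "schema_validation"},
    {"trace_id": "t3", "task_success": True, "human_escalation": True,
     "first_failing_tool": "get_alert_threshold", "error_class": "http_429"},
    {"trace_id": "t4", "task_success": True, "human_escalation": False,
     "first_failing_tool": None, "error_class": None},
]


def top_failure_clusters(traces: List[Dict[str, Any]],
                         k: int = 2) -> List[Tuple[Tuple[str, str], int]]:
    # Step 1: keep failed or escalated runs.
    failed = [t for t in traces if not t["task_success"] or t["human_escalation"]]
    # Step 2: cluster by (first_failing_tool, error_class).
    counts = Counter((t["first_failing_tool"], t["error_class"]) for t in failed)
    # Step 3: the largest clusters are the candidates for golden tasks.
    return counts.most_common(k)


clusters = top_failure_clusters(traces)
```

Each returned cluster key can then be turned into a regression scenario by picking a representative trace and replaying it in the sandbox with mocked tools.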
Harbor Analytics refactor
Before structured tracing, Harbor’s on-call agent logged JSON lines to CloudWatch with no shared schema. Engineers grepped by request_id across three services. After the refactor:
- Unified trace_id propagated from API gateway through the agent worker and tool sidecars.
- OTLP export to Honeycomb with 10% success sampling and 100% error sampling.
- Tool observations stored in S3 with SHA-256 digests; traces link via observation_digest.
- Dashboard: steps-per-task, tool error heatmap, token budget trim rate.
- “Replay in sandbox” button on any failed trace for on-call.
Outcomes over six weeks: mean time to debug agent failures fell from 47 to 11 minutes; repeat incidents on the same root cause fell 34%; the golden-task suite grew from 42 to 97 scenarios sourced from production traces. Eval pass rate rose from 79% to 86% as clusters became test cases rather than anecdotes.
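The source does not say how Harbor propagated its trace_id. One common approach, shown here purely as an assumption, is the W3C Trace Context traceparent header, where each hop keeps the trace id and mints a new span id. The helper names are hypothetical.

```python
import re
import secrets
from typing import Dict, Optional

# traceparent layout: version, 32-hex trace id, 16-hex parent span id, 2-hex flags.
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(trace_id: Optional[str] = None) -> str:
    trace_id = trace_id or secrets.token_hex(16)  # 16 bytes -> 32 hex chars
    span_id = secrets.token_hex(8)                # 8 bytes -> 16 hex chars
    return f"00-{trace_id}-{span_id}-01"          # flag 01 marks the trace as sampled


def child_headers(incoming: Dict[str, str]) -> Dict[str, str]:
    """Reuse the caller's trace id if present; otherwise start a new trace."""
    match = TRACEPARENT_RE.match(incoming.get("traceparent", ""))
    trace_id = match.group(1) if match else None
    return {"traceparent": new_traceparent(trace_id)}


gateway = {"traceparent": new_traceparent()}
worker = child_headers(gateway)   # agent worker calls out with the same trace id
sidecar = child_headers(worker)   # tool sidecar continues the same trace
```

With one id shared across hops, a single query on trace_id replaces grepping three services by request_id.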
Technique decision table
| Need | Prefer | Avoid |
|---|---|---|
| Debug one failed production run | Full trace + observation fetch by digest | Grep unstructured chat logs |
| Fleet health after deploy | Metrics + sampled traces + eval CI | Manual spot-check of final answers |
| Latency regression | Per-span latency breakdown (LLM vs tool) | End-to-end wall clock only |
| Cost attribution | Token spans per step + tool API counters | Monthly invoice postmortems |
| Compliance audit | Redacted traces + break-glass policy | Full prompt logging by default |
| Low-volume internal agent | 100% trace retention short window | Heavy APM stack with no agent semantics |
| High-volume customer agent | Error-heavy + stratified success sampling | Store every observation verbatim 90 days |
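To make the latency and cost rows concrete, the sketch below groups span latency by kind and prices LLM tokens per step. The span records and the PRICE_PER_1K rates are hypothetical, and summing latencies assumes the spans ran sequentially; parallel tool calls would overlap and need wall-clock intervals instead.

```python
from collections import defaultdict
from typing import Any, Dict, List

# Hypothetical spans from one run.
spans: List[Dict[str, Any]] = [
    {"name": "llm.generate", "step": 1, "latency_ms": 900,
     "input_tokens": 1200, "output_tokens": 80},
    {"name": "tool.run_query", "step": 1, "latency_ms": 18000},
    {"name": "llm.generate", "step": 2, "latency_ms": 1100,
     "input_tokens": 5200, "output_tokens": 120},
    {"name": "policy.gate", "step": 2, "latency_ms": 15},
]

# Hypothetical prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {"input": 0.003, "output": 0.015}


def latency_by_kind(spans: List[Dict[str, Any]]) -> Dict[str, int]:
    totals: Dict[str, int] = defaultdict(int)
    for s in spans:
        kind = s["name"].split(".")[0]  # "llm", "tool", or "policy"
        totals[kind] += s["latency_ms"]
    return dict(totals)


def llm_cost_usd(spans: List[Dict[str, Any]]) -> float:
    cost = 0.0
    for s in spans:
        if s["name"].startswith("llm."):
            cost += s["input_tokens"] / 1000 * PRICE_PER_1K["input"]
            cost += s["output_tokens"] / 1000 * PRICE_PER_1K["output"]
    return cost


breakdown = latency_by_kind(spans)
run_cost = llm_cost_usd(spans)
```

In this sample the tool span accounts for 18 of the roughly 20 seconds, which end-to-end wall clock alone would not reveal, and the jump in input tokens at step 2 shows how a large tool result inflates cost.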
Common pitfalls
- Logging prompts verbatim — compliance incidents; use digests and redaction pipelines.
- Flat log lines without trace_id — cannot reconstruct multi-service agent runs.
- Missing tool spans — LLM latency looks fine while external APIs stall.
- Inconsistent step numbering — breaks replay and eval alignment.
- Truncation invisible in traces — engineers debug “hallucinations” that were context drops.
- Metrics without eval definitions — “success” in dashboards disagrees with QA.
- 100% sampling at scale — cost explosion; trace backend becomes the bottleneck.
- No link from trace to prompt version — cannot bisect regressions across deploys.
Production checklist
- Assign one trace_id per agent task; propagate across all services.
- Nest spans: agent.run → llm.generate / tool.* / policy.*.
- Record per-step structured envelope with action, observation digest, and latency.
- Attribute input/output tokens and model id on every LLM span.
- Store full observations in object storage; traces carry digests only.
- Redact PII and secrets at ingest; hash user identifiers.
- Sample successes; retain 100% of errors, policy denials, and canary traffic.
- Export task success, steps-per-task, tool error rate, and cost-per-success metrics.
- Build dashboards for latency breakdown and post-deploy step-count shifts.
- Wire “replay in sandbox” from failed traces into CI golden tasks.
- Tag traces with prompt_template_hash and agent_version.
- Document break-glass observation access and audit requirements.
Key takeaways
- Agents need trace semantics, not just chat transcripts — spans expose tool and budget failures.
- Observation digests + object storage keep traces queryable without storing megabytes per run.
- Harbor Analytics cut debug time 47 → 11 minutes with OTLP traces and sandbox replay.
- Bridge traces to eval by promoting failed production clusters into golden tasks.
- Sample intelligently — full fidelity on errors, statistical coverage on successes.
Related reading
- Agent evaluation and benchmarking — trajectory metrics that traces should feed
- ReAct agent loop — the step anatomy your spans should mirror
- Online evaluation — production scoring paired with trace sampling
- Sandbox execution — safe replay of captured trajectories