Guide

LLM agent observability and tracing explained

Harbor Analytics’ on-call agent cleared 79% of golden incident tasks in CI, yet production escalations still burned 47 minutes of engineer time per failure — not fixing the outage, but reconstructing what the agent did. Logs showed a final answer and a wall of unstructured stdout; nobody could see that step 4 called run_query against the staging warehouse, step 6 retried the same failing tool twice, and step 8 skipped the verification span entirely. After shipping structured traces with per-step spans, token attribution, and one-click replay into the sandbox, mean time to debug dropped to 11 minutes and repeat-incident rate fell 34%.

Agent observability is the practice of recording multi-step ReAct-style runs as queryable traces — not just chat transcripts. This guide covers span hierarchies, what to instrument on each turn, latency and cost attribution, PII redaction and sampling, connecting traces to eval suites and online evaluation, the Harbor Analytics refactor, a technique decision table versus plain chat logs and generic APM, pitfalls, and a production checklist.

Why chat logs are not enough

A typical agent run spans dozens of internal events: model prefill, tool dispatch, external API round trips, observation compression, policy gate checks, and loop termination. Storing only the user message and final assistant reply hides the failure mode. You cannot answer:

  • Which tool added 18 seconds of latency?
  • Did the model see a truncated observation on step 5?
  • Was the retry loop caused by a 429 or a schema validation error?
  • How many input tokens did tool results consume before the budget trim?
  • Did two parallel tool calls race and return conflicting facts?

Observability treats each agent run as a trace: a tree of spans with timestamps, attributes, and parent-child links. That model maps cleanly onto OpenTelemetry conventions and most vendor dashboards (Datadog, Honeycomb, Langfuse, Phoenix, LangSmith, etc.) without locking you to one stack.

Span hierarchy for agent runs

Standardize one root span per user-facing task (or per background job), then nest children for each logical unit of work:

trace: incident_resolve_8f3a
  span: agent.run                    (root — task_id, user_id hash, model_id)
    span: llm.generate               (step 1 — prompt_tokens, completion_tokens, latency_ms)
    span: tool.run_query             (step 1 action — args hash, warehouse, row_count)
    span: llm.generate               (step 2)
    span: tool.get_alert_threshold   (step 2 action)
    span: policy.gate                (refund_cap check — allow/deny)
    span: llm.generate               (step 3 — finish_reason: stop)
    span: agent.finalize             (summary emitted to user)

Key attributes on the root span: agent_version, prompt_template_hash, session_id, termination_reason (goal_met, max_steps, budget_exceeded, user_cancel). On each llm.generate span: model name, temperature, input/output token counts, time-to-first-token, and whether context budget trimming fired. On each tool.* span: tool name, argument schema version, HTTP status or error code, payload byte size before compression, and idempotency key if applicable.
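
To make the hierarchy above concrete, here is a minimal sketch in plain Python of a span tree with parent-child links. It is not the OpenTelemetry SDK: `Tracer`, `Span`, and the attribute values (`agent_version="v1"`, the token counts) are hypothetical and only illustrate where the attributes listed above would be attached.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field
from typing import Iterator, Optional


@dataclass
class Span:
    name: str
    trace_id: str
    parent_id: Optional[str] = None
    span_id: str = field(default_factory=lambda: uuid.uuid4().hex[:16])
    attributes: dict = field(default_factory=dict)
    start_ns: int = 0
    end_ns: int = 0

    @property
    def latency_ms(self) -> float:
        return (self.end_ns - self.start_ns) / 1e6


class Tracer:
    """Collects finished spans for one agent task (hypothetical helper)."""

    def __init__(self, trace_id: str) -> None:
        self.trace_id = trace_id
        self.spans: list[Span] = []
        self._stack: list[Span] = []  # currently open spans, innermost last

    @contextmanager
    def span(self, name: str, **attributes) -> Iterator[Span]:
        # The innermost open span, if any, becomes the parent.
        parent = self._stack[-1].span_id if self._stack else None
        s = Span(name, self.trace_id, parent, attributes=dict(attributes))
        s.start_ns = time.perf_counter_ns()
        self._stack.append(s)
        try:
            yield s
        finally:
            s.end_ns = time.perf_counter_ns()
            self._stack.pop()
            self.spans.append(s)  # spans are recorded in finish order


tracer = Tracer(trace_id="incident_resolve_8f3a")
with tracer.span("agent.run", agent_version="v1") as root:
    with tracer.span("llm.generate", step=1, prompt_tokens=812, completion_tokens=64):
        pass  # the model call would go here
    with tracer.span("tool.run_query", step=1, warehouse="staging") as tool:
        tool.attributes["row_count"] = 0  # set once the result is known
    root.attributes["termination_reason"] = "goal_met"
```

Because every child records its parent's `span_id`, a backend can rebuild the tree from a flat list of spans, which is the same shape OTLP exporters send.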

ReAct step envelope

Even if you do not expose chain-of-thought to users, keep a structured step record server-side. The thought_ref field is optional and points at an internal reasoning span:

{
  "step": 4,
  "thought_ref": "span_id_abc",
  "action": { "tool": "run_query", "args_digest": "sha256:..." },
  "observation_digest": "sha256:...",
  "observation_bytes": 4821,
  "observation_truncated": true,
  "latency_ms": 1240,
  "policy_flags": []
}

Store full observations in object storage keyed by digest; traces carry digests and byte counts so engineers can fetch payloads on demand without bloating the trace backend.
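
The digest-keyed storage described above can be sketched as follows. This is an illustration under assumptions, not a reference implementation: `ObservationStore` is an in-memory stand-in for an object store, and `max_bytes` is a hypothetical trim threshold used to set `observation_truncated`.

```python
import hashlib
import json


def digest(payload: bytes) -> str:
    """Content digest in the 'sha256:<hex>' form used by the step envelope."""
    return "sha256:" + hashlib.sha256(payload).hexdigest()


class ObservationStore:
    """In-memory stand-in for an object store keyed by content digest."""

    def __init__(self) -> None:
        self._blobs: dict[str, bytes] = {}

    def put(self, payload: bytes) -> str:
        key = digest(payload)
        self._blobs[key] = payload  # identical payloads share one key
        return key

    def get(self, key: str) -> bytes:
        return self._blobs[key]


def step_envelope(step, tool, args, observation, latency_ms, store, max_bytes=4096):
    """Build the per-step record; only digests and sizes go into the trace."""
    # sort_keys makes the digest independent of argument ordering.
    args_bytes = json.dumps(args, sort_keys=True).encode("utf-8")
    return {
        "step": step,
        "action": {"tool": tool, "args_digest": digest(args_bytes)},
        "observation_digest": store.put(observation),
        "observation_bytes": len(observation),
        "observation_truncated": len(observation) > max_bytes,
        "latency_ms": latency_ms,
        "policy_flags": [],
    }


store = ObservationStore()
observation = b"x" * 4821  # placeholder payload of the size shown above
envelope = step_envelope(4, "run_query", {"sql": "SELECT 1"}, observation, 1240, store)
full_payload = store.get(envelope["observation_digest"])  # fetched on demand
```

Keying by content digest also deduplicates: a tool that returns the same payload on a retry adds no new object, and the repeated digest in the trace makes the retry easy to spot.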

Metrics that matter in production

Traces answer “what happened on this run?” Metrics answer “is the fleet healthy?” Export both from the same instrumentation:

  • Task success rate — goal met within step and cost caps (align with eval definitions).
  • Steps per task — p50/p95; spikes often precede retry storms.
  • Tool error rate by tool — split 4xx schema errors from 5xx upstream failures.
  • Tool latency p95 — per tool and per dependency region.
  • Tokens per task — input vs output; attribute tool-result inflation.
  • Cost per successful task — model + tool API charges where measurable.
  • Time to first useful action — steps before first non-noop tool call.
  • Human escalation rate — tasks handed to operators despite agent completion.

Alert on rate of change, not static thresholds: a jump in run_query latency p95 or a doubling of median steps per task after a prompt deploy is more actionable than a fixed error count.
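
As a rough illustration of alerting on rate of change, the sketch below compares median steps per task before and after a deploy. The function names, the sample data, and the `max_increase=1.0` threshold (meaning "doubled") are assumptions chosen for the example.

```python
import statistics


def p95(values):
    # With n=20, quantiles() returns 19 cut points; the last one is the p95.
    return statistics.quantiles(values, n=20, method="inclusive")[-1]


def relative_change(baseline: float, current: float) -> float:
    """Fractional change from baseline; 1.0 means the value doubled."""
    if baseline == 0:
        return float("inf") if current > 0 else 0.0
    return (current - baseline) / baseline


def should_alert(baseline_steps, current_steps, max_increase=1.0) -> bool:
    """Alert when median steps per task grew by at least max_increase."""
    before = statistics.median(baseline_steps)
    after = statistics.median(current_steps)
    return relative_change(before, after) >= max_increase


steps_before_deploy = [4, 5, 5, 6, 4, 5]     # median 5
steps_after_deploy = [9, 11, 10, 12, 10, 9]  # median 10
alert = should_alert(steps_before_deploy, steps_after_deploy)
```

The same `relative_change` check can be applied to `p95` of a tool's latency; comparing against a rolling baseline rather than a fixed number is what keeps the alert meaningful as traffic changes.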

PII, secrets and sampling

Agent traces tempt over-collection. Production defaults should be conservative:

  • Redact at ingest — regex and NER passes on observations before persistence; never log raw API keys, tokens, or full customer records.
  • Hash stable identifiers — store user_id_hash and account_id_hash for correlation without reversible PII in cold storage.
  • Tiered retention — hot traces 7–14 days with full step metadata; cold archive keeps digests and metrics only.
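  • Sampling — keep 100% of errors and policy denials, which requires a tail-based decision made after the run completes because a head sampler cannot know the outcome in advance; probabilistically sample successes (e.g. 5–10%). Always sample canary and shadow traffic at 100%.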
  • Break-glass access — audited replay of full observations for on-call roles only, with justification logged.
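
A minimal sketch of the redact-at-ingest and identifier-hashing defaults above. The two regexes are illustrative and far from complete, the NER pass is omitted, and using a keyed hash (HMAC with a secret `key`) is an assumption made here so that low-entropy identifiers are harder to reverse by brute force.

```python
import hashlib
import hmac
import re

# Illustrative patterns only; a real pipeline would cover many more secret shapes.
EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
BEARER = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")


def redact(text: str) -> str:
    """Mask e-mail addresses and bearer tokens before an observation is persisted."""
    text = EMAIL.sub("[EMAIL]", text)
    return BEARER.sub("Bearer [REDACTED]", text)


def hash_identifier(value: str, key: bytes) -> str:
    """Stable keyed hash, e.g. for user_id_hash or account_id_hash."""
    return hmac.new(key, value.encode("utf-8"), hashlib.sha256).hexdigest()


raw = "contact a.b@example.com with Authorization: Bearer abc.def-123"
clean = redact(raw)
user_id_hash = hash_identifier("user-42", key=b"hypothetical-secret")
```

The hash is deterministic for a given key, so the same user still correlates across traces without the raw identifier ever reaching cold storage.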

Sampling must be trace-coherent: if you keep a trace, keep all its spans. A partial trace misleads a debugging session more than a missing one does.
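
One way to get trace-coherent sampling is to take a single keep-or-drop decision per trace, keyed on the trace_id, once the run has finished and its outcome is known. The sketch below assumes that approach; `keep_trace` and its parameters are hypothetical names, and the 10% default mirrors the success-sampling range mentioned above.

```python
import hashlib


def keep_trace(trace_id: str, *, has_error: bool, policy_denied: bool,
               is_canary: bool, success_rate: float = 0.10) -> bool:
    """Decide once per trace; every span of the trace follows this decision."""
    if has_error or policy_denied or is_canary:
        return True  # full fidelity on the runs people will need to debug
    # Map the trace_id to a stable number in [0, 1) so that every service
    # evaluating the same trace_id reaches the same answer.
    bucket = int(hashlib.sha256(trace_id.encode("utf-8")).hexdigest()[:8], 16)
    return bucket / 0x1_0000_0000 < success_rate


decision = keep_trace("incident_resolve_8f3a", has_error=False,
                      policy_denied=False, is_canary=False)
```

Hashing the trace_id instead of calling a random number generator is the design choice that matters: the agent worker and the tool sidecars can each apply the rule independently and still keep or drop the same traces.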

Trace replay and eval bridge

The highest-leverage observability feature is replay: feed a captured trace (or a single failed step) into your sandbox with mocked tools, then diff the new trajectory against production. Workflow:

  1. Filter traces where task_success=false or human_escalation=true.
  2. Cluster by first_failing_tool and error_class.
  3. Promote top clusters into golden tasks in your regression suite.
  4. On prompt or tool schema change, replay N recent production traces in CI shadow mode.

This closes the loop between online evaluation and offline benchmarks: production discovers edge cases; eval prevents regressions; traces prove which fix addressed which cluster.
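
Steps 1 to 3 of the workflow above can be sketched with a counter over failed traces. The trace records are plain dicts with the field names used in the workflow; the sample data and the `top_failure_clusters` helper are made up for illustration.

```python
from collections import Counter


def top_failure_clusters(traces, n=3):
    """Filter failed or escalated traces, then rank (tool, error_class) clusters."""
    failed = [t for t in traces if not t["task_success"] or t["human_escalation"]]
    counts = Counter(
        (t.get("first_failing_tool"), t.get("error_class")) for t in failed
    )
    return counts.most_common(n)


traces = [
    {"task_success": False, "human_escalation": False,
     "first_failing_tool": "run_query", "error_class": "schema_validation"},
    {"task_success": False, "human_escalation": True,
     "first_failing_tool": "run_query", "error_class": "schema_validation"},
    {"task_success": False, "human_escalation": True,
     "first_failing_tool": "get_alert_threshold", "error_class": "http_429"},
    {"task_success": True, "human_escalation": False,
     "first_failing_tool": None, "error_class": None},
]

clusters = top_failure_clusters(traces)
```

Each returned cluster is a candidate golden task: pick a representative trace from it, capture its observations by digest, and add the scenario to the regression suite.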

Harbor Analytics refactor

Before structured tracing, Harbor’s on-call agent logged JSON lines to CloudWatch with no shared schema. Engineers grepped by request_id across three services. After the refactor:

  • Unified trace_id propagated from API gateway through the agent worker and tool sidecars.
  • OTLP export to Honeycomb with 10% success sampling and 100% error sampling.
  • Tool observations stored in S3 with SHA-256 digests; traces link via observation_digest.
  • Dashboard: steps-per-task, tool error heatmap, token budget trim rate.
  • “Replay in sandbox” button on any failed trace for on-call.

Outcomes over six weeks: mean time to debug agent failures 47 → 11 minutes; repeat incidents on the same root cause −34%; golden-task suite grew from 42 to 97 scenarios sourced from production traces. Eval pass rate rose 79 → 86% as clusters became test cases rather than anecdotes.
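
The unified trace_id in the refactor has to travel between services somehow. The guide does not say which header format Harbor used; the sketch below assumes the W3C Trace Context `traceparent` header (version, 32-hex trace id, 16-hex parent span id, 2-hex flags), and the helper names are hypothetical.

```python
import re
import secrets

# traceparent layout: version "00", trace id, parent span id, trace flags.
TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace, e.g. at the API gateway."""
    trace_id = secrets.token_hex(16)  # 32 hex characters
    span_id = secrets.token_hex(8)    # 16 hex characters
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def child_traceparent(incoming: str) -> str:
    """Keep the caller's trace id and flags, but mint a new span id."""
    match = TRACEPARENT.match(incoming)
    if match is None:
        return new_traceparent()  # malformed or missing header: start fresh
    trace_id, _parent_span_id, flags = match.groups()
    return f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"


gateway_header = new_traceparent()
worker_header = child_traceparent(gateway_header)   # agent worker
sidecar_header = child_traceparent(worker_header)   # tool sidecar
```

All three headers share the same trace id, which is what lets a backend stitch gateway, worker, and sidecar spans into one trace instead of three unrelated log streams.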

Technique decision table

Need                            | Prefer                                    | Avoid
Debug one failed production run | Full trace + observation fetch by digest  | Grep unstructured chat logs
Fleet health after deploy       | Metrics + sampled traces + eval CI        | Manual spot-check of final answers
Latency regression              | Per-span latency breakdown (LLM vs tool)  | End-to-end wall clock only
Cost attribution                | Token spans per step + tool API counters  | Monthly invoice postmortems
Compliance audit                | Redacted traces + break-glass policy      | Full prompt logging by default
Low-volume internal agent       | 100% trace retention short window         | Heavy APM stack with no agent semantics
High-volume customer agent      | Error-heavy + stratified success sampling | Store every observation verbatim 90 days

Common pitfalls

  • Logging prompts verbatim — compliance incidents; use digests and redaction pipelines.
  • Flat log lines without trace_id — cannot reconstruct multi-service agent runs.
  • Missing tool spans — LLM latency looks fine while external APIs stall.
  • Inconsistent step numbering — breaks replay and eval alignment.
  • Truncation invisible in traces — engineers debug “hallucinations” that were context drops.
  • Metrics without eval definitions — “success” in dashboards disagrees with QA.
  • 100% sampling at scale — cost explosion; trace backend becomes the bottleneck.
  • No link from trace to prompt version — cannot bisect regressions across deploys.

Production checklist

  • Assign one trace_id per agent task; propagate across all services.
  • Nest spans: agent.run → llm.generate / tool.* / policy.*.
  • Record per-step structured envelope with action, observation digest, and latency.
  • Attribute input/output tokens and model id on every LLM span.
  • Store full observations in object storage; traces carry only digests and byte counts.
  • Redact PII and secrets at ingest; hash user identifiers.
  • Sample successes; retain 100% of errors, policy denials, and canary traffic.
  • Export task success, steps-per-task, tool error rate, and cost-per-success metrics.
  • Build dashboards for latency breakdown and post-deploy step-count shifts.
  • Wire “replay in sandbox” from failed traces into CI golden tasks.
  • Tag traces with prompt_template_hash and agent_version.
  • Document break-glass observation access and audit requirements.

Key takeaways

  • Agents need trace semantics, not just chat transcripts — spans expose tool and budget failures.
  • Observation digests + object storage keep traces queryable without storing megabytes per run.
  • Harbor Analytics cut debug time 47 → 11 minutes with OTLP traces and sandbox replay.
  • Bridge traces to eval by promoting failed production clusters into golden tasks.
  • Sample intelligently — full fidelity on errors, statistical coverage on successes.
