
LLM observability explained

A user asks your support bot for a refund policy. The model answers in two seconds, sounds authoritative, and cites “Policy Doc v3.2” — which your knowledge base retired six weeks ago. Traditional APM shows a green latency bar; nothing explains the hallucinated citation. LLM observability is the discipline of making generative AI systems debuggable in production: tracing every step from user query through retrieval, routing, tool calls, and final completion; measuring tokens, cost, and error rates per request; and sampling outputs for quality drift before users post screenshots. It extends general observability with LLM-specific signals — prompt versions, retrieval chunks, model IDs, and faithfulness scores — and pairs with offline evaluation suites and cost dashboards. This guide covers what to instrument, tracing patterns for chains and agents, logging and privacy, RAG pipeline visibility, production quality monitoring, a Harbor Support ops dashboard worked example, a tooling decision table, common pitfalls, and a production checklist.

Why LLM apps need different observability

A REST API either returns 200 or 500. An LLM endpoint almost always returns 200 with text that may be wrong, unsafe, or irrelevant — a semantic failure invisible to HTTP status codes. Three properties make LLM systems harder to operate than typical microservices:

Non-determinism — the same prompt can produce different outputs across temperature settings, model versions, or provider routing changes.
Multi-step pipelines — a single user message may trigger embedding, vector search, reranking, two model calls, a tool invocation, and a guardrail check. Failures hide inside the chain.
Cost scales with content — latency and spend correlate with token count, not just request count. A prompt injection that dumps your entire RAG index is both a security incident and a billing spike.

LLM observability answers operational questions benchmarks cannot: Which retrieval chunks led to this wrong answer? Did latency jump after we swapped embedding models? Are thumbs-down tickets clustering on a specific intent? Without traces and structured logs, you are guessing from aggregate error rates.

The three pillars adapted for LLMs

Map familiar observability pillars to LLM workloads:

Metrics

Latency — end-to-end request time plus per-span breakdown (embed, retrieve, generate, guardrail).
Token counts — input, output, and cached tokens per model call; aggregate daily spend by feature flag or tenant.
Throughput — requests per second, queue depth, rate-limit hits from providers.
Quality proxies — user thumbs up/down rate, escalation to human rate, guardrail block rate, empty-retrieval rate in RAG.
Cache metrics — semantic and prompt cache hit ratio; see semantic caching.
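As a rough illustration of the latency and quality-proxy metrics above, the sketch below computes p50/p95 from raw latency samples and a generic rate such as empty-retrieval rate. The function names are hypothetical; a real deployment would more likely rely on histogram metrics in its monitoring backend.

```python
from statistics import quantiles


def latency_percentiles(samples_ms):
    """Return p50 and p95 for a list of latency samples in milliseconds."""
    if len(samples_ms) < 2:
        raise ValueError("need at least two samples")
    # n=100 yields 99 cut points: index 49 is the median, index 94 is p95.
    cuts = quantiles(samples_ms, n=100, method="inclusive")
    return {"p50": cuts[49], "p95": cuts[94]}


def rate(events, total):
    """Share of requests with a given outcome, e.g. empty retrieval."""
    return events / total if total else 0.0


stats = latency_percentiles(list(range(1, 102)))  # synthetic samples: 1..101 ms
empty_retrieval_rate = rate(events=12, total=400)
```

Applying the same helper per span (embed, retrieve, generate, guardrail) gives the per-step breakdown rather than only the end-to-end figure.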

Logs

Structured JSON logs per request with correlation IDs linking to traces. Log model name, prompt template version, retrieval document IDs, tool names invoked, and completion token count. Never log raw PII or secrets — hash user IDs, redact emails and account numbers before storage, and set retention policies aligned with GDPR and your privacy policy.
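A minimal sketch of such a log record follows, using only the Python standard library. The field names, the sample values, and the pseudonymize helper are assumptions made for illustration; redaction of free-text fields is assumed to happen elsewhere before anything is written.

```python
import hashlib
import hmac
import json
import time


def pseudonymize(user_id: str, secret: bytes) -> str:
    """Keyed hash so logs can be grouped per user without storing the raw ID."""
    return hmac.new(secret, user_id.encode("utf-8"), hashlib.sha256).hexdigest()[:16]


def build_log_record(trace_id, user_id, model, prompt_version, doc_ids, tools,
                     completion_tokens, secret):
    """Assemble one structured record; the schema here is illustrative only."""
    return {
        "ts": time.time(),
        "trace_id": trace_id,  # correlation ID linking the log line to its trace
        "user_hash": pseudonymize(user_id, secret),
        "model": model,
        "prompt_template_version": prompt_version,
        "retrieval_doc_ids": list(doc_ids),
        "tools_invoked": list(tools),
        "completion_tokens": completion_tokens,
    }


record = build_log_record(
    trace_id="4bf92f3577b34da6a3ce929d0e0e4736",
    user_id="user-123",
    model="example-model-v1",
    prompt_version="refund_v7",
    doc_ids=["policy-v3#2"],
    tools=["lookup_order"],
    completion_tokens=182,
    secret=b"rotate-me",
)
log_line = json.dumps(record, sort_keys=True)
print(log_line)
```

A keyed hash is used rather than a plain hash so that someone holding only the logs cannot confirm a guessed user ID without the secret.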

Traces

Distributed traces represent each LLM call, retrieval step, and agent tool use as nested spans. A parent span for handle_support_ticket contains children for embed_query, vector_search, rerank, llm_generate, and output_guardrail. When latency spikes, the waterfall shows whether retrieval or generation regressed. Frameworks like LangChain integrate with LangSmith; OpenTelemetry exporters send traces to backends such as Phoenix, Honeycomb, and Grafana Tempo.
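To make the parent/child structure concrete, here is a toy single-threaded tracer written with the standard library only. It is not the OpenTelemetry API; the Tracer class and its fields are invented for this sketch, and a production system would use a real tracing SDK.

```python
import time
from contextlib import contextmanager


class Tracer:
    """Toy tracer: records nested spans with their parent and duration."""

    def __init__(self):
        self.spans = []   # finished spans, in completion order
        self._stack = []  # spans that are currently open

    @contextmanager
    def span(self, name, **attributes):
        parent = self._stack[-1]["name"] if self._stack else None
        record = {"name": name, "parent": parent, "attributes": attributes}
        self._stack.append(record)
        start = time.perf_counter()
        try:
            yield record
        finally:
            record["duration_ms"] = (time.perf_counter() - start) * 1000
            self._stack.pop()
            self.spans.append(record)


tracer = Tracer()
steps = ("embed_query", "vector_search", "rerank", "llm_generate", "output_guardrail")
with tracer.span("handle_support_ticket"):
    for step in steps:
        with tracer.span(step):
            pass  # the real work for each step would run here

# The slowest child span is the first place to look when latency spikes.
children = [s for s in tracer.spans if s["parent"] == "handle_support_ticket"]
slowest = max(children, key=lambda s: s["duration_ms"])
```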

Tracing chains, agents, and multi-model flows

Instrument at boundaries where behavior branches:

Request ingress — assign trace_id and session_id; attach user segment (free vs paid) as span attributes, not raw identity.
Router / classifier — log chosen model tier, intent label, and confidence. Essential when small models handle easy tickets and frontier models handle escalations.
RAG retrieval — record query embedding model, top-k chunk IDs, similarity scores, and whether reranking changed order. Store chunk text in the trace only if policy allows; otherwise log IDs and fetch on demand during incident review.
LLM generation — span attributes: provider, model ID, temperature, max tokens, prompt hash (not full prompt in high-volume paths), completion length, finish reason (stop vs length).
Tool and agent steps — each tool call is its own span with arguments summary, latency, and success/failure. Agent loops can produce dozens of spans per user message; cap depth and flag runaway loops.
Guardrails — log blocked inputs/outputs with rule ID, not necessarily full toxic content.
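For the generation boundary, one way to attach these attributes without storing the prompt itself is sketched below. The llm.* attribute names are made up for this example and do not follow any particular semantic convention.

```python
import hashlib


def prompt_hash(prompt: str) -> str:
    """Short stable fingerprint so identical prompts can be grouped in traces."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def llm_span_attributes(provider, model_id, temperature, max_tokens,
                        prompt, completion, finish_reason):
    """Build span attributes for one model call; prompt text is never stored."""
    return {
        "llm.provider": provider,
        "llm.model_id": model_id,
        "llm.temperature": temperature,
        "llm.max_tokens": max_tokens,
        "llm.prompt_hash": prompt_hash(prompt),
        "llm.completion_chars": len(completion),
        "llm.finish_reason": finish_reason,
        # "length" means the output hit max_tokens and was cut off.
        "llm.truncated": finish_reason == "length",
    }


attrs = llm_span_attributes(
    provider="example-provider",
    model_id="example-model-v1",
    temperature=0.2,
    max_tokens=512,
    prompt="What is the refund policy?",
    completion="Refunds are available within 30 days.",
    finish_reason="stop",
)
```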

Propagate trace context across async workers and queue consumers so a ticket processed five minutes later still links to the original user session.
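A simple way to picture this propagation is to carry a W3C-style traceparent header inside each queued message. The sketch below uses an in-memory list as the queue and hypothetical enqueue and consume helpers; a tracing SDK's propagators would normally handle this.

```python
import re
import secrets

# traceparent layout: version, 32-hex trace ID, 16-hex parent span ID, 2-hex flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent() -> str:
    """Create a header for a new trace at request ingress (flag 01 = sampled)."""
    return f"00-{secrets.token_hex(16)}-{secrets.token_hex(8)}-01"


def enqueue(queue, payload, traceparent):
    """Producer side: ship the trace context alongside the payload."""
    queue.append({"headers": {"traceparent": traceparent}, "payload": payload})


def consume(queue):
    """Consumer side: continue the original trace with a new child span ID."""
    message = queue.pop(0)
    match = TRACEPARENT_RE.match(message["headers"]["traceparent"])
    if match is None:
        raise ValueError("malformed traceparent header")
    trace_id, parent_span_id, _flags = match.groups()
    return {
        "trace_id": trace_id,
        "parent_span_id": parent_span_id,
        "span_id": secrets.token_hex(8),
        "payload": message["payload"],
    }


queue = []
ingress_header = new_traceparent()
enqueue(queue, {"ticket_id": 42}, ingress_header)
worker_context = consume(queue)  # could run minutes later in another process
```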

RAG-specific visibility

RAG failures are the most common production LLM incidents: empty retrieval, wrong chunks ranked first, stale documents not re-indexed, or the model ignoring context entirely. Instrument these RAG-specific signals:

Retrieval recall@k on sampled queries with known gold documents (offline) plus live empty-hit rate when no chunk exceeds similarity threshold.
Context utilization — did the model cite chunk IDs present in the prompt? Mismatch indicates hallucination or prompt layout issues.
Index freshness — metric for document age at retrieval time; alert when >90% of hits are older than your SLA.
Chunk overlap and duplication — high duplicate scores in top-k waste context window and confuse models.
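Three of the signals above can be sketched in a few lines. The function names and data shapes are assumptions for illustration: retrieved and gold documents are plain ID lists, and each query contributes its best similarity score, or None when nothing came back.

```python
def recall_at_k(retrieved_ids, gold_ids, k):
    """Fraction of known-relevant documents that appear in the top k results."""
    gold = set(gold_ids)
    if not gold:
        raise ValueError("gold_ids must not be empty")
    return len(set(retrieved_ids[:k]) & gold) / len(gold)


def empty_hit_rate(best_scores, threshold):
    """Share of queries where no chunk exceeded the similarity threshold."""
    if not best_scores:
        return 0.0
    empty = sum(1 for s in best_scores if s is None or s <= threshold)
    return empty / len(best_scores)


def unknown_citations(cited_ids, prompt_chunk_ids):
    """Chunk IDs the model cited that were never placed in the prompt."""
    return sorted(set(cited_ids) - set(prompt_chunk_ids))


r = recall_at_k(["a", "b", "c", "d"], gold_ids=["b", "d", "z"], k=3)
e = empty_hit_rate([0.9, 0.2, None, 0.8], threshold=0.5)
u = unknown_citations(["c1", "c9"], prompt_chunk_ids=["c1", "c2"])
```

A non-empty result from unknown_citations is the mismatch described under context utilization: the answer points at a source the model was never shown.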

Build a “RAG replay” tool: given a trace_id, re-run retrieval with current index and diff results. Essential after re-embedding or index migrations.
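The core of such a replay tool is a diff between the chunk IDs stored in the trace and the IDs the current index returns. The sketch below assumes a stored trace shaped as a dict with query and chunk_ids keys and a retrieve callable; both are hypothetical stand-ins for your own trace store and retriever.

```python
def diff_retrieval(original_ids, replayed_ids):
    """Compare two ranked lists of chunk IDs."""
    original, replayed = list(original_ids), list(replayed_ids)
    return {
        "dropped": [c for c in original if c not in replayed],
        "added": [c for c in replayed if c not in original],
        # Maps chunk ID to (old rank, new rank) for chunks that moved.
        "rank_changes": {
            c: (original.index(c), replayed.index(c))
            for c in original
            if c in replayed and original.index(c) != replayed.index(c)
        },
        "top1_changed": original[:1] != replayed[:1],
    }


def replay(trace, retrieve):
    """Re-run retrieval for a stored trace against the current index."""
    return diff_retrieval(trace["chunk_ids"], retrieve(trace["query"]))


stored_trace = {
    "query": "refund policy",
    "chunk_ids": ["policy-v2#1", "policy-v2#2", "faq#7"],
}


def fake_retrieve(query):
    # Stand-in for a call to the live vector index.
    return ["policy-v3#1", "faq#7", "policy-v2#1"]


report = replay(stored_trace, fake_retrieve)
```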

Production quality monitoring

Offline benchmarks catch regressions before deploy; production monitoring catches drift benchmarks miss — new user phrasing, seasonal ticket types, provider silent model updates.

Human feedback loops

Collect feedback through thumbs up/down buttons, “was this helpful?” prompts, and explicit escalation to human agents. Tag every feedback event with its trace_id for root-cause analysis. Expect sparse feedback: even a 0.5% response rate surfaces clusters at scale.

Automated sampling and LLM-as-judge

Sample 1–5% of production traffic for async evaluation: faithfulness to retrieved context, toxicity, PII leakage, and task completion. Use a smaller judge model to score outputs against rubrics. Alert when weekly rolling averages drop below thresholds. Keep judge prompts versioned and re-baseline after judge model changes.
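One possible way to pick the sample is to hash the trace_id, so the decision is deterministic and every service agrees on which traces are evaluated. The sketch below also includes a simple threshold check on a window of judge scores; the function names and the 0.85 threshold are illustrative assumptions.

```python
import hashlib


def in_eval_sample(trace_id: str, sample_rate: float) -> bool:
    """Deterministically select roughly `sample_rate` of traces for evaluation."""
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # value in [0, 1)
    return bucket < sample_rate


def below_threshold(scores, threshold):
    """True when the mean judge score over the window is under the threshold."""
    return bool(scores) and sum(scores) / len(scores) < threshold


selected = [f"trace-{i}" for i in range(20000) if in_eval_sample(f"trace-{i}", 0.05)]
should_alert = below_threshold([0.91, 0.78, 0.80, 0.76], threshold=0.85)
```

Because selection depends only on the trace_id, re-running the evaluation job picks the same traces, which keeps week-over-week comparisons stable.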

Canary and shadow traffic

Route a fraction of requests to a candidate prompt or model; compare quality metrics and cost before full cutover. Shadow mode runs the candidate without returning its output to users — useful for high-stakes flows.
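The two patterns can be sketched as follows, assuming baseline and candidate are plain callables that take a request and return an answer. The helper names are hypothetical, and a real shadow path would usually run the candidate asynchronously so it adds no user-facing latency.

```python
import hashlib


def assign_variant(session_id: str, canary_fraction: float) -> str:
    """Canary: route a stable fraction of sessions to the candidate."""
    digest = hashlib.sha256(session_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    return "candidate" if bucket < canary_fraction else "baseline"


def handle_with_shadow(request, baseline, candidate, shadow_log):
    """Shadow: always answer from baseline; record the candidate for comparison."""
    answer = baseline(request)
    entry = {"request": request, "baseline": answer}
    try:
        entry["candidate"] = candidate(request)
    except Exception as exc:  # a failing candidate must never affect the user
        entry["candidate_error"] = repr(exc)
    shadow_log.append(entry)
    return answer


shadow_log = []
reply = handle_with_shadow(
    "refund status?",
    baseline=lambda r: "baseline answer",
    candidate=lambda r: "candidate answer",
    shadow_log=shadow_log,
)
```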

Privacy, retention, and compliance

Full prompt logging is a liability. Practices that balance debuggability and compliance:

Tiered storage — hot traces (7 days) with full detail for on-call; warm archive (30 days) with redacted prompts; cold aggregates only after that.
Field-level redaction — regex and NER pipelines strip emails, phone numbers, credit cards, and API keys before write.
Opt-in debug mode — power users consent to verbose logging for support sessions.
Access controls — trace viewers require SSO and audit logs; production prompts are not a shared Slack dump.
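As an illustration of the regex half of field-level redaction, the sketch below replaces a few common patterns with placeholders before a log write. The patterns are deliberately simple and will miss cases or over-match; the sk- key prefix is an assumed format, and the NER stage mentioned above is not shown.

```python
import re

# Order matters: keys and card numbers are replaced before the broader phone pattern.
PATTERNS = [
    ("EMAIL", re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")),
    ("API_KEY", re.compile(r"\bsk-[A-Za-z0-9]{16,}\b")),  # assumed key format
    ("CARD", re.compile(r"\b\d(?:[ -]?\d){12,15}\b")),    # 13 to 16 digits
    ("PHONE", re.compile(r"\+?\d[\d\s().-]{7,}\d")),
]


def redact(text: str) -> str:
    """Replace each matched span with a bracketed placeholder."""
    for label, pattern in PATTERNS:
        text = pattern.sub(f"[{label}]", text)
    return text


raw = ("Contact jane.doe@example.com or +1 (555) 010-9999, "
       "card 4111 1111 1111 1111, key sk-abcdefghijklmnop1234")
clean = redact(raw)
print(clean)
```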

Document your retention periods in your privacy policy. Observability data counts as personal data whenever prompts contain user messages.

Worked example: Harbor Support ops dashboard

Harbor Support is a fictional B2B helpdesk with a RAG chatbot and human escalation queue. Its observability stack, built out after a bad deploy in which refund answers started citing deprecated policies, includes:

OpenTelemetry SDK in the API gateway emitting spans to Grafana Tempo; Prometheus scrapes token and latency histograms.
LangSmith project for LangChain chains — developers replay failing traces with one click during incidents.
Custom metrics: rag_empty_retrieval_total, guardrail_block_total, escalation_total, tokens_spent_usd by tenant.
Weekly sample eval — 2,000 random traces scored for faithfulness; results stored in BigQuery joined to document versions.
Alert rules — PagerDuty if p95 latency > 8s for 10 min, faithfulness score drops > 8% week-over-week, or daily token spend exceeds budget by 20%.
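The three alert conditions can be expressed as small predicates. This is a sketch under assumptions: latency arrives as one p95 value per minute, the 8% faithfulness drop is read as a relative change, and the function names are invented. In the stack described above, these rules would more plausibly live in the alerting system's own configuration.

```python
def latency_alert(p95_by_minute_s, threshold_s=8.0, window_min=10):
    """True when p95 exceeded the threshold for each of the last 10 minutes."""
    recent = p95_by_minute_s[-window_min:]
    return len(recent) == window_min and all(v > threshold_s for v in recent)


def faithfulness_alert(last_week, this_week, max_drop=0.08):
    """True when the score fell by more than 8% relative to last week."""
    if last_week <= 0:
        return False
    return (last_week - this_week) / last_week > max_drop


def spend_alert(daily_spend_usd, daily_budget_usd, overrun=0.20):
    """True when daily token spend exceeds budget by more than 20%."""
    return daily_spend_usd > daily_budget_usd * (1 + overrun)


page_on_call = any([
    latency_alert([9.1] * 10),
    faithfulness_alert(last_week=0.90, this_week=0.81),
    spend_alert(daily_spend_usd=110.0, daily_budget_usd=100.0),
])
```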

Incident postmortem: thumbs-down on “refund” intent spiked 340%. Trace sample showed retrieval returning Policy v2 chunks because the vector index was not rebuilt after CMS publish. Fix: CI job re-indexes on content webhook; new metric max_retrieved_doc_age_hours alerts when stale. Mean time to detect dropped from days (user reports) to 22 minutes (alert).
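A metric like max_retrieved_doc_age_hours could be computed per request as sketched below. The sketch assumes each retrieved chunk carries a timezone-aware indexed_at timestamp, which the postmortem does not specify; the 72-hour limit is likewise an invented example value.

```python
from datetime import datetime, timezone


def max_retrieved_doc_age_hours(indexed_at_list, now=None):
    """Age in hours of the oldest chunk among those retrieved for one request."""
    if not indexed_at_list:
        return 0.0
    now = now or datetime.now(timezone.utc)
    return max((now - ts).total_seconds() for ts in indexed_at_list) / 3600


now = datetime(2024, 1, 10, tzinfo=timezone.utc)
retrieved = [
    datetime(2024, 1, 9, 12, tzinfo=timezone.utc),  # indexed 12 hours ago
    datetime(2024, 1, 1, tzinfo=timezone.utc),      # indexed 9 days ago
]
age_hours = max_retrieved_doc_age_hours(retrieved, now=now)
is_stale = age_hours > 72  # hypothetical freshness limit
```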

Tooling decision table

Need	Approach	Trade-off
Fast start with LangChain apps	LangSmith tracing + datasets	Vendor lock-in; easy replay and eval UI
Unified metrics/logs/traces with existing Grafana stack	OpenTelemetry + Prometheus + Tempo/Loki	More setup; full control and no per-trace SaaS fees at scale
Open-source LLM trace UI	Arize Phoenix, Langfuse, Helicone	Self-host or cloud; compare feature sets for agent support
Provider-native dashboards	OpenAI Usage, Anthropic Console, Bedrock metrics	Token/cost only; no RAG or agent step visibility
Production quality scoring	Sampled LLM-as-judge pipeline + human review queue	Judge cost and bias; calibrate against human labels quarterly
Cost attribution by feature	Custom span attributes (`feature_id`, `tenant_id`) aggregated in warehouse	Requires discipline at every call site
Security and abuse detection	Guardrail spans + anomaly alerts on token spikes per session	False positives on long legitimate threads; tune thresholds
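For the cost attribution row, the aggregation step might look like the sketch below. The model names and per-token prices are placeholders, not real provider pricing, and spans are assumed to be dicts already carrying feature_id and tenant_id.

```python
from collections import defaultdict

# Placeholder prices in USD per 1,000 tokens; substitute your provider's rates.
PRICE_PER_1K = {
    "small-model": {"input": 0.0005, "output": 0.0015},
    "frontier-model": {"input": 0.01, "output": 0.03},
}


def span_cost_usd(span):
    """Cost of one LLM span from its token counts."""
    price = PRICE_PER_1K[span["model"]]
    return (span["input_tokens"] / 1000 * price["input"]
            + span["output_tokens"] / 1000 * price["output"])


def spend_by(spans, *keys):
    """Sum cost grouped by the given span attributes."""
    totals = defaultdict(float)
    for span in spans:
        totals[tuple(span[k] for k in keys)] += span_cost_usd(span)
    return dict(totals)


spans = [
    {"feature_id": "refund_bot", "tenant_id": "acme", "model": "frontier-model",
     "input_tokens": 2000, "output_tokens": 500},
    {"feature_id": "refund_bot", "tenant_id": "acme", "model": "small-model",
     "input_tokens": 1000, "output_tokens": 1000},
    {"feature_id": "search", "tenant_id": "globex", "model": "small-model",
     "input_tokens": 4000, "output_tokens": 0},
]
by_feature_tenant = spend_by(spans, "feature_id", "tenant_id")
```

A span missing either attribute raises a KeyError here, which is one blunt way to surface call sites that skipped the tagging discipline the table warns about.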

Common pitfalls

Logging full prompts at scale — storage cost, PII exposure, and unusable noise. Hash prompts and sample aggressively.
Metrics without trace linkage — knowing error rate rose tells you nothing without trace_id on user feedback.
Ignoring embedding and retrieval spans — blaming the LLM when retrieval returned garbage.
No model version in spans — provider silent upgrades look like mysterious quality drift.
Alert fatigue on latency only — semantic quality alerts matter as much as p99 milliseconds.
Eval sampling too low — 0.01% misses rare but high-impact failure modes (legal, medical, billing).
Retention forever — compliance risk and query performance decay.
Observability as afterthought — retrofitting traces into agent loops costs more than instrumenting from day one.

Practitioner checklist

Assign trace_id at API ingress; propagate through queues and workers.
Instrument spans for embed, retrieve, rerank, generate, tool call, and guardrail.
Record model ID, provider, token counts, and prompt template version on every LLM span.
Redact PII before log write; define retention tiers and document in privacy policy.
Dashboard p50/p95 latency, token spend, error rate, and cache hit ratio.
Track RAG empty-retrieval rate and max document age at hit time.
Wire thumbs-down and escalations to trace_id for replay.
Sample 1–5% of traffic for automated faithfulness and safety scoring.
Alert on quality metric regression, not just latency and 5xx errors.
Build RAG replay tooling for index and embedding migrations.
Run incident drills: can on-call find the retrieval chunks for a reported bad answer in under 10 minutes?

Key takeaways

LLM observability adds semantic failure detection to traditional metrics — wrong answers return HTTP 200.
Traces through retrieval, routing, generation, and tools are essential for debugging multi-step pipelines.
Token and cost metrics must be first-class, attributed by feature and tenant.
Production quality monitoring combines human feedback, sampled evals, and canary deploys.
Privacy-aware logging — redact, retain minimally, and control access — is non-negotiable for customer-facing AI.