
Observability explained: metrics, logs, and distributed tracing

Observability is the ability to understand what a running system is doing from the data it emits — without redeploying code or guessing from a single dashboard. Production services fail in ways you did not anticipate; observability gives you enough context to ask new questions after an incident. The modern toolkit rests on three pillars: metrics (aggregated numbers over time), logs (discrete events with detail), and traces (end-to-end request paths across services). This guide explains how each pillar works, when to use which, how they connect through correlation IDs, and what to instrument first so alerts mean something instead of waking you at 3 a.m. for noise.

Observability vs monitoring

Monitoring watches known failure modes: CPU above 90%, error rate above 1%, disk nearly full. You decide what to measure upfront and alert when thresholds break. That works until a novel bug appears — a slow dependency, a partial outage, a race condition that only shows up under load.

Observability assumes you cannot enumerate every failure in advance. Instrumentation should be rich enough that, when something looks wrong, you can slice data by user, endpoint, region, or deployment version and follow a request from browser to database. Monitoring tells you that the site is red; observability helps you learn why and where.

In practice most teams blend both: SLO-based alerts on golden signals, plus deep logs and traces for investigation. The goal is not more dashboards — it is faster mean time to recovery (MTTR) and fewer blind spots.

The three pillars in brief

Metrics

Metrics are numeric time series: request count, latency percentiles, queue depth, RPC error rate. They are cheap to store and query at scale, which makes them ideal for dashboards and automated alerts. You lose per-request detail — a spike in p99 latency tells you something slowed down, not which user or query caused it.

Logs

Logs record individual events: "payment verified", "RPC returned 429", "database connection refused". Unstructured printf-style logs are hard to search; structured logging (JSON with consistent field names) lets you filter status=500 AND route=/api/settle in seconds. Logs carry context metrics cannot — stack traces, payload sizes, upstream error bodies — but volume and cost grow quickly without sampling and retention policies.

Traces

A distributed trace follows one logical operation (an HTTP request, a webhook delivery, a blockchain settlement) as it hops through services, queues, and databases. Each step is a span with a start time, a duration, and metadata. Traces reveal hidden latency: a 200 ms API response might include 180 ms spent waiting on a downstream RPC. Tools like Jaeger, Zipkin, and cloud APMs visualize trace trees; OpenTelemetry has become the de facto standard for emitting them.
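
To make the latency point concrete, here is a minimal sketch of a span tree. The Span class, its field names, and the span names are hypothetical and only for illustration; real tracing SDKs record far more, and the self-time calculation assumes child spans do not overlap.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Span:
    """One step in a trace. Times are milliseconds since the trace started."""
    name: str
    start_ms: float
    end_ms: float
    children: List["Span"] = field(default_factory=list)

    @property
    def duration_ms(self) -> float:
        return self.end_ms - self.start_ms

    def self_time_ms(self) -> float:
        # Time spent in this span itself, excluding its direct children.
        # Assumption: children run one after another and never overlap.
        return self.duration_ms - sum(c.duration_ms for c in self.children)


def print_tree(span: Span, depth: int = 0) -> None:
    """Print the span tree with total and self time per span."""
    indent = "  " * depth
    print(f"{indent}{span.name}: {span.duration_ms:.0f} ms total, "
          f"{span.self_time_ms():.0f} ms self")
    for child in span.children:
        print_tree(child, depth + 1)


# A 200 ms API request that spends 180 ms waiting on a downstream RPC.
rpc = Span("rpc.get_balance", start_ms=10, end_ms=190)
api = Span("POST /settle", start_ms=0, end_ms=200, children=[rpc])
print_tree(api)
```

Read this way, the API handler accounts for only 20 ms of its own work; the rest belongs to the dependency.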

Metrics that actually matter: RED and USE

Instrument everything and you drown in graphs. Two frameworks keep metrics focused:

RED — for request-driven services

Rate — requests per second (traffic volume)
Errors — failed requests as a fraction of total (5xx, timeouts, business failures)
Duration — latency distribution (p50, p95, p99 — averages lie)

Apply RED per route or per dependency: GET /health, POST /settle, solana.getBalance. When error rate climbs on one route while others stay flat, you know where to look without reading every log line.
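
As a rough illustration of RED per route, the sketch below computes rate, error ratio, and latency percentiles from raw request records. The Request shape, the 5xx-only error rule, and the sample data are assumptions; in production these numbers normally come from a metrics backend rather than from in-memory lists.

```python
import math
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Request:
    route: str
    status: int
    duration_ms: float


def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def red_summary(requests: List[Request], window_seconds: float) -> Dict[str, dict]:
    """Group requests by route and compute Rate, Errors, and Duration for each."""
    by_route: Dict[str, List[Request]] = defaultdict(list)
    for req in requests:
        by_route[req.route].append(req)

    summary = {}
    for route, reqs in by_route.items():
        durations = [r.duration_ms for r in reqs]
        errors = sum(1 for r in reqs if r.status >= 500)  # assumption: only 5xx counts as failure
        summary[route] = {
            "rate_per_s": len(reqs) / window_seconds,
            "error_ratio": errors / len(reqs),
            "p50_ms": percentile(durations, 50),
            "p99_ms": percentile(durations, 99),
            "mean_ms": sum(durations) / len(durations),
        }
    return summary


# 98 fast successes and 2 slow failures on one route within a 60-second window.
sample = [Request("POST /settle", 200, 20.0) for _ in range(98)]
sample += [Request("POST /settle", 500, 2000.0) for _ in range(2)]
for route, stats in red_summary(sample, window_seconds=60).items():
    print(route, stats)
```

In this sample the mean is about 60 ms while p99 is 2000 ms, which is why the Duration line asks for percentiles rather than averages.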

USE — for resources (CPU, memory, disks, connections)

Utilization — percent of capacity used
Saturation — work waiting (queue length, thread pool backlog)
Errors — hardware or driver-level failures

High utilization with low saturation is fine; high saturation with rising latency means you are out of headroom even if CPU looks only moderately busy — a common pattern with I/O-bound RPC clients.
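
The distinction between utilization and saturation can be sketched for a fixed-size worker or connection pool. PoolSnapshot, its fields, and the 0.8 cutoff are hypothetical choices made for this example, not values from any particular tool.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Point-in-time view of a fixed-size worker or connection pool (hypothetical shape)."""
    capacity: int  # total workers or connections
    busy: int      # slots currently in use
    queued: int    # tasks waiting for a free slot
    errors: int    # resource-level failures since the previous snapshot

    @property
    def utilization(self) -> float:
        # Fraction of capacity in use.
        return self.busy / self.capacity

    @property
    def saturation(self) -> int:
        # Work that is waiting rather than running.
        return self.queued


def assess(snapshot: PoolSnapshot) -> str:
    """Classify a snapshot; errors take priority, then queued work, then utilization."""
    if snapshot.errors > 0:
        return "errors"
    if snapshot.saturation > 0:
        return "saturated"
    if snapshot.utilization >= 0.8:  # arbitrary cutoff for illustration
        return "busy"
    return "ok"


print(assess(PoolSnapshot(capacity=10, busy=9, queued=0, errors=0)))
print(assess(PoolSnapshot(capacity=10, busy=10, queued=25, errors=0)))
```

The second snapshot is the I/O-bound case described above: every connection is taken and work is queueing, regardless of what host CPU shows.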

Metric types in Prometheus-style systems

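Counters only go up (total requests); they restart from zero only when the process restarts, which rate-style queries are designed to tolerate. Gauges go up and down (memory used, open connections). Histograms count observations into buckets so percentiles can be estimated at query time (request duration). Pick the right type: never reset a counter manually; use a gauge for values that decrease.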
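
The three types can be sketched with the standard library alone. These classes are simplified stand-ins written for this guide, not the API of any Prometheus client library; a real service would use an established client instead.

```python
import bisect
from typing import List


class Counter:
    """Monotonically increasing total."""

    def __init__(self) -> None:
        self.value = 0.0

    def inc(self, amount: float = 1.0) -> None:
        if amount < 0:
            raise ValueError("counters only go up; use a Gauge for values that decrease")
        self.value += amount


class Gauge:
    """A value that can rise and fall."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value

    def inc(self, amount: float = 1.0) -> None:
        self.value += amount

    def dec(self, amount: float = 1.0) -> None:
        self.value -= amount


class Histogram:
    """Counts observations into buckets keyed by upper bound ("less than or equal")."""

    def __init__(self, upper_bounds: List[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.bucket_counts = [0] * (len(self.upper_bounds) + 1)  # last slot is +Inf
        self.count = 0
        self.sum = 0.0

    def observe(self, value: float) -> None:
        # bisect_left returns the index of the first bound >= value.
        idx = bisect.bisect_left(self.upper_bounds, value)
        self.bucket_counts[idx] += 1
        self.count += 1
        self.sum += value

    def cumulative(self) -> List[int]:
        """Running totals per bucket, the form in which Prometheus-style histograms are exposed."""
        total, out = 0, []
        for bucket in self.bucket_counts:
            total += bucket
            out.append(total)
        return out


latency = Histogram([0.1, 0.5, 1.0])  # bucket bounds in seconds
for seconds in (0.05, 0.1, 0.3, 2.0):
    latency.observe(seconds)
print(latency.cumulative())
```

Bucket bounds decide how precisely percentiles can be estimated, so they are worth choosing around the latency targets you care about.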

Structured logging done right

Every log line should answer: what happened, when, where, and for whom? Standard fields help:

timestamp — ISO 8601 in UTC
level — debug, info, warn, error (reserve error for actionable problems)
service — which binary or container emitted the line
trace_id / request_id — ties logs to traces and other services
message — human-readable summary
Domain fields — user_id, order_id, rpc_endpoint, latency_ms

Log at boundaries: request received, external call started/finished, business state transition, unrecoverable failure. Avoid logging full credit card numbers, private keys, or raw JWTs — redact secrets and hash identifiers where regulation requires it.
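
A minimal sketch of these fields using Python's standard logging module follows. The "fields" key passed through extra, the service name, and the list of sensitive keys are conventions invented for this example. Python names its level "warning" rather than "warn", so the emitted level follows that spelling.

```python
import json
import logging
from datetime import datetime, timezone

# Illustrative list only; extend it to match your own payloads.
SENSITIVE_KEYS = {"authorization", "api_key", "private_key", "card_number", "jwt"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with consistent field names."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname.lower(),
            "service": self.service,
            "message": record.getMessage(),
        }
        # Domain fields arrive via logging's `extra=` argument under a "fields" key
        # (a convention assumed for this sketch).
        for key, value in getattr(record, "fields", {}).items():
            entry[key] = "[REDACTED]" if key.lower() in SENSITIVE_KEYS else value
        return json.dumps(entry, default=str)


def build_logger(service: str) -> logging.Logger:
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("settlement-api")  # hypothetical service name
log.info(
    "payment verified",
    extra={"fields": {"trace_id": "abc123", "order_id": "o-42",
                      "latency_ms": 87, "api_key": "not-a-real-key"}},
)
```

Redacting by key name is a coarse safety net; it does not catch secrets embedded in free-text messages, so those still need care at the call site.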

Pair logging with sensible defaults: info in production, debug only when troubleshooting (feature-flagged or per-request). Ship logs to a centralized store (Loki, Elasticsearch, CloudWatch) with retention tiers — hot storage for seven days of investigation, cold archive for compliance if needed.

Distributed tracing and correlation

When a user clicks "Pay" and nothing settles, the failure might live in the browser, your API, a queue consumer, or an RPC node. Tracing connects those hops.

Spans and context propagation

The entry service creates a root span and generates a trace ID. Outbound HTTP calls carry that ID (commonly in the W3C Trace Context traceparent header, or a vendor equivalent) so downstream services can create child spans under the same trace. Async work such as publishing to Kafka or enqueueing a webhook usually needs the context copied into message headers or the payload explicitly; otherwise the trace breaks at the first queue.
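
The propagation step can be sketched by hand for version 00 of the W3C traceparent header, which has the form 00-<32 hex trace ID>-<16 hex parent span ID>-<2 hex flags>. The function names here are hypothetical, and an OpenTelemetry SDK would normally handle this for you; the sketch only shows what travels on the wire.

```python
import re
import secrets
from typing import Optional, Tuple

# Version "00": 32 lowercase hex chars of trace ID, 16 of parent span ID, 2 of flags.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_traceparent(sampled: bool = True) -> str:
    """Start a new trace at the entry service."""
    trace_id = secrets.token_hex(16)
    span_id = secrets.token_hex(8)
    return f"00-{trace_id}-{span_id}-{'01' if sampled else '00'}"


def parse_traceparent(header: str) -> Optional[Tuple[str, str, str]]:
    """Return (trace_id, parent_span_id, flags), or None if the header is not valid."""
    match = _TRACEPARENT.match(header.strip())
    if not match:
        return None
    trace_id, parent_id, flags = match.groups()
    if trace_id == "0" * 32 or parent_id == "0" * 16:
        return None  # all-zero IDs are invalid
    return trace_id, parent_id, flags


def outgoing_headers(incoming: dict) -> dict:
    """Continue the caller's trace if a valid header arrived, otherwise start a new one."""
    parsed = parse_traceparent(incoming.get("traceparent", ""))
    if parsed is None:
        return {"traceparent": new_traceparent()}
    trace_id, _parent_id, flags = parsed
    # Same trace ID, fresh span ID for this hop.
    return {"traceparent": f"00-{trace_id}-{secrets.token_hex(8)}-{flags}"}


root = new_traceparent()
print(root)
print(outgoing_headers({"traceparent": root})["traceparent"])
```

The same dictionary can be attached to a queue message's headers, which is the explicit step that keeps a trace intact across an async boundary.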

OpenTelemetry (OTel)

OpenTelemetry provides vendor-neutral SDKs for traces, metrics, and logs in one pipeline. Instrument your app once, export to Jaeger, Datadog, Honeycomb, or Grafana Tempo by changing the collector config. Auto-instrumentation covers common HTTP and database libraries; custom spans mark business-critical sections ("verify_on_chain_payment").

Correlation IDs for simpler stacks

Not every project needs full tracing on day one. A lightweight correlation ID — a UUID generated at the edge and passed through headers, log fields, and error responses — already links logs across two or three services. Upgrade to spans when cross-service latency becomes opaque.
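
A lightweight version of this can be sketched with contextvars and a logging filter. The X-Correlation-ID header name and the helper names are assumptions; use whatever header your edge already sets.

```python
import contextvars
import logging
import uuid

HEADER = "X-Correlation-ID"  # assumed header name
_correlation_id = contextvars.ContextVar("correlation_id", default="-")


def begin_request(headers: dict) -> str:
    """Reuse the caller's ID if one arrived, otherwise mint a UUID at the edge."""
    cid = headers.get(HEADER) or str(uuid.uuid4())
    _correlation_id.set(cid)
    return cid


def downstream_headers() -> dict:
    """Headers to attach to outbound calls so the next service logs the same ID."""
    return {HEADER: _correlation_id.get()}


class CorrelationFilter(logging.Filter):
    """Stamp every log record with the current correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = _correlation_id.get()
        return True


handler = logging.StreamHandler()
handler.addFilter(CorrelationFilter())
handler.setFormatter(logging.Formatter("%(levelname)s [%(correlation_id)s] %(message)s"))
logger = logging.getLogger("edge")
logger.handlers = [handler]
logger.setLevel(logging.INFO)
logger.propagate = False

begin_request({})  # no inbound ID, so a new one is generated
logger.info("request received")
print(downstream_headers())
```

Because the ID lives in a context variable, concurrent requests handled in separate asyncio tasks each see their own value without passing it through every function signature; threads start with the default and need begin_request called inside them.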

SLOs, alerts, and avoiding pager fatigue

An alert should mean "a user or business outcome is at risk, and a human should act now." Define Service Level Objectives (SLOs) from the user perspective: "99.9% of payment verifications complete in under 2 seconds over a 30-day window."

Alert on error budget burn — how fast you are consuming the allowed 0.1% failure slice — not on every CPU blip. Symptom-based alerts (checkout success rate dropped) outperform cause-based alerts (pod restarted) because they catch problems your runbook did not anticipate.
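
Burn rate is the observed failure ratio divided by the budget the SLO allows, so a burn rate of 1 spends the budget exactly over the SLO window. A small sketch follows, using the 99.9% target from the example above; the function names and sample counts are illustrative.

```python
def error_budget(slo_target: float) -> float:
    """Allowed failure fraction, e.g. 0.001 for a 99.9% SLO."""
    return 1.0 - slo_target


def burn_rate(bad_events: int, total_events: int, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' failures are occurring."""
    if total_events == 0:
        return 0.0
    return (bad_events / total_events) / error_budget(slo_target)


def hours_until_exhausted(rate: float, window_days: int = 30,
                          budget_remaining: float = 1.0) -> float:
    """Hours until the remaining budget is gone if the current burn rate holds."""
    if rate <= 0:
        return float("inf")
    return window_days * 24 * budget_remaining / rate


# 50 of the last 10,000 payment verifications were slow or failed.
rate = burn_rate(bad_events=50, total_events=10_000, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")
print(f"budget gone in about {hours_until_exhausted(rate):.0f} hours")
```

Here a 0.5% failure ratio against a 0.1% budget is a burn rate of about 5, which would exhaust a full 30-day budget in roughly 144 hours. Which burn rates should page a human, and over what lookback windows, is a policy choice this sketch leaves open.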

Layer defenses: metrics trigger investigation, logs provide evidence, traces show the critical path. Runbooks linked from alert messages should say what dashboard to open and which log query to run — future you at 3 a.m. will thank present you.

What to instrument first

Start with the paths that move money or data users care about, then expand:

Edge HTTP — RED metrics per route, request ID in access logs
External dependencies — latency and error rate per RPC provider, database, payment API; log fallback events when you switch endpoints
Queues and workers — consumer lag, processing duration, dead-letter count
Business events — orders created, payments confirmed, payouts sent (counters you can reconcile against ledger)

For blockchain-adjacent services, treat RPC like any other dependency: track 429 rate limits, endpoint failover, confirmation latency by commitment level. Our Solana RPC endpoints guide covers health-check patterns that belong in the same dashboards as your application metrics.

When dependencies fail repeatedly, pair observability with resilience patterns — timeouts, bulkheads, and circuit breakers — described in our circuit breakers explainer. Metrics should show when breakers open so you know protection kicked in rather than silently dropping traffic.

Common mistakes

Alerting on averages — p99 latency can triple while the mean looks fine.
High-cardinality labels on metrics — never put user IDs or wallet addresses as Prometheus labels; use logs or traces for that detail.
Logging secrets — one pasted API key in a log stream becomes a permanent leak.
Broken trace context across async boundaries — the most common reason traces stop at the message queue.
Dashboards without ownership — every chart should answer a question someone actually asks during incidents.
Ignoring cost — storing logs and traces can cost more than the compute they describe; sample traces in steady state and keep full capture on errors.
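
The sampling advice in the last item can be sketched as a keep-or-drop decision made once a trace has finished, since only then is it known whether an error occurred. The function name and the 5% default rate are assumptions; collectors that support tail-based sampling offer this kind of policy as configuration.

```python
import hashlib


def keep_trace(trace_id: str, had_error: bool, sample_rate: float = 0.05) -> bool:
    """Keep every trace that contained an error, plus a fixed fraction of the rest."""
    if had_error:
        return True
    # Hash the trace ID so every service reaches the same decision for the same trace.
    digest = hashlib.sha256(trace_id.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return bucket < sample_rate


trace_ids = [f"trace-{i}" for i in range(10_000)]
kept = sum(keep_trace(t, had_error=False) for t in trace_ids)
print(f"kept {kept} of {len(trace_ids)} healthy traces")
print(keep_trace("trace-1", had_error=True))
```

Hashing instead of calling a random number generator keeps the decision deterministic, so a trace is never half-kept across services.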

Key takeaways

Observability lets you debug unknown failures; monitoring watches known thresholds.
Use metrics for alerts and trends, logs for evidence, traces for cross-service latency.
RED fits request services; USE fits infrastructure resources.
Structured logs with correlation IDs are the highest-leverage first step.
Alert on SLOs and user-visible symptoms, not every infrastructure twitch.
OpenTelemetry unifies instrumentation; start at the edges and money paths, then go deeper.