Distributed tracing and OpenTelemetry explained

Checkout latency jumped from 200 ms to 2.4 seconds. Your API gateway dashboard looks healthy. Each downstream service reports normal p95. The bottleneck is invisible because no single service owns the full path: the delay lives in the gaps between calls. Distributed tracing records every hop a request takes across your stack as a tree of timed spans, so you can see exactly which database query, cache miss, or RPC added the extra 2.2 seconds. OpenTelemetry (OTel) is the vendor-neutral standard that lets you instrument code once and export traces (plus metrics and logs) to Jaeger, Grafana Tempo, Datadog, Honeycomb, or any OTLP-compatible backend.

This guide covers traces and spans, W3C context propagation, manual vs auto-instrumentation, the collector pipeline, sampling strategies, pairing traces with structured logs, a checkout latency worked example, a decision table of tracing approaches, common pitfalls, and a production checklist. It goes deep on traces alone; for how tracing fits the broader observability stack, start with a general observability overview.

Traces, spans, and the request waterfall

A trace represents one logical operation end to end: an HTTP request, a Kafka message handler, a cron job. It has a unique trace_id (128 bits, written as 32 lowercase hex characters) shared by every span in the tree.

A span is a single unit of work within that trace: an inbound HTTP handler, a PostgreSQL query, an outbound gRPC call to the inventory service. Each span records:

  • span_id — unique within the trace.
  • parent_span_id — links child spans into a tree (root span has no parent).
  • name — operation identifier, e.g. GET /checkout or db.query.
  • start_time / end_time — wall-clock duration of the operation.
  • status — OK, ERROR, or UNSET; error spans should set status and record the exception.
  • attributes — key-value metadata: http.method, db.statement (hashed), user.tier.
  • events — timestamped annotations within a span (e.g. "cache miss").

Visualized as a Gantt chart — the "waterfall" — spans reveal parallelism, serial bottlenecks, and unexpected fan-out. In a microservices checkout, you might see the payment service waiting 900 ms on inventory while fraud scoring finished in 40 ms. Metrics alone cannot show that relationship; logs can, but only if you manually correlate timestamps across six services.
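To make the span tree concrete, here is a minimal sketch of a span record and a text waterfall. The Span class, the render_waterfall helper, and the sample timings are illustrative assumptions, not OpenTelemetry types; real SDKs store nanosecond timestamps rather than millisecond offsets.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Simplified span record; field names mirror the list above."""
    span_id: str
    parent_span_id: Optional[str]  # None marks the root span
    name: str
    start_ms: int  # offset from trace start, chosen for readability
    end_ms: int
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def render_waterfall(spans: list[Span]) -> list[str]:
    """Return one indented line per span, depth-first from the root."""
    children: dict[Optional[str], list[Span]] = {}
    for span in spans:
        children.setdefault(span.parent_span_id, []).append(span)

    lines: list[str] = []

    def walk(parent_id: Optional[str], depth: int) -> None:
        for span in sorted(children.get(parent_id, []), key=lambda s: s.start_ms):
            indent = "  " * depth
            lines.append(
                f"{indent}{span.name} [{span.start_ms}-{span.end_ms} ms] {span.duration_ms} ms"
            )
            walk(span.span_id, depth + 1)

    walk(None, 0)
    return lines


# Hypothetical checkout trace: fraud scoring is fast, inventory is slow.
trace = [
    Span("a1", None, "GET /checkout", 0, 1000),
    Span("b2", "a1", "fraud.Score", 10, 50),
    Span("c3", "a1", "inventory.GetStock", 10, 910),
    Span("d4", "c3", "db.query", 20, 900),
]
waterfall = render_waterfall(trace)
print("\n".join(waterfall))
```

Indentation follows parent_span_id, so the slow database query shows up nested under the inventory call that waited on it.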

Span kinds

OpenTelemetry classifies spans by kind:

  • SERVER — inbound request handled by this service.
  • CLIENT — outbound call this service makes.
  • INTERNAL — in-process work (parsing, business logic).
  • PRODUCER / CONSUMER — message queue publish and receive.

Correct kind assignment matters for service maps: CLIENT spans on service A should pair with SERVER spans on service B when propagation works.
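A rough sketch of how a service map could be derived from that pairing: an edge exists wherever a SERVER span's parent is a CLIENT span. The KindSpan type, the service_edges function, and the sample spans are hypothetical; real backends use more signals than this.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class KindSpan:
    span_id: str
    parent_span_id: Optional[str]
    service: str
    kind: str  # "SERVER", "CLIENT", "INTERNAL", "PRODUCER", or "CONSUMER"


def service_edges(spans: list[KindSpan]) -> set[tuple[str, str]]:
    """Return (caller, callee) pairs where a SERVER span's parent is a CLIENT span."""
    by_id = {s.span_id: s for s in spans}
    edges = set()
    for span in spans:
        parent = by_id.get(span.parent_span_id)
        if span.kind == "SERVER" and parent is not None and parent.kind == "CLIENT":
            edges.add((parent.service, span.service))
    return edges


spans = [
    KindSpan("s1", None, "api-gateway", "SERVER"),
    KindSpan("s2", "s1", "api-gateway", "CLIENT"),
    KindSpan("s3", "s2", "inventory-service", "SERVER"),
    KindSpan("s4", "s3", "inventory-service", "INTERNAL"),
]
edges = service_edges(spans)
print(edges)
```

If the gateway's outbound call were recorded as INTERNAL instead of CLIENT, this pairing would find no edge, which is why kind assignment matters.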

Context propagation: carrying trace IDs across service boundaries

Tracing only works if every hop forwards the active trace context. The W3C Trace Context standard defines two HTTP headers:

  • traceparent — encodes version-trace_id-parent_span_id-flags.
  • tracestate — optional vendor-specific key-value pairs.

When the API gateway receives a request, it either continues an incoming trace (if the mobile client sent headers) or starts a new root span. Before calling the inventory service, the gateway injects the current span's context into outbound headers. The inventory service extracts it, creates a child SERVER span, and the tree grows.
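The extract, start-span, and inject steps can be sketched with the standard library alone. This is an illustration that assumes the version 00 header layout and a plain dict with lowercase header names; the helper names are hypothetical, and in practice the OTel propagators do this for you.

```python
import re
import secrets
from typing import NamedTuple, Optional

# version - trace-id (32 hex) - parent-id (16 hex) - trace-flags (2 hex)
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class SpanContext(NamedTuple):
    trace_id: str
    span_id: str
    sampled: bool


def extract(headers: dict[str, str]) -> Optional[SpanContext]:
    """Parse an incoming traceparent header; None means start a new trace."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # All-zero IDs and version ff are invalid per the W3C spec.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    return SpanContext(trace_id, parent_id, bool(int(flags, 16) & 0x01))


def start_span(parent: Optional[SpanContext]) -> SpanContext:
    """Continue the parent's trace with a fresh span_id, or start a new root."""
    span_id = secrets.token_hex(8)
    if parent is None:
        return SpanContext(secrets.token_hex(16), span_id, True)
    return SpanContext(parent.trace_id, span_id, parent.sampled)


def inject(ctx: SpanContext, headers: dict[str, str]) -> None:
    """Write the current span's context into outbound headers."""
    flags = "01" if ctx.sampled else "00"
    headers["traceparent"] = f"00-{ctx.trace_id}-{ctx.span_id}-{flags}"


# Gateway receives a request that already carries a trace context.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
parent = extract(incoming)
gateway_span = start_span(parent)

# Before calling the inventory service, inject the gateway span's context.
outbound: dict[str, str] = {}
inject(gateway_span, outbound)
print(outbound["traceparent"])
```

The trace_id stays constant across the hop while the span_id changes, which is what lets the backend attach the inventory span as a child of the gateway span.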

Propagation must be wired for every transport:

  • HTTP/REST and gRPC — header injection/extraction (OTel auto-instrumentation handles this for most frameworks).
  • Message queues — embed context in Kafka record headers or RabbitMQ message properties; consumer creates a linked span.
  • Background jobs — serialize context into the job payload when enqueueing; restore on worker execution.
  • Async continuations — pass context explicitly across asyncio tasks, thread pools, and callbacks; context is not magically inherited in all runtimes.
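For the background-job case, one possible shape is to store the active traceparent in a context variable, serialize it with the job, and restore it on the worker. The enqueue and run_job helpers and the payload layout are assumptions made for illustration; a real queue library would carry the value in its own headers or metadata.

```python
import contextvars
import json
from typing import Optional

# Holds the active traceparent string for the current execution context.
current_traceparent = contextvars.ContextVar("current_traceparent", default=None)


def enqueue(task: str, args: dict) -> str:
    """Serialize the job together with the caller's trace context."""
    return json.dumps(
        {"task": task, "args": args, "traceparent": current_traceparent.get()}
    )


def run_job(payload: str) -> Optional[str]:
    """Worker side: restore the context before running the task body."""
    job = json.loads(payload)
    token = current_traceparent.set(job.get("traceparent"))
    try:
        # The task body, and anything it calls, now sees the producer's context.
        return current_traceparent.get()
    finally:
        current_traceparent.reset(token)


# Producer: a request handler with an active trace enqueues a job.
token = current_traceparent.set("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
payload = enqueue("send_receipt", {"order_id": 42})
current_traceparent.reset(token)

# Worker: runs later, possibly in another process.
seen_by_worker = run_job(payload)
print(seen_by_worker)
```

Resetting the variable in a finally block keeps one job's context from leaking into the next job handled by the same worker.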

Broken propagation produces orphan spans — fragments with no parent that appear as separate traces in the UI. This is the most common tracing bug in production and usually means someone called a service via a path that skips instrumentation (a raw urllib call, an internal load balancer hop, or a message without headers).
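One heuristic for spotting this in exported data is to flag root spans emitted by services that should never start a trace on their own. The sketch below assumes spans are plain dicts and that api-gateway is the only legitimate entry point; both are illustrative assumptions.

```python
# Services allowed to start a trace (assumption for this example).
ENTRY_SERVICES = {"api-gateway"}


def suspicious_roots(spans: list[dict]) -> list[dict]:
    """Return root spans that come from a service that is not an entry point."""
    return [
        s for s in spans
        if s["parent_span_id"] is None and s["service"] not in ENTRY_SERVICES
    ]


spans = [
    {"trace_id": "t1", "span_id": "a", "parent_span_id": None, "service": "api-gateway"},
    {"trace_id": "t1", "span_id": "b", "parent_span_id": "a", "service": "payment-service"},
    # Inventory started its own trace: the caller did not forward traceparent.
    {"trace_id": "t2", "span_id": "c", "parent_span_id": None, "service": "inventory-service"},
]
orphans = suspicious_roots(spans)
print([s["service"] for s in orphans])
```

A steady count of such roots for an internal service is a hint that one of its callers skips context injection.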

OpenTelemetry architecture: SDK, API, collector, backends

OpenTelemetry separates instrumentation from export so you can switch backends without rewriting code.

In-process SDK

Each service links the OTel SDK (available for Java, Go, Python, Node.js, .NET, Rust, and more). The SDK provides:

  • TracerProvider — factory for tracers; configured with resource attributes (service.name, deployment.environment).
  • Span processors — batch spans and hand them to exporters.
  • Samplers — decide whether to record a trace (see below).
  • Exporters — send OTLP (OpenTelemetry Protocol) over gRPC or HTTP to a collector or directly to a vendor.
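The processor-to-exporter handoff can be pictured with a toy batching processor. The class names and the batch size of 3 are illustrative assumptions; SDK batch processors also flush on a timer and export on a background thread, which this sketch omits.

```python
class InMemoryExporter:
    """Stand-in for an OTLP exporter; collects batches instead of sending them."""

    def __init__(self) -> None:
        self.batches: list[list[dict]] = []

    def export(self, batch: list[dict]) -> None:
        self.batches.append(batch)


class BatchingProcessor:
    """Buffer finished spans and export them in groups of max_batch_size."""

    def __init__(self, exporter: InMemoryExporter, max_batch_size: int = 3) -> None:
        self.exporter = exporter
        self.max_batch_size = max_batch_size
        self._buffer: list[dict] = []

    def on_end(self, span: dict) -> None:
        """Called when a span finishes; export once the buffer is full."""
        self._buffer.append(span)
        if len(self._buffer) >= self.max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Export whatever is buffered, for example at shutdown."""
        if self._buffer:
            self.exporter.export(self._buffer)
            self._buffer = []


exporter = InMemoryExporter()
processor = BatchingProcessor(exporter, max_batch_size=3)
for i in range(7):
    processor.on_end({"name": f"span-{i}"})
processor.flush()  # without this, the last span would stay in the buffer
print([len(batch) for batch in exporter.batches])
```

Batching trades a small export delay for far fewer network calls, and the final flush is why services should shut the SDK down cleanly.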

Auto-instrumentation

Language agents can patch frameworks at runtime (Express, FastAPI, Spring Boot, Django, database drivers, Redis, HTTP clients) without code changes. As a rule of thumb, auto-instrumentation gets you roughly 80% coverage in an afternoon. You still add manual spans around business-critical sections such as "calculate shipping", "apply discount rules", and "call payment gateway", where latency hides inside a generic HTTP handler span.

Collector

The OpenTelemetry Collector is a standalone process, deployed as an agent or a gateway, that receives OTLP, applies processors (batch, filter, attribute enrichment), and fans out to multiple backends. Running a collector per cluster as a gateway (or per node as an agent, or per pod as a sidecar) keeps vendor credentials out of application code and lets you change export targets without redeploying apps.
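Conceptually the collector is a chain of processors followed by a fan-out to exporters. The Python sketch below models only that data flow; the function names, the /healthz filter, and the flat span dicts are assumptions, and a real collector is configured declaratively rather than written as code.

```python
from typing import Callable, Optional

Span = dict
Processor = Callable[[Span], Optional[Span]]  # return None to drop the span
Exporter = Callable[[Span], None]


def drop_health_checks(span: Span) -> Optional[Span]:
    """Filter processor: discard noisy health-check spans."""
    return None if span["name"] == "GET /healthz" else span


def enrich(span: Span) -> Span:
    """Attribute processor: stamp every span with the environment."""
    return {**span, "deployment.environment": "production"}


def run_pipeline(
    spans: list[Span], processors: list[Processor], exporters: list[Exporter]
) -> None:
    for span in spans:
        for process in processors:
            span = process(span)
            if span is None:
                break  # dropped by a processor; skip the exporters
        else:
            for export in exporters:
                export(span)  # fan out the same span to every backend


tempo, jaeger = [], []  # stand-ins for two export targets
run_pipeline(
    [{"name": "GET /healthz"}, {"name": "GET /checkout"}],
    [drop_health_checks, enrich],
    [tempo.append, jaeger.append],
)
print(tempo)
```

Adding or removing a backend here changes only the exporter list, which mirrors how a collector lets you retarget exports without touching application code.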

Backends

Common open-source trace stores: Jaeger, Grafana Tempo (pairs with Loki logs and Prometheus metrics in Grafana), Zipkin. Commercial options (Datadog APM, Honeycomb, New Relic) accept OTLP natively. Pick based on retention cost, query UX, and correlation with your existing metrics/logs stack.

Sampling: capturing enough signal without blowing the budget

At 10,000 requests per second, storing every span for 30 days is expensive. Sampling decides which traces to keep.

Head-based sampling

The decision happens at trace start, typically as a fixed percentage (e.g. 10%) or a rate limiter (100 traces/second). Simple and cheap, but you might discard the one slow trace that mattered. Keep the decision consistent across services: use a ratio sampler that derives its answer from the trace_id, so the same trace always gets the same result, and have downstream services honor the caller's decision, which travels in the sampled bit of the traceparent trace flags.
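A simplified sketch of a consistent head sampler: the ratio decision is a pure function of the trace_id, and a decision already made upstream wins. The function names are illustrative and the threshold arithmetic is an approximation of the idea behind OTel's TraceIdRatioBased and ParentBased samplers, not their exact algorithm.

```python
from typing import Optional


def ratio_decision(trace_id_hex: str, ratio: float) -> bool:
    """Deterministic: the same trace_id and ratio always give the same answer."""
    low_64_bits = int(trace_id_hex, 16) & ((1 << 64) - 1)
    return low_64_bits < int(ratio * (1 << 64))


def should_sample(
    trace_id_hex: str, ratio: float, parent_sampled: Optional[bool] = None
) -> bool:
    """Honor the caller's sampled flag if present, else decide from the trace_id."""
    if parent_sampled is not None:
        return parent_sampled
    return ratio_decision(trace_id_hex, ratio)


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
root_decision = should_sample(trace_id, ratio=0.10)
# A downstream service with a different local ratio still follows the root.
downstream_decision = should_sample(trace_id, ratio=0.70, parent_sampled=root_decision)
print(root_decision, downstream_decision)
```

Because the downstream service follows the propagated flag, a trace is either recorded on every hop or on none, which avoids partial traces.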

Tail-based sampling

The collector buffers spans and decides after the trace completes — keep all errors, all traces above 2 seconds, and a 1% sample of happy paths. Captures rare failures you would miss with pure head sampling, but needs more memory and adds export delay.
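The keep rules described above translate into a short policy function applied once all spans of a trace have arrived. This is a sketch with assumed names and a plain dict span shape; in a real deployment the collector's tail sampling processor applies comparable policies from configuration.

```python
import random


def keep_trace(spans, latency_threshold_ms=2000, baseline_ratio=0.01, rng=random.random):
    """Decide after the trace completes: errors, slow traces, then a small sample."""
    if any(s["status"] == "ERROR" for s in spans):
        return True
    root = next(s for s in spans if s["parent_span_id"] is None)
    if root["duration_ms"] > latency_threshold_ms:
        return True
    # Happy path: keep only a small random share.
    return rng() < baseline_ratio


fast_ok = [{"parent_span_id": None, "status": "OK", "duration_ms": 180}]
slow_ok = [{"parent_span_id": None, "status": "OK", "duration_ms": 6120}]
fast_error = [
    {"parent_span_id": None, "status": "OK", "duration_ms": 150},
    {"parent_span_id": "root", "status": "ERROR", "duration_ms": 40},
]

print(keep_trace(slow_ok), keep_trace(fast_error))
```

The rng parameter exists so the random branch can be pinned in tests; the buffering that makes this decision possible is what costs the extra memory and export delay.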

Practical defaults

  • Development: sample 100% — cost is negligible.
  • Staging: 50–100% for pre-release validation.
  • Production: 1–10% head-based plus tail rules for errors and high latency.
  • Always record (bypass sampling) for internal admin or canary traffic via a custom sampler checking a header or attribute.

Pair sampling policy with SLO error budgets: if latency SLO is burning, temporarily raise trace sample rate to debug faster.

Worked example: debugging a slow checkout trace

A user reports checkout took 6 seconds. Support pulls the request_id from the receipt email. Your structured logs show trace_id=7f3a9c2b... on the API gateway line. You open Jaeger, search by trace ID, and see:

  • api-gateway GET /checkout — 6,120 ms total (root span).
  • inventory-service gRPC GetStock — 4,890 ms (child of gateway).
  • inventory-service db.query SELECT ... — 4,850 ms (child of gRPC span).
  • payment-service POST /charge — 180 ms (parallel sibling).
  • fraud-service Score — 95 ms (parallel sibling).

The waterfall shows payment and fraud finished quickly; the gateway blocked on inventory. The DB span attributes reveal db.statement=SELECT * FROM stock WHERE sku IN (...) with 4,200 bind parameters: a classic N+1 lookup pattern folded into one giant query, with no index to serve it. Metrics showed inventory p95 at 200 ms because most queries are fast; the outlier is visible only because this particular trace was kept by sampling.
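The same reading of the waterfall can be done programmatically by following the slowest child from the root. The durations below are the ones listed above; the span IDs, the dict layout, and the slowest_path helper are hypothetical.

```python
spans = [
    {"id": "1", "parent": None, "name": "api-gateway GET /checkout", "ms": 6120},
    {"id": "2", "parent": "1", "name": "inventory-service gRPC GetStock", "ms": 4890},
    {"id": "3", "parent": "2", "name": "inventory-service db.query", "ms": 4850},
    {"id": "4", "parent": "1", "name": "payment-service POST /charge", "ms": 180},
    {"id": "5", "parent": "1", "name": "fraud-service Score", "ms": 95},
]


def slowest_path(spans):
    """Walk from the root, always descending into the longest-running child."""
    children = {}
    for s in spans:
        children.setdefault(s["parent"], []).append(s)
    path = []
    node = children[None][0]  # assumes exactly one root span
    while node is not None:
        path.append(node)
        kids = children.get(node["id"], [])
        node = max(kids, key=lambda s: s["ms"]) if kids else None
    return path


path = slowest_path(spans)
for depth, s in enumerate(path):
    print("  " * depth + f"{s['name']}: {s['ms']} ms")

share = path[-1]["ms"] / path[0]["ms"]
print(f"db.query share of total: {share:.0%}")
```

The database query accounts for about 79% of the root span. The remaining 1,230 ms between the root (6,120 ms) and the inventory call (4,890 ms) is not covered by any listed child span, which is one more reason to add manual spans inside the gateway handler.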

Fix: add a covering index on (sku, warehouse_id), batch stock lookups, and add a manual span around "resolve cart line items" so future regressions are obvious even if the HTTP handler span looks normal.

Decision table: tracing approaches

| Scenario | Recommended approach | Why |
|---|---|---|
| Greenfield microservices | OTel auto-instrumentation + collector + Tempo/Jaeger | Vendor-neutral, fast rollout, full service maps |
| Monolith first trace adoption | OTel SDK + manual spans on top 5 endpoints | High ROI before splitting services; proves value cheaply |
| Already on Datadog/New Relic | OTel OTLP export to vendor | Unified instrumentation; escape partial vendor lock-in |
| High-volume edge (CDN, API gateway) | 1% head sampling + tail keep errors/slow | Controls storage cost; tail catches incidents |
| Async event pipeline | Propagate context in message headers + CONSUMER spans | Links producer and consumer into one trace |
| Debugging one user report | Lookup by trace_id or request_id in logs | Logs and traces must share IDs; enforce in middleware |

Common pitfalls

  • Orphan spans from broken propagation — audit every HTTP client, gRPC stub, and queue producer for context injection.
  • High-cardinality attributes — never put raw user emails, full URLs with query strings, or unbounded IDs on spans used for aggregation; they explode backend index size.
  • PII in traces — treat span attributes like logs; redact or hash sensitive fields. Traces often have longer retention than logs.
  • 100% sampling in production — works until traffic doubles; set budgets and alert on collector export queue depth.
  • Spans without status on errors — a 500 response with span status OK hides failures in error-rate dashboards.
  • Ignoring service.name — duplicate or missing service names make service maps useless. Set via environment variable OTEL_SERVICE_NAME in every deployment.
  • Tracing without log correlation — if logs lack trace_id, you cannot pivot from a metric alert to the exact trace.
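For the log-correlation pitfall, one common shape is a logging filter that copies the active trace and span IDs onto every record. The sketch below uses only the standard library; the active_trace context variable, the JSON field names, and the in-memory stream are assumptions for illustration, and OTel's logging instrumentation can inject these fields for you.

```python
import contextvars
import io
import json
import logging

# Set by tracing middleware at the start of each request (assumption).
active_trace = contextvars.ContextVar(
    "active_trace", default={"trace_id": None, "span_id": None}
)


class TraceContextFilter(logging.Filter):
    """Attach the active trace_id and span_id to every log record."""

    def filter(self, record):
        ctx = active_trace.get()
        record.trace_id = ctx["trace_id"]
        record.span_id = ctx["span_id"]
        return True  # never drop the record, only annotate it


class JsonFormatter(logging.Formatter):
    def format(self, record):
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": record.trace_id,
            "span_id": record.span_id,
        })


stream = io.StringIO()  # in-memory sink so the example is self-checking
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

active_trace.set({"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "span_id": "00f067aa0ba902b7"})
logger.info("checkout started")

log_line = json.loads(stream.getvalue())
print(log_line)
```

With trace_id on every log line, a support engineer can go from a request's log entry straight to the matching trace in the backend.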

Production checklist

  • Set service.name, service.version, and deployment.environment on every workload.
  • Enable auto-instrumentation for HTTP server, HTTP client, DB, and cache libraries.
  • Add manual spans around top 3 business operations per service.
  • Inject trace_id and span_id into structured log records.
  • Deploy an OTel Collector with batch processor and OTLP exporter.
  • Configure head sampling (1–10%) plus tail rules for errors and p99 outliers.
  • Verify propagation with a multi-hop integration test in CI.
  • Document trace search runbook: trace ID lookup, common queries, retention period.
  • Alert on collector export failures and span queue saturation.
  • Review cardinality monthly — drop unused attributes and expensive custom spans.

Key takeaways

  • Distributed tracing shows end-to-end request latency as a tree of spans — essential for microservices debugging.
  • OpenTelemetry instruments once and exports via OTLP to any backend; the collector decouples apps from vendors.
  • W3C Trace Context (traceparent) must propagate across every sync and async boundary or spans orphan.
  • Sampling balances cost and signal — combine head-based rates with tail-based keep rules for errors and slow traces.
  • Traces deliver maximum value when paired with correlated structured logs and SLO-driven alert policies.

Related reading