Distributed tracing and OpenTelemetry explained
Checkout latency jumped from 200 ms to 2.4 seconds. Your API gateway dashboard looks healthy. Each downstream service reports normal p95. The bottleneck is invisible because no single service owns the full path: the delay lives in the gaps between calls. Distributed tracing records every hop a request takes across your stack as a tree of timed spans, so you can see exactly which database query, cache miss, or RPC accounts for the extra 2.2 seconds.
OpenTelemetry (OTel) is the vendor-neutral standard that lets you instrument code once and export traces (plus metrics and logs) to Jaeger, Grafana Tempo, Datadog, Honeycomb, or any OTLP-compatible backend.
This guide covers traces and spans, W3C context propagation, manual vs auto-instrumentation, the collector pipeline, sampling strategies, pairing traces with structured logs, a checkout latency worked example, a backend decision table, common pitfalls, and a production checklist. For how tracing fits the broader observability stack, start with "Observability explained" (listed under Related reading); this guide goes deep on traces alone.
Traces, spans, and the request waterfall
A trace represents one logical operation end to end — an HTTP
request, a Kafka message handler, a cron job. It has a unique
trace_id (128-bit hex) shared by every span in the tree.
A span is a single unit of work within that trace: an inbound HTTP handler, a PostgreSQL query, an outbound gRPC call to the inventory service. Each span records:
- span_id — unique within the trace.
- parent_span_id — links child spans into a tree (root span has no parent).
- name — operation identifier, e.g. GET /checkout or db.query.
- start_time / end_time — wall-clock timestamps; their difference is the duration of the operation.
- status — OK, ERROR, or UNSET; error spans should set status and record the exception.
- attributes — key-value metadata: http.method, db.statement (hashed), user.tier.
- events — timestamped annotations within a span (e.g. "cache miss").
Visualized as a Gantt chart — the "waterfall" — spans reveal parallelism, serial bottlenecks, and unexpected fan-out. In a microservices checkout, you might see the payment service waiting 900 ms on inventory while fraud scoring finished in 40 ms. Metrics alone cannot show that relationship; logs can, but only if you manually correlate timestamps across six services.
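The span fields and the waterfall view can be sketched in a few lines of plain Python. This is a toy model for illustration only: the `Span` dataclass, the `waterfall` helper, and the timings are hypothetical and are not the OpenTelemetry SDK's classes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Span:
    """Toy span record; field names mirror the list above, not the OTel SDK."""
    span_id: str
    name: str
    start_ms: int
    end_ms: int
    parent_span_id: Optional[str] = None  # None marks the root span
    status: str = "UNSET"
    attributes: dict = field(default_factory=dict)

    @property
    def duration_ms(self) -> int:
        return self.end_ms - self.start_ms


def waterfall(spans: list) -> list:
    """Return one indented line per span, depth-first, children ordered by start time."""
    children: dict = {}
    for s in spans:
        children.setdefault(s.parent_span_id, []).append(s)

    lines: list = []

    def visit(parent_id: Optional[str], depth: int) -> None:
        for s in sorted(children.get(parent_id, []), key=lambda x: x.start_ms):
            lines.append(f"{'  ' * depth}{s.name} [{s.start_ms}-{s.end_ms} ms] {s.duration_ms} ms")
            visit(s.span_id, depth + 1)

    visit(None, 0)
    return lines


# Hypothetical timings: payment waits on inventory while fraud scoring finishes early.
spans = [
    Span("a1", "POST /checkout", 0, 1000),
    Span("b1", "fraud.score", 10, 50, parent_span_id="a1"),
    Span("b2", "payment.charge", 10, 980, parent_span_id="a1"),
    Span("c1", "inventory.reserve", 60, 960, parent_span_id="b2"),
]
for line in waterfall(spans):
    print(line)
```

The indentation in the printed lines comes purely from the parent links, which is why a missing parent_span_id breaks the picture.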
Span kinds
OpenTelemetry classifies spans by kind:
- SERVER — inbound request handled by this service.
- CLIENT — outbound call this service makes.
- INTERNAL — in-process work (parsing, business logic).
- PRODUCER / CONSUMER — message queue publish and receive.
Correct kind assignment matters for service maps: CLIENT spans on service A should pair with SERVER spans on service B when propagation works.
Context propagation: carrying trace IDs across service boundaries
Tracing only works if every hop forwards the active trace context. The W3C Trace Context standard defines two HTTP headers:
- traceparent — encodes version-trace_id-parent_span_id-flags.
- tracestate — optional vendor-specific key-value pairs.
When the API gateway receives a request, it either continues an incoming trace (if the mobile client sent headers) or starts a new root span. Before calling the inventory service, the gateway injects the current span's context into outbound headers. The inventory service extracts it, creates a child SERVER span, and the tree grows.
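As a rough illustration of the extract, continue-or-start, and inject steps, here is a stdlib-only sketch that handles the version 00 traceparent layout. The helper names (`extract`, `start_span`, `inject`) are hypothetical, the headers are assumed to be a plain dict with lowercase keys, and a real service would normally rely on the OTel propagator instead.

```python
import re
import secrets
from typing import NamedTuple, Optional

# Version 00 layout: version-trace_id-parent_span_id-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


class TraceContext(NamedTuple):
    version: str
    trace_id: str
    parent_span_id: str  # the span that the next hop should use as its parent
    flags: str


def extract(headers: dict) -> Optional[TraceContext]:
    """Parse an incoming traceparent header; return None if absent or invalid."""
    match = _TRACEPARENT.match(headers.get("traceparent", "").strip())
    if match is None:
        return None
    ctx = TraceContext(*match.groups())
    # Version ff and all-zero IDs are invalid per the W3C spec.
    if ctx.version == "ff" or int(ctx.trace_id, 16) == 0 or int(ctx.parent_span_id, 16) == 0:
        return None
    return ctx


def start_span(incoming: Optional[TraceContext]) -> TraceContext:
    """Continue the incoming trace, or start a new root trace if there is none."""
    trace_id = incoming.trace_id if incoming else secrets.token_hex(16)
    flags = incoming.flags if incoming else "01"  # assumption: new roots are sampled
    return TraceContext("00", trace_id, secrets.token_hex(8), flags)


def inject(ctx: TraceContext, headers: dict) -> None:
    """Write the current span's context into outbound request headers."""
    headers["traceparent"] = "-".join(ctx)


incoming = extract({"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"})
gateway_span = start_span(incoming)
outbound: dict = {}
inject(gateway_span, outbound)
print(outbound["traceparent"])  # same trace_id, freshly generated span ID
```

The trace_id stays constant across hops while the span ID field changes at every hop; that pairing is what lets the backend rebuild the tree.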
Propagation must be wired for every transport:
- HTTP/REST and gRPC — header injection/extraction (OTel auto-instrumentation handles this for most frameworks).
- Message queues — embed context in Kafka record headers or RabbitMQ message properties; consumer creates a linked span.
- Background jobs — serialize context into the job payload when enqueueing; restore on worker execution.
- Async continuations — pass context explicitly across asyncio tasks, thread pools, and callbacks; context is not magically inherited in all runtimes.
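For the background-job case, one possible shape is to carry the traceparent string inside the job payload and restore it in the worker. The sketch below uses an in-memory queue and a ContextVar as stand-ins; `enqueue`, `run_next_job`, and the payload layout are hypothetical, not any job framework's API.

```python
import json
import queue
from contextvars import ContextVar
from typing import Optional

# Stand-in for "the active trace context" of the current request or job.
current_traceparent: ContextVar = ContextVar("current_traceparent", default=None)
jobs: queue.Queue = queue.Queue()  # stand-in for a real job broker


def enqueue(task: str, args: dict) -> None:
    """Serialize the active trace context into the payload at enqueue time."""
    payload = {"task": task, "args": args, "traceparent": current_traceparent.get()}
    jobs.put(json.dumps(payload))


def run_next_job() -> dict:
    """Restore the trace context before running the job, then clean up."""
    payload = json.loads(jobs.get_nowait())
    token = current_traceparent.set(payload.get("traceparent"))
    try:
        # Real work would go here; any span started now can join the original trace.
        return {"task": payload["task"], "traceparent": current_traceparent.get()}
    finally:
        current_traceparent.reset(token)


# Request handler side: a trace is active while the job is enqueued.
current_traceparent.set("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
enqueue("send_receipt", {"order_id": 42})
current_traceparent.set(None)  # the request has finished

# Worker side, possibly much later and in another process.
result = run_next_job()
print(result)
```

Because the context travels with the payload, the link survives even when the worker runs long after the originating request has ended.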
Broken propagation produces orphan spans — fragments with
no parent that appear as separate traces in the UI. This is the most common
tracing bug in production and usually means someone called a service via a
path that skips instrumentation (a raw urllib call, an internal
load balancer hop, or a message without headers).
OpenTelemetry architecture: SDK, API, collector, backends
OpenTelemetry separates instrumentation from export so you can switch backends without rewriting code.
In-process SDK
Each service links the OTel SDK (available for Java, Go, Python, Node.js, .NET, Rust, and more). The SDK provides:
- TracerProvider — factory for tracers; configured with resource attributes (service.name, deployment.environment).
- Span processors — batch spans and hand them to exporters.
- Samplers — decide whether to record a trace (see below).
- Exporters — send OTLP (OpenTelemetry Protocol) over gRPC or HTTP to a collector or directly to a vendor.
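To make the processor-to-exporter handoff concrete, here is a deliberately simplified batching sketch. `BatchingProcessor` is a hypothetical stand-in written for this guide, not the SDK's batch span processor, which (among other things) also flushes on a timer and bounds its queue.

```python
from typing import Callable


class BatchingProcessor:
    """Buffer finished spans and hand them to an exporter callable in batches."""

    def __init__(self, export: Callable, max_batch_size: int = 3):
        self._export = export
        self._max_batch_size = max_batch_size
        self._buffer: list = []

    def on_end(self, span: dict) -> None:
        """Called when a span finishes; export once the batch is full."""
        self._buffer.append(span)
        if len(self._buffer) >= self._max_batch_size:
            self.flush()

    def flush(self) -> None:
        """Export whatever is buffered (call this on shutdown too)."""
        if self._buffer:
            batch, self._buffer = self._buffer, []
            self._export(batch)


exported: list = []  # stand-in exporter: collect batches in memory
processor = BatchingProcessor(exported.append, max_batch_size=3)
for i in range(7):
    processor.on_end({"name": f"span-{i}"})
processor.flush()  # drain the final partial batch
print([len(batch) for batch in exported])
```

Batching amortizes network calls across many spans; the explicit flush on shutdown is what prevents the last partial batch from being lost.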
Auto-instrumentation
Language agents can patch frameworks at runtime (Express, FastAPI, Spring Boot, Django, database drivers, Redis, HTTP clients) without code changes. Auto-instrumentation typically covers most inbound requests, outbound calls, and database queries within an afternoon. You still add manual spans around business-critical sections such as "calculate shipping", "apply discount rules", and "call payment gateway", where latency hides inside a generic HTTP handler span.
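What a manual span adds can be sketched with a small context manager that times a block and marks failures. `manual_span`, `finished_spans`, and the discount logic are hypothetical illustrations; with the real SDK you would obtain a tracer and start spans through it.

```python
import time
from contextlib import contextmanager

finished_spans: list = []  # stand-in for a span processor


@contextmanager
def manual_span(name: str, **attributes):
    """Time the wrapped block; record ERROR status if it raises."""
    record = {"name": name, "attributes": dict(attributes), "status": "UNSET"}
    start = time.perf_counter()
    try:
        yield record
    except Exception as exc:
        record["status"] = "ERROR"
        record["exception"] = repr(exc)
        raise
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        finished_spans.append(record)


def apply_discount_rules(cart_total: float, tier: str) -> float:
    """Business logic that would otherwise hide inside the HTTP handler span."""
    with manual_span("apply discount rules", **{"user.tier": tier}) as span:
        discount = 0.10 if tier == "gold" else 0.0
        span["attributes"]["discount.rate"] = discount
        return round(cart_total * (1 - discount), 2)


total = apply_discount_rules(100.0, "gold")
print(total, finished_spans[-1]["name"], finished_spans[-1]["status"])
```

The span is recorded in the finally block, so a failing operation still produces a span, with its status set to ERROR.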
Collector
The OpenTelemetry Collector is a standalone agent or gateway that receives OTLP, applies processors (batch, filter, attribute enrichment), and fans out to multiple backends. Running a collector per cluster as a gateway (or per node as an agent, or per pod as a sidecar) keeps vendor credentials out of application containers and lets you change export targets without redeploying apps.
Backends
Common open-source trace stores: Jaeger, Grafana Tempo (pairs with Loki logs and Prometheus metrics in Grafana), Zipkin. Commercial options (Datadog APM, Honeycomb, New Relic) accept OTLP natively. Pick based on retention cost, query UX, and correlation with your existing metrics/logs stack.
Sampling: capturing enough signal without blowing the budget
At 10,000 requests per second, storing every span for 30 days is expensive. Sampling decides which traces to keep.
Head-based sampling
The decision happens at trace start, typically a fixed percentage (e.g.
10%) or a rate limiter (100 traces/second). Simple and cheap,
but you might discard the one slow trace that mattered. Keep the decision
consistent across services: derive it deterministically from the trace_id
at the root, and have downstream services honor that decision, which
travels in the sampled bit of the traceparent flags.
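One way to get a deterministic decision is to compare part of the trace_id against a threshold derived from the sampling rate. The sketch below follows that idea; `should_sample` is a hypothetical helper, and using the low 64 bits is an assumption of this sketch, so do not expect its decisions to match any particular SDK bit for bit.

```python
def should_sample(trace_id_hex: str, rate: float) -> bool:
    """Deterministic head sampling: the same trace_id always yields the same answer."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be between 0.0 and 1.0")
    low64 = int(trace_id_hex, 16) & 0xFFFFFFFFFFFFFFFF  # assumption: use the low 64 bits
    return low64 < int(rate * (1 << 64))


def trace_flags(sampled: bool) -> str:
    """Encode the decision as the traceparent flags field so downstream hops can honor it."""
    return "01" if sampled else "00"


trace_id = "4bf92f3577b34da6a3ce929d0e0e4736"
for rate in (0.0, 0.1, 0.7, 1.0):
    decision = should_sample(trace_id, rate)
    print(rate, decision, trace_flags(decision))
```

Because the decision depends only on the trace_id and the rate, two services configured with the same rate agree without coordinating, and raising the rate only ever adds traces to the kept set.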
Tail-based sampling
The collector buffers spans and decides after the trace completes — keep all errors, all traces above 2 seconds, and a 1% sample of happy paths. Captures rare failures you would miss with pure head sampling, but needs more memory and adds export delay.
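The keep rules described here can be written as a small policy function over a completed trace. This is a sketch under assumptions: `keep_trace`, the span dict layout, and the injectable `rng` parameter are hypothetical, and a real collector expresses these rules as configuration rather than application code.

```python
import random
from typing import Callable


def keep_trace(
    spans: list,
    latency_threshold_ms: int = 2000,
    baseline_rate: float = 0.01,
    rng: Callable = random.random,
) -> bool:
    """Decide after the trace completes: keep errors, slow traces, and a small baseline."""
    if not spans:
        return False
    # Rule 1: keep every trace that contains an error span.
    if any(s.get("status") == "ERROR" for s in spans):
        return True
    # Rule 2: keep traces whose end-to-end duration exceeds the threshold.
    duration_ms = max(s["end_ms"] for s in spans) - min(s["start_ms"] for s in spans)
    if duration_ms > latency_threshold_ms:
        return True
    # Rule 3: keep a small random sample of healthy traces.
    return rng() < baseline_rate


fast_ok = [{"start_ms": 0, "end_ms": 120, "status": "OK"}]
slow_ok = [{"start_ms": 0, "end_ms": 2500, "status": "OK"}]
fast_error = [{"start_ms": 0, "end_ms": 80, "status": "ERROR"}]

print(keep_trace(fast_error), keep_trace(slow_ok), keep_trace(fast_ok, rng=lambda: 0.5))
```

The function needs every span of the trace before it can answer, which is exactly why tail sampling costs buffer memory and delays export.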
Practical defaults
- Development: sample 100% — cost is negligible.
- Staging: 50–100% for pre-release validation.
- Production: 1–10% head-based plus tail rules for errors and high latency.
- Always record (bypass sampling) for internal admin or canary traffic via a custom sampler checking a header or attribute.
Pair sampling policy with SLO error budgets: if latency SLO is burning, temporarily raise trace sample rate to debug faster.
Worked example: debugging a slow checkout trace
A user reports checkout took 6 seconds. Support pulls the
request_id from the receipt email. Your structured logs
show trace_id=7f3a9c2b... on the API gateway line. You open
Jaeger, search by trace ID, and see:
- api-gateway GET /checkout — 6,120 ms total (root span).
- inventory-service gRPC GetStock — 4,890 ms (child of gateway).
- inventory-service db.query SELECT ... — 4,850 ms (child of gRPC span).
- payment-service POST /charge — 180 ms (parallel sibling).
- fraud-service Score — 95 ms (parallel sibling).
The waterfall shows payment and fraud finished quickly; the gateway blocked
on inventory. The DB span attributes reveal
db.statement=SELECT * FROM stock WHERE sku IN (...) with
4,200 bind parameters: per-item lookups collapsed into one giant IN query
that runs without a usable index. Metrics showed inventory p95 at 200 ms
because most queries are fast; this request was an outlier far above p95,
visible only because its trace was retained by sampling.
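The same reading of the waterfall can be done mechanically by starting at the root and repeatedly following the slowest child. The sketch below uses the durations from the listing; the span IDs, the parent links (payment and fraud as children of the gateway span), and the `slowest_chain` helper are assumptions made for illustration.

```python
# Durations come from the trace listing; ids and parent links are assumed.
spans = [
    {"id": "1", "parent": None, "name": "api-gateway GET /checkout", "ms": 6120},
    {"id": "2", "parent": "1", "name": "inventory-service gRPC GetStock", "ms": 4890},
    {"id": "3", "parent": "2", "name": "inventory-service db.query", "ms": 4850},
    {"id": "4", "parent": "1", "name": "payment-service POST /charge", "ms": 180},
    {"id": "5", "parent": "1", "name": "fraud-service Score", "ms": 95},
]


def slowest_chain(spans: list) -> list:
    """Walk from the root, always descending into the longest child span."""
    by_parent: dict = {}
    for s in spans:
        by_parent.setdefault(s["parent"], []).append(s)

    chain = []
    current = by_parent[None][0]  # assumes exactly one root span
    while current is not None:
        chain.append(current)
        kids = by_parent.get(current["id"], [])
        current = max(kids, key=lambda s: s["ms"]) if kids else None
    return chain


for depth, span in enumerate(slowest_chain(spans)):
    print(f"{'  ' * depth}{span['name']}: {span['ms']} ms")
```

The chain ends at the database span, which accounts for 4,850 of the 4,890 ms spent in the gRPC call, so the query itself rather than the network hop is the place to look.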
Fix: add a covering index on (sku, warehouse_id), split stock
lookups into bounded batches instead of one unbounded IN list, and add a
manual span around "resolve cart line items" so future regressions are
obvious even if the HTTP handler span looks normal.
Decision table: tracing approaches
| Scenario | Recommended approach | Why |
|---|---|---|
| Greenfield microservices | OTel auto-instrumentation + collector + Tempo/Jaeger | Vendor-neutral, fast rollout, full service maps |
| Monolith first trace adoption | OTel SDK + manual spans on top 5 endpoints | High ROI before splitting services; proves value cheaply |
| Already on Datadog/New Relic | OTel OTLP export to vendor | Unified instrumentation; escape partial vendor lock-in |
| High-volume edge (CDN, API gateway) | 1% head sampling + tail keep errors/slow | Controls storage cost; tail catches incidents |
| Async event pipeline | Propagate context in message headers + CONSUMER spans | Links producer and consumer into one trace |
| Debugging one user report | Lookup by trace_id or request_id in logs | Logs and traces must share IDs — enforce in middleware |
Common pitfalls
- Orphan spans from broken propagation — audit every HTTP client, gRPC stub, and queue producer for context injection.
- High-cardinality attributes — never put raw user emails, full URLs with query strings, or unbounded IDs on spans used for aggregation; they explode backend index size.
- PII in traces — treat span attributes like logs; redact or hash sensitive fields. Traces often have longer retention than logs.
- 100% sampling in production — works until traffic doubles; set budgets and alert on collector export queue depth.
- Spans without status on errors — a 500 response with span status OK hides failures in error-rate dashboards.
- Ignoring service.name — duplicate or missing service names make service maps useless. Set via environment variable OTEL_SERVICE_NAME in every deployment.
- Tracing without log correlation — if logs lack trace_id, you cannot pivot from a metric alert to the exact trace.
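For the log-correlation pitfall, a minimal sketch with the standard logging module might look like the following. The ContextVars stand in for the active span context, and `TraceContextFilter` and `JsonFormatter` are hypothetical names; OTel logging integrations can do this for you, so treat this as an illustration of the idea rather than the recommended wiring.

```python
import io
import json
import logging
from contextvars import ContextVar

# Stand-ins for the active span context; a real service would read these from its tracer.
trace_id_var: ContextVar = ContextVar("trace_id", default="")
span_id_var: ContextVar = ContextVar("span_id", default="")


class TraceContextFilter(logging.Filter):
    """Attach the current trace_id and span_id to every log record."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.trace_id = trace_id_var.get()
        record.span_id = span_id_var.get()
        return True


class JsonFormatter(logging.Formatter):
    """Emit one JSON object per line so the IDs are searchable fields."""

    def format(self, record: logging.LogRecord) -> str:
        return json.dumps({
            "level": record.levelname,
            "message": record.getMessage(),
            "trace_id": getattr(record, "trace_id", ""),
            "span_id": getattr(record, "span_id", ""),
        })


stream = io.StringIO()  # in-memory sink so the example is self-contained
handler = logging.StreamHandler(stream)
handler.addFilter(TraceContextFilter())
handler.setFormatter(JsonFormatter())

logger = logging.getLogger("checkout")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False

trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
span_id_var.set("00f067aa0ba902b7")
logger.info("checkout started")
print(stream.getvalue().strip())
```

With the IDs present as structured fields, a log search for one trace_id returns every line written while that trace was active.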
Production checklist
- Set service.name, service.version, and deployment.environment on every workload.
- Enable auto-instrumentation for HTTP server, HTTP client, DB, and cache libraries.
- Add manual spans around top 3 business operations per service.
- Inject trace_id and span_id into structured log records.
- Deploy an OTel Collector with batch processor and OTLP exporter.
- Configure head sampling (1–10%) plus tail rules for errors and p99 outliers.
- Verify propagation with a multi-hop integration test in CI.
- Document trace search runbook: trace ID lookup, common queries, retention period.
- Alert on collector export failures and span queue saturation.
- Review cardinality monthly — drop unused attributes and expensive custom spans.
Key takeaways
- Distributed tracing shows end-to-end request latency as a tree of spans — essential for microservices debugging.
- OpenTelemetry instruments once and exports via OTLP to any backend; the collector decouples apps from vendors.
- W3C Trace Context (traceparent) must propagate across every sync and async boundary or spans orphan.
- Sampling balances cost and signal — combine head-based rates with tail-based keep rules for errors and slow traces.
- Traces deliver maximum value when paired with correlated structured logs and SLO-driven alert policies.
Related reading
- Observability explained — metrics, logs, and traces as three pillars
- Structured logging explained — correlation IDs and trace_id in JSON logs
- Microservices architecture explained — service boundaries where tracing matters most
- SLOs and error budgets explained — tie trace sampling to reliability targets