
Structured logging explained

Your payment service throws an error at 3 a.m. You grep the log file for timeout and get 40,000 lines — most unrelated, none showing which user or request failed. That is the failure mode of unstructured logging: human-readable strings that machines cannot reliably parse. Structured logging emits each event as a record with fixed fields (timestamp, level, message, context) — usually JSON — so log aggregators can filter, count, and alert on exact conditions. Combined with correlation IDs that follow a request across microservices, structured logs become the narrative backbone of observability. This guide covers why structure matters, field design, levels and semantics, propagation through distributed calls, aggregation pipelines, security and cost controls, and a production checklist.

Unstructured vs structured: the parsing problem

An unstructured log line looks like:

2026-06-08 14:22:01 ERROR Payment failed for user 48291 order ord_7x2 timeout after 30s

A human reads it fine. A machine must regex-extract user, order, and timeout — and breaks when someone changes the wording. A structured equivalent:

{
  "timestamp": "2026-06-08T14:22:01.342Z",
  "level": "error",
  "message": "payment_failed",
  "user_id": "48291",
  "order_id": "ord_7x2",
  "reason": "upstream_timeout",
  "duration_ms": 30012,
  "service": "payment-api",
  "trace_id": "a1b2c3d4e5f6"
}

Every field is queryable: reason=upstream_timeout AND service=payment-api returns exactly the events you need. Dashboards can chart timeout rates per service without brittle regex maintenance. The upfront cost is discipline in field naming and library choice; the payoff compounds as traffic and service count grow.
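
As a minimal sketch of how such a record could be produced, the following uses only Python's standard logging module. The JsonFormatter class, the "fields" key passed through extra, and the hard-coded service name are assumptions made for illustration; in practice a library such as structlog would usually do this work.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each LogRecord as one JSON object on a single line."""

    def format(self, record: logging.LogRecord) -> str:
        timestamp = (
            datetime.fromtimestamp(record.created, tz=timezone.utc)
            .isoformat(timespec="milliseconds")
            .replace("+00:00", "Z")
        )
        payload = {
            "timestamp": timestamp,
            # Note: the stdlib calls the warn level "WARNING", so it renders as "warning".
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": "payment-api",  # assumed constant for this sketch
        }
        # Event-specific fields arrive via extra={"fields": {...}} (a convention of this sketch).
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload)


handler = logging.StreamHandler(sys.stdout)
handler.setFormatter(JsonFormatter())
logger = logging.getLogger("payment-api")
logger.addHandler(handler)
logger.setLevel(logging.INFO)
logger.propagate = False  # avoid duplicate output through the root logger

logger.error(
    "payment_failed",
    extra={"fields": {"user_id": "48291", "order_id": "ord_7x2",
                      "reason": "upstream_timeout", "duration_ms": 30012}},
)
```

The event name stays constant while the variable parts travel as separate keys, which is what makes the reason=upstream_timeout query possible.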

Core fields every log record should carry

Consistency across services matters more than any single schema. Adopt a baseline envelope every team uses:

timestamp — ISO 8601 UTC with millisecond precision; never rely on ingestion time alone.
level — debug, info, warn, error (see below for semantics).
message — a stable, snake_case event name (order_created), not a free-form sentence.
service — logical service name matching your deployment unit.
environment — prod, staging, dev; prevents staging noise in prod dashboards.
trace_id — ties into distributed tracing (W3C Trace Context).
span_id — optional; identifies the current operation within a trace.
request_id / correlation_id — client-facing ID for support lookups.

Event-specific fields attach as siblings: user_id, order_id, latency_ms, http_status. Avoid nesting deeply — flat JSON indexes faster in Elasticsearch and Loki. Use dot notation only when your aggregator expects it (http.method, db.statement_hash).

Cardinality warning: never put unbounded values (raw URLs with query strings, full SQL text, stack traces) into indexed label positions or into the message field used for grouping. Hash or truncate them, and put the full detail in a separate detail field excluded from metric aggregation.
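
One way to apply the hash-or-truncate advice is sketched below. The helper names, the http_path key, the 16-character hash prefix, and the 2000-character cap are all assumptions chosen for illustration, not recommendations from any particular aggregator.

```python
import hashlib
from urllib.parse import urlsplit

MAX_DETAIL_CHARS = 2000  # assumed cap on the free-form detail field


def url_fields(raw_url: str) -> dict:
    """Keep the path for grouping; move the full URL into detail."""
    parts = urlsplit(raw_url)
    return {
        # The path without the query string. Paths containing IDs
        # (/orders/123) would still need templating to stay bounded.
        "http_path": parts.path,
        "detail": raw_url[:MAX_DETAIL_CHARS],
    }


def sql_fields(sql: str) -> dict:
    """Group by a short hash of the statement; keep truncated text in detail."""
    normalized = " ".join(sql.split())  # collapse whitespace so formatting does not change the hash
    digest = hashlib.sha256(normalized.encode("utf-8")).hexdigest()
    return {
        "db.statement_hash": digest[:16],
        "detail": normalized[:MAX_DETAIL_CHARS],
    }
```

Note that a raw URL can itself carry tokens in its query string, so the detail field still needs the redaction rules discussed later in this guide.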

Log levels: semantics that teams actually agree on

Inconsistent levels make alerts useless. Define and document:

debug — verbose internals useful during development; off in production unless temporarily enabled per-request via a debug flag.
info — normal lifecycle events: server started, request completed, job finished. Not every function entry.
warn — recoverable anomalies: retry succeeded, deprecated API called, slow query above threshold. No pages.
error — operation failed; user impact or data loss possible. Page if rate exceeds SLO.

A common mistake is logging every HTTP 404 as error. A missing favicon is info or omitted entirely. Reserve error for conditions your on-call engineer should investigate. Pair levels with SLO-based alerting: alert on error rates and error-budget burn, not single events.
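
The 404 rule can be encoded once in middleware rather than left to each handler. The sketch below is one possible mapping; the function name, the security_relevant flag, and the choice to escalate security-relevant client errors to warn (some teams would choose error) are assumptions.

```python
def level_for_status(status: int, security_relevant: bool = False) -> str:
    """Map an HTTP response status to a log level."""
    if status >= 500:
        # The server failed the request: worth an on-call engineer's attention.
        return "error"
    if 400 <= status < 500:
        # Client errors are normal traffic unless they hint at abuse
        # (for example repeated 401/403 responses against one account).
        return "warn" if security_relevant else "info"
    return "info"
```

With this in place, level_for_status(404) yields info and level_for_status(503) yields error, so a missing favicon never contributes to the error rate.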

Structured exceptions: when logging an error, include error.type, error.message, and a truncated error.stack field rather than pasting the entire stack into the message string. Libraries like pino (Node), structlog (Python), and zerolog (Go) can serialize exceptions into structured fields for you.
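
If your logging library does not handle this, the three fields can be built by hand. This is a minimal sketch using the standard traceback module; the error_fields name and the 4000-character cap are assumptions.

```python
import traceback

MAX_STACK_CHARS = 4000  # assumed cap; tune to your store's field limits


def error_fields(exc: BaseException) -> dict:
    """Turn an exception into flat, queryable log fields."""
    stack = "".join(
        traceback.format_exception(type(exc), exc, exc.__traceback__)
    )
    return {
        "error.type": type(exc).__name__,
        "error.message": str(exc),
        # Keep the tail of the trace: the frame that raised is at the end.
        "error.stack": stack[-MAX_STACK_CHARS:],
    }


try:
    raise TimeoutError("upstream timed out after 30s")
except TimeoutError as exc:
    fields = error_fields(exc)
```

Grouping on error.type then stays stable even when the message text or stack depth varies between occurrences.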

Correlation IDs and request context propagation

A single user checkout touches the API gateway, cart service, inventory service, and payment processor. Without a shared identifier, you cannot reconstruct the story. The fix:

Generate a correlation_id at the edge (API gateway or first service) if the client did not supply one.
Accept incoming X-Request-ID or W3C traceparent headers; do not regenerate silently.
Store the ID in request-scoped context (async local storage in Node, contextvars in Python, MDC in Java).
Attach it to every log line and outbound HTTP/gRPC call automatically via middleware.
Return the correlation ID in error responses so support can search logs instantly.
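
The steps above can be sketched in Python with contextvars and a logging filter. The function names, the exact header casing, and the plain dict standing in for request headers are assumptions; a real web framework would supply its own middleware hooks and case-insensitive header access.

```python
import contextvars
import logging
import uuid

# Request-scoped storage: each request (task or thread) sees its own value.
correlation_id_var = contextvars.ContextVar("correlation_id", default=None)


def begin_request(headers: dict) -> str:
    """Reuse the caller's X-Request-ID if present; otherwise generate one."""
    cid = headers.get("X-Request-ID") or uuid.uuid4().hex
    correlation_id_var.set(cid)
    return cid


class CorrelationFilter(logging.Filter):
    """Stamp every record with the current correlation ID."""

    def filter(self, record: logging.LogRecord) -> bool:
        record.correlation_id = correlation_id_var.get()
        return True  # never drop the record, only annotate it


def outbound_headers() -> dict:
    """Headers to attach to downstream HTTP calls."""
    cid = correlation_id_var.get()
    return {"X-Request-ID": cid} if cid else {}
```

Attaching CorrelationFilter to the handler once means individual log calls never have to pass the ID explicitly.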

Trace vs correlation: a trace_id spans the full distributed tree (parent and child spans); a correlation_id is often a single flat ID per user-facing request. In practice, OpenTelemetry unifies them — inject the active span's trace_id into log records so logs and traces link in Grafana or Datadog with one click.

For background jobs and message consumers, propagate context from the message envelope (Kafka headers, SQS message attributes). A payment webhook handler should log the same order_id and trace_id as the API that created the order.
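
For consumers, the trace context usually has to be pulled out of the message headers by hand. The sketch below parses a W3C traceparent value (version, 32-hex trace ID, 16-hex parent ID, flags) from a headers dict. The function name and the assumption that headers have already been converted to a dict are illustrative, and this is not a full validator of the specification.

```python
import re

# traceparent layout: version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(
    r"^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$"
)


def context_from_message(headers: dict) -> dict:
    """Extract log fields from a message's traceparent header, if valid."""
    raw = headers.get("traceparent")
    if isinstance(raw, bytes):  # queue clients often deliver header values as bytes
        raw = raw.decode("ascii", errors="replace")
    match = _TRACEPARENT.match(raw or "")
    if not match:
        return {}
    _version, trace_id, parent_id, _flags = match.groups()
    # All-zero IDs are defined as invalid by the Trace Context specification.
    if set(trace_id) == {"0"} or set(parent_id) == {"0"}:
        return {}
    return {"trace_id": trace_id, "parent_span_id": parent_id}
```

The returned dict can be merged into every log record the handler emits, so the consumer's lines join the same trace as the producer's.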

Centralized aggregation: from stdout to searchable store

Twelve-factor apps log to stdout; the platform ships logs elsewhere. Typical pipeline:

App → stdout (JSON lines) → agent (Fluent Bit, Vector, Promtail) → store (Elasticsearch, Loki, CloudWatch) → UI (Kibana, Grafana)

Elasticsearch + Kibana (ELK/EFK): full-text search, rich aggregations, higher cost and ops burden. Index templates map your JSON fields; ILM policies rotate hot/warm/cold tiers and delete old indices.

Grafana Loki: indexes labels (service, level, environment) not full text — cheaper at volume, pairs naturally with Grafana metrics and traces. LogQL queries filter by labels then grep content.

Cloud-native: AWS CloudWatch Logs, GCP Cloud Logging, Azure Monitor — managed, per-GB pricing, good enough until query needs outgrow them.

Whichever store you pick, enforce one JSON object per line (NDJSON). Multi-line stack traces must be escaped inside the JSON string, not split across lines — otherwise parsers treat each stack frame as a separate broken record.
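
A standard JSON encoder already performs this escaping, as the short sketch below shows with Python's json module. The to_ndjson_line helper and the sample stack text are made up for illustration.

```python
import json


def to_ndjson_line(record: dict) -> str:
    """Serialize one record as a single line of JSON."""
    # json.dumps escapes newlines inside strings as \n, so the output
    # never contains a raw line break.
    return json.dumps(record)


stack = (
    "Traceback (most recent call last):\n"
    '  File "pay.py", line 12, in charge\n'
    "TimeoutError: upstream timed out"
)
line = to_ndjson_line({"message": "payment_failed", "error.stack": stack})
```

The multi-line trace survives a round trip through json.loads intact, while the shipped line remains a single record for the agent.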

Security, compliance, and cost control

Logs are a data breach waiting to happen if you dump secrets and PII freely.

Never log: passwords, API keys, session tokens, full credit card numbers, raw auth headers.
Redact at source: configure log libraries with deny-lists; scrub Authorization headers in middleware.
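Hash identifiers when you need correlation but not reversibility (email → SHA-256 prefix); prefer a keyed hash such as HMAC-SHA-256, because a plain hash of a guessable value like an email can be reversed by hashing candidates.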
Retention policies: 7–30 days hot for debugging, 90 days warm for compliance, delete beyond legal requirements.
Sampling: log 100% of errors, 1–10% of successful info events at high QPS. Head-based sampling at the edge; tail-based sampling (keep all slow/error traces) for tracing.
Rate limiting log volume: logging inside a tight retry loop can fill disks and exhaust your logging budget; cap the per-request log count.
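
The sampling and rate-limiting items above can be combined into a single emit decision. This is a sketch under stated assumptions: the 5% rate, the cap of 50 records per request, and the should_emit name are hypothetical, and the decision is made per event, whereas head-based sampling would decide once per request at the edge.

```python
import random

INFO_SAMPLE_RATE = 0.05      # assumed: keep 5% of info/debug events
MAX_LOGS_PER_REQUEST = 50    # assumed per-request cap for non-error records


def should_emit(level: str, emitted_so_far: int, rng=random.random) -> bool:
    """Decide whether one log event should be written."""
    if level == "error":
        return True  # errors are never sampled or capped in this sketch
    if emitted_so_far >= MAX_LOGS_PER_REQUEST:
        return False  # a runaway loop stops producing output here
    if level == "warn":
        return True
    # info and debug: keep a random fraction
    return rng() < INFO_SAMPLE_RATE
```

Passing rng as a parameter keeps the function deterministic in tests. The caller is expected to track emitted_so_far in request-scoped state.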

GDPR and similar regimes may classify IP addresses and user IDs as personal data. Document what you log, why, and for how long in your privacy policy. See secrets management for keeping credentials out of config and logs entirely.
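
Redaction at source might look like the following sketch, which applies a deny-list and hashes selected identifiers before a record is serialized. The key names, the [REDACTED] marker, and the hard-coded HASH_SECRET are assumptions; a real service would load the secret from its secret store. The sketch uses HMAC-SHA-256 rather than a bare SHA-256 prefix, because unkeyed hashes of guessable values such as emails can be matched by hashing candidates.

```python
import hashlib
import hmac

DENY_KEYS = {"password", "api_key", "authorization", "session_token", "card_number"}
HASH_KEYS = {"email"}
HASH_SECRET = b"replace-me"  # placeholder; load from a secret manager in practice


def redact(fields: dict) -> dict:
    """Return a copy of fields that is safe to log."""
    clean = {}
    for key, value in fields.items():
        lowered = key.lower()
        if lowered in DENY_KEYS:
            clean[key] = "[REDACTED]"
        elif lowered in HASH_KEYS:
            digest = hmac.new(
                HASH_SECRET, str(value).encode("utf-8"), hashlib.sha256
            ).hexdigest()
            # A short prefix is enough to correlate events for one user.
            clean[key + "_hash"] = digest[:16]
        else:
            clean[key] = value
    return clean
```

A deny-list only catches keys it knows about, so it works best alongside a field catalog that rejects unknown keys.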

Performance: logging should not block requests

Synchronous disk I/O on the request path adds latency. Production patterns:

Async/buffered writers — batch writes to stdout; flush on process exit via graceful shutdown hooks.
Lazy evaluation — pass a function, not a pre-built string, so debug logs cost nothing when disabled.
Avoid logging in hot loops — aggregate counters in memory, log summaries per batch.
Serialize once — build the JSON object once per event; do not stringify large payloads unless the level permits.
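
The lazy-evaluation item above can be sketched with the standard logging module's isEnabledFor check. The debug_lazy helper and the "fields" convention are assumptions for illustration; several logging libraries offer their own equivalents.

```python
import logging

logger = logging.getLogger("lazy-demo")
logger.setLevel(logging.INFO)  # debug is disabled, as in production


def debug_lazy(log: logging.Logger, event: str, build_fields) -> bool:
    """Log at debug level, calling build_fields only if debug is enabled."""
    if not log.isEnabledFor(logging.DEBUG):
        return False  # the expensive callable is never invoked
    log.debug(event, extra={"fields": build_fields()})
    return True


calls = []


def expensive_snapshot() -> dict:
    """Stand-in for costly work such as serializing a large object."""
    calls.append(1)
    return {"cart_items": 3}


emitted = debug_lazy(logger, "cart_state", expensive_snapshot)
```

With the level at INFO, expensive_snapshot is never called, so the disabled debug line costs only one level check.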

Benchmark your logger under load. A poorly configured JSON serializer can consume more CPU than the business logic it describes.

Structured logging vs metrics vs traces

Signal	Best for	Weak at
Logs	Specific event detail, audit trail, debugging one request	Aggregating rates across millions of events (expensive)
Metrics	Dashboards, alerting on rates/latency percentiles	Explaining why one request was slow
Traces	End-to-end latency breakdown across services	Business context (order amount, user tier)

Use all three. Exemplars link metric data points to trace IDs; log records carry the same trace_id so you pivot from a spike in error rate (metric) to representative traces to individual log lines with full context.

Common mistakes

String interpolation in messages — log.info(f"User {id} paid") defeats structured search; use log.info("payment_completed", user_id=id).
Inconsistent field names — userId in one service, user_id in another breaks cross-service queries.
Inconsistent levels for paired success and failure events: if order_created is info, order_failed should be error and carry the same field set.
No schema governance — publish a log field catalog; CI lint warns on unknown fields or missing required envelope keys.
Treating logs as a database — logs are for operational debugging, not financial reporting; use an event store or warehouse for analytics.
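
The schema-governance item above could be enforced with a small lint step in CI. In this sketch the required keys follow the envelope described earlier, while the KNOWN_KEYS catalog and the lint_log_line name are hypothetical stand-ins for a team's published field catalog.

```python
import json

REQUIRED_KEYS = {"timestamp", "level", "message", "service", "environment", "trace_id"}
# Hypothetical catalog of approved optional fields.
KNOWN_KEYS = REQUIRED_KEYS | {
    "span_id", "request_id", "user_id", "order_id", "latency_ms", "http_status",
}


def lint_log_line(line: str) -> list:
    """Return a list of schema problems for one JSON log line."""
    keys = set(json.loads(line))
    problems = [f"missing required key: {k}" for k in sorted(REQUIRED_KEYS - keys)]
    problems += [f"unknown field: {k}" for k in sorted(keys - KNOWN_KEYS)]
    return problems
```

A line that uses userId instead of user_id is reported as an unknown field, which catches the naming drift described above before it reaches production.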

Production checklist

Emit JSON (or equivalent key-value) lines to stdout in all services.
Define a shared envelope schema: timestamp, level, message, service, environment, trace_id.
Propagate correlation/trace IDs from edge through HTTP headers and message queues.
Document level semantics; ban logging 4xx client errors as error unless security-relevant.
Redact secrets and PII at the logger or middleware layer.
Ship logs via an agent to a centralized store with retention and ILM policies.
Sample high-volume info logs; never sample errors.
Link logs to traces (shared trace_id) and metrics (exemplars or shared labels).
Alert on error rate SLO burn, not individual log lines.
Test log output in CI — snapshot a golden JSON line per critical event type.
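
The last checklist item can be as small as one assertion per event type. The sketch below assumes a hypothetical render_event helper with an injected timestamp so the output is deterministic; sorting keys keeps the golden line stable across code changes that reorder fields.

```python
import json


def render_event(timestamp: str, level: str, message: str, **fields) -> str:
    """Serialize one event with sorted keys so the output is reproducible."""
    record = {"timestamp": timestamp, "level": level, "message": message, **fields}
    return json.dumps(record, sort_keys=True)


# The golden line is checked in next to the test and reviewed like code.
GOLDEN_ORDER_CREATED = (
    '{"level": "info", "message": "order_created", "order_id": "ord_7x2", '
    '"service": "payment-api", "timestamp": "2026-06-08T14:22:01.342Z"}'
)


def test_order_created_matches_golden() -> None:
    line = render_event(
        "2026-06-08T14:22:01.342Z", "info", "order_created",
        order_id="ord_7x2", service="payment-api",
    )
    assert line == GOLDEN_ORDER_CREATED
```

A renamed or dropped field then fails the build instead of silently breaking a dashboard query.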

Key takeaways

Structured logging makes logs machine-queryable — fixed fields instead of prose paragraphs.
Correlation and trace IDs stitch distributed requests into one debuggable story.
Level discipline keeps alerts meaningful; errors mean someone should act.
Security and sampling control cost and compliance without losing incident visibility.
Logs complement metrics and traces — use each for what it does best.