Health checks, liveness and readiness probes explained

A health check is a lightweight request that answers two related questions: is this instance safe to keep running, and should it receive traffic? Orchestrators like Kubernetes, cloud load balancers, and service meshes all poll these endpoints on a schedule. Get the semantics wrong and you get restart loops that amplify outages, pods that serve 500s while marked healthy, or deploys that never finish because readiness never flips green. This guide separates liveness, readiness, and startup probes, explains shallow vs deep checks, covers timing and failure thresholds, ties health checks to graceful shutdown and load balancer routing, and ends with a production checklist grounded in observability practice.

Three probe types, three different jobs

Kubernetes popularized the vocabulary, but the concepts apply anywhere an orchestrator manages replicas behind a load balancer. Each probe type has a distinct action when it fails — confusing them is the most common production mistake.

Liveness probe: is the process stuck?

A liveness probe asks whether the main process is alive and making forward progress. Failure triggers a restart of the container. Reserve liveness for conditions that only a restart can fix: deadlocks, infinite loops, or memory exhaustion that leave the process running but unable to recover. Do not put dependency checks (database, cache, third-party API) on liveness: a transient Postgres blip would restart every pod simultaneously, making the outage worse.

Readiness probe: should this instance get traffic?

A readiness probe gates traffic. When it fails, the pod stays running but is removed from Service endpoints and load balancer target groups. This is where you verify dependencies: can the app reach the database, has the cache warmed, is the feature flag client initialized? Readiness failure during a rolling deploy is normal — the old pod drains while the new one warms up.

Startup probe: slow boot without premature kills

Cold starts (JVM heap allocation, loading ML models, running database migrations) can exceed default liveness timeouts. A startup probe disables liveness and readiness checks until it succeeds once, giving the process a longer budget to boot. After startup passes, liveness and readiness take over. Without a startup probe, Kubernetes may restart a pod that would have become healthy ten seconds later.

What to check: shallow, deep, and dependency tiers

Most teams expose one or more HTTP routes. Naming varies; semantics matter more than the path string.

Shallow (process-level) checks

Return 200 OK if the event loop is responsive and the process has not deadlocked. Typical liveness implementation: a handler that returns {"status":"ok"} without touching external systems. Cost should be microseconds — these run every few seconds on every replica.
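A minimal sketch of such a shallow handler, using only the Python standard library. The /healthz path follows the naming used later in this guide; the port 8080 and the names liveness_response, HealthHandler, and serve are illustrative assumptions, not part of any framework.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def liveness_response():
    """Build the liveness reply without touching any external system."""
    body = json.dumps({"status": "ok"}, separators=(",", ":")).encode("utf-8")
    return 200, body


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path == "/healthz":
            status, body = liveness_response()
        else:
            status, body = 404, b'{"status":"not found"}'
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        # Probes fire every few seconds; keep them out of the access log.
        pass


def serve(port=8080):
    """Blocking entry point; the port is an assumption for this sketch."""
    ThreadingHTTPServer(("0.0.0.0", port), HealthHandler).serve_forever()
```

Calling serve() starts the listener. Because the handler does no I/O beyond writing the response, its cost stays negligible even at a probe every few seconds per replica.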

Deep (dependency-level) checks

Readiness endpoints often run a deep health check: ping the primary database with SELECT 1, verify Redis responds to PING, confirm an object store bucket is reachable. Keep each dependency check fast (sub-100 ms with tight timeouts) and parallelize them. Return structured JSON listing each dependency and its status so on-call engineers see which link broke:

{"status":"degraded","checks":{"postgres":"ok","redis":"timeout"}}

Decide policy explicitly: does one degraded dependency fail readiness entirely, or do you serve partial functionality? Document the rule — "Redis down → fail readiness" vs "Redis down → serve stale cache" changes user impact.
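The parallel checks and the explicit policy can be sketched as follows. This is one possible shape, not a reference implementation: run_checks, readiness, the shared 100 ms deadline, and the required set are assumptions for illustration, and each check is any callable that returns True on success.

```python
import json
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_checks(checks, timeout_s=0.1):
    """Run every check in parallel; map each name to "ok", "fail", or "timeout"."""
    pool = ThreadPoolExecutor(max_workers=max(1, len(checks)))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    deadline = time.monotonic() + timeout_s  # one shared deadline for all checks
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = "ok" if future.result(timeout=remaining) else "fail"
        except FutureTimeout:
            results[name] = "timeout"
        except Exception:
            results[name] = "fail"
    # Do not wait for stragglers. A timed-out check keeps running in its
    # thread, so each check should also set its own client-side timeout.
    pool.shutdown(wait=False)
    return results


def readiness(results, required):
    """Policy: only a failing dependency listed in `required` fails readiness."""
    all_ok = all(state == "ok" for state in results.values())
    required_ok = all(results.get(name) == "ok" for name in required)
    body = {"status": "ok" if all_ok else "degraded", "checks": results}
    return (200 if required_ok else 503), json.dumps(body, separators=(",", ":"))
```

With required={"postgres"}, a Redis timeout yields a 200 with a "degraded" body (serve stale cache); adding "redis" to the set turns the same result into a 503 (fail readiness). Keeping the policy in one argument makes the documented rule and the code hard to drift apart.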

Separate metrics from health

Do not overload health endpoints with business metrics (queue depth, error rate). Those belong in Prometheus gauges and alerting rules tied to SLO error budgets. Health checks answer binary routing questions; metrics answer trend questions.

Kubernetes probe configuration

Probes support HTTP GET, TCP socket, gRPC, or exec commands. HTTP is the usual choice for web services. Key fields in the pod spec:

initialDelaySeconds: wait before the first probe (avoids false failures during boot).
periodSeconds: interval between probes (typically 5–10 s for liveness, 5 s for readiness).
timeoutSeconds: per-probe deadline (1–3 s for shallow; up to 5 s for deep readiness if parallelized).
failureThreshold: consecutive failures before action (3 failures × 10 s period ≈ 30 s before restart).
successThreshold: consecutive successes to flip back to healthy (must be 1 for liveness and startup; sometimes 2 for readiness to avoid flapping).
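To see how the two thresholds damp flapping, here is a simplified model of the consecutive-result counting. It is a sketch for reasoning about settings, not the kubelet's actual code; the function name simulate and its parameters are hypothetical.

```python
def simulate(results, failure_threshold=3, success_threshold=1, initially_ready=True):
    """Replay probe results (True = success) and return the state after each probe."""
    ready = initially_ready
    ok_streak = 0
    fail_streak = 0
    states = []
    for ok in results:
        if ok:
            ok_streak += 1
            fail_streak = 0
            # Flip to ready only after enough consecutive successes.
            if not ready and ok_streak >= success_threshold:
                ready = True
        else:
            fail_streak += 1
            ok_streak = 0
            # Flip to not ready only after enough consecutive failures.
            if ready and fail_streak >= failure_threshold:
                ready = False
        states.append(ready)
    return states
```

With failure_threshold=2 and success_threshold=2, a single failed probe between successes never changes the state; only two failures in a row remove the instance, and two successes in a row bring it back.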

Example timing for a typical API

Liveness: HTTP /healthz, shallow, periodSeconds: 10, failureThreshold: 3. Three missed heartbeats before restart.
Readiness: HTTP /readyz, deep, periodSeconds: 5, failureThreshold: 2. Removed from Service after ~10 s of dependency failure.
Startup: same path as liveness, periodSeconds: 10, failureThreshold: 30. Up to five minutes for slow boot, then hand off.
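The budgets in this example are simple products of period and threshold. The sketch below writes the three probes as plain dictionaries that mirror the pod spec field names and computes the approximate time to action. Port 8080 and the helper name failure_budget_seconds are assumptions, and the result ignores timeoutSeconds and scheduling jitter.

```python
# Probe settings from the example above, keyed like the pod spec fields.
PROBES = {
    "livenessProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},  # port is an assumption
        "periodSeconds": 10,
        "failureThreshold": 3,
    },
    "readinessProbe": {
        "httpGet": {"path": "/readyz", "port": 8080},
        "periodSeconds": 5,
        "failureThreshold": 2,
    },
    "startupProbe": {
        "httpGet": {"path": "/healthz", "port": 8080},
        "periodSeconds": 10,
        "failureThreshold": 30,
    },
}


def failure_budget_seconds(probe):
    """Approximate seconds of continuous failure before the probe triggers its action."""
    return probe["periodSeconds"] * probe["failureThreshold"]


def budgets(probes=PROBES):
    """Return the failure budget for every probe, keyed by probe name."""
    return {name: failure_budget_seconds(spec) for name, spec in probes.items()}
```

Here budgets() gives 30 s for liveness, 10 s for readiness, and 300 s (five minutes) for startup, matching the figures in the prose.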

HTTP status codes and bodies

Return 200 for healthy/ready, 503 Service Unavailable for not ready. Some teams use 500 on liveness failure; either works if consistent, because the kubelet treats any status from 200 to 399 as success and anything else as failure. Avoid redirects (301/302) on probe paths: the kubelet may follow the redirect and hit the wrong handler.

Load balancer and service mesh integration

Cloud load balancers (ALB, NLB, GCP HTTP LB) run their own health checks independent of Kubernetes. A pod can pass kubelet readiness while the ALB still marks the target unhealthy if paths, ports, or security groups differ.

Align paths and ports: the ALB health check path should match readiness (often /readyz or /health).
Healthy threshold vs interval: ALB defaults (a 30 s interval and several consecutive successes before a target counts as healthy) can delay traffic far longer than Kubernetes readiness. Tune for your deploy speed.
Connection draining: when readiness fails, in-flight requests should still complete. Pair probe failure with a preStop hook so the LB stops sending new connections before the container receives SIGTERM.
Service mesh sidecars: Envoy may expose a separate admin health endpoint; ensure app readiness reflects app dependencies, not just sidecar bootstrap.
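On the application side, connection draining usually means failing readiness as soon as shutdown begins while continuing to serve in-flight requests. A minimal sketch under that assumption follows; the Lifecycle class and install_handlers function are hypothetical names, and the actual drain wait and server shutdown are left out.

```python
import signal
import threading


class Lifecycle:
    """Tracks whether this instance has started shutting down."""

    def __init__(self):
        self._draining = threading.Event()

    def begin_drain(self, signum=None, frame=None):
        # Signature matches a signal handler so it can be registered directly.
        self._draining.set()

    def is_draining(self):
        return self._draining.is_set()

    def readiness_status(self, dependencies_ok=True):
        """HTTP status for the readiness endpoint: 503 once draining has begun."""
        if self._draining.is_set() or not dependencies_ok:
            return 503
        return 200


lifecycle = Lifecycle()


def install_handlers():
    """Call once from the main thread at startup."""
    signal.signal(signal.SIGTERM, lifecycle.begin_drain)
```

After SIGTERM, the readiness endpoint returns 503 so routing layers stop sending new connections, while the process keeps running until in-flight work finishes or the termination grace period expires. Liveness should keep returning 200 during this window so the container is not restarted mid-drain.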

Failure modes and anti-patterns

Restart storms from dependency checks on liveness

A Postgres check on liveness can take down an entire fleet during a 60-second database failover: every pod restarts, reopens its connection pool, and amplifies load on the recovering primary. Move dependency checks to readiness only.

Probe flapping during GC pauses

A stop-the-world GC pause longer than timeoutSeconds fails a liveness probe, and a pause that spans failureThreshold consecutive probes restarts the JVM right when it was about to finish collecting. Raise timeoutSeconds or failureThreshold slightly, cover boot-time pauses with a startup probe budget, or switch to a TCP socket liveness probe that only verifies the port accepts connections.
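For intuition about what a TCP socket probe actually verifies, the check amounts to a connect attempt with a deadline. The sketch below is an illustration in Python, not the kubelet's implementation, and tcp_port_open is a hypothetical name.

```python
import socket


def tcp_port_open(host, port, timeout_s=1.0):
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, unreachable hosts, and timeouts.
        return False
```

The operating system completes the TCP handshake for a listening socket even while the application is paused, which is why this style of check tolerates GC pauses. The trade-off is that it also cannot detect a deadlocked request handler, so it is the weaker of the two liveness signals.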

Health endpoint on the public internet

Exposing deep readiness JSON publicly leaks infrastructure details (hostnames, internal service names). Bind admin/health routes to a separate port or restrict by network policy; load balancers reach them on internal interfaces.

Shared health state across replicas

Writing probe results to a shared cache creates false positives — one replica marks all as healthy. Health must be evaluated per instance.

Ignoring startup ordering in init containers

If migrations run in an init container, readiness should not pass until the app version matches the schema version. Otherwise traffic hits code expecting columns that do not exist yet.
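One way to enforce this is a readiness check that compares the schema version the code expects with the version recorded in the database. The sketch below uses an in-process SQLite connection as a stand-in for the real database; the schema_migrations table, its version column, and the constant EXPECTED_SCHEMA_VERSION are hypothetical, and the "at least as new" comparison assumes migrations are backward compatible.

```python
import sqlite3

# Hypothetical: the schema version this build of the app was written against.
EXPECTED_SCHEMA_VERSION = 42


def schema_ready(conn, expected=EXPECTED_SCHEMA_VERSION):
    """Return True only if the database reports a schema at least as new as expected."""
    try:
        row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    except sqlite3.Error:
        # Missing table or unreachable database: not ready.
        return False
    # MAX() over an empty table yields NULL, which arrives here as None.
    return row[0] is not None and row[0] >= expected
```

Wiring schema_ready into the readiness endpoint keeps a new pod out of rotation until the migration it depends on has been applied, instead of letting the first user request discover the missing column.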

Probe type decision table

Scenario	Probe type	What to check	Failure action
Event loop deadlock	Liveness	Shallow HTTP / TCP port open	Restart container
Database unreachable	Readiness	`SELECT 1` with 1 s timeout	Remove from load balancer
JVM cold start (2 min)	Startup	Same as liveness, high failureThreshold	Suppress liveness until pass
Rolling deploy warmup	Readiness	Cache warm + dependency OK	Hold traffic until green
Memory leak (slow death)	Liveness + metrics alert	Shallow + OOM watch via Prometheus	Restart or scale
Dependency degraded but optional	Readiness (policy)	Partial check JSON	Team-defined: serve or drain

Implementation checklist

Expose separate /healthz (liveness, shallow) and /readyz (readiness, deep) endpoints.
Never put external dependency checks on liveness.
Add a startup probe if boot takes more than 30 seconds.
Parallelize dependency checks; enforce per-check timeouts under 2 s.
Return structured JSON on readiness failures for faster incident triage.
Align Kubernetes readiness with cloud load balancer health check paths.
Pair readiness failure with connection draining and preStop hooks.
Load-test probe overhead — thousands of replicas × 5 s interval adds up.
Alert on readiness failure rate, not individual blips during deploys.
Document degraded-mode policy for each dependency.

Key takeaways

Liveness restarts stuck processes; keep it shallow and dependency-free.
Readiness gates traffic — use it for database, cache, and warmup checks.
Startup probes protect slow-boot apps from premature liveness kills.
Align orchestrator and load balancer health paths so deploys do not flap between systems.
Health checks route traffic; metrics and SLOs tell you whether users are actually happy.