Guide
Health checks, liveness and readiness probes explained
A health check is a lightweight request that answers one question: is this instance safe to keep running, and should it receive traffic? Orchestrators like Kubernetes, cloud load balancers, and service meshes all poll these endpoints on a schedule. Get the semantics wrong and you get restart loops that amplify outages, pods that serve 500s while marked healthy, or deploys that never finish because readiness never flips green. This guide separates liveness, readiness, and startup probes, explains shallow vs deep checks, covers timing and failure thresholds, ties health checks to graceful shutdown and load balancer routing, and ends with a production checklist grounded in observability practice.
Three probe types, three different jobs
Kubernetes popularized the vocabulary, but the concepts apply anywhere an orchestrator manages replicas behind a load balancer. Each probe type has a distinct action when it fails — confusing them is the most common production mistake.
Liveness probe: is the process stuck?
A liveness probe asks whether the main process is alive and making forward progress. Failure triggers a restart of the container. Use liveness only for conditions a restart actually fixes: deadlocks, infinite loops, or resource exhaustion that leaves the process running but unable to do useful work. Do not put dependency checks (database, cache, third-party API) on liveness: a transient Postgres blip would restart every pod simultaneously, making the outage worse.
Readiness probe: should this instance get traffic?
A readiness probe gates traffic. When it fails, the pod stays running but is removed from Service endpoints and load balancer target groups. This is where you verify dependencies: can the app reach the database, has the cache warmed, is the feature flag client initialized? Readiness failure during a rolling deploy is normal — the old pod drains while the new one warms up.
Startup probe: slow boot without premature kills
Cold starts (JVM warmup, loading ML models, running database migrations) can exceed the default liveness failure budget. A startup probe holds off both liveness and readiness checks until it succeeds once, giving the process a longer budget to boot. After startup passes, liveness and readiness take over. Without a startup probe, Kubernetes may restart a pod that would have become healthy ten seconds later.
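The effect of a startup probe on a slow boot can be illustrated with a deliberately simplified model. The function name and the numbers are assumptions for illustration; real kubelet timing also depends on timeoutSeconds and probe scheduling.

```python
def survives_boot(boot_seconds, liveness_budget, startup_budget=None):
    """Return True if the container finishes booting before it is restarted.

    Simplified model: a budget is periodSeconds * failureThreshold.
    """
    if startup_budget is not None:
        # Liveness is held off until startup succeeds, so only the
        # startup budget limits how long boot may take.
        return boot_seconds <= startup_budget
    return boot_seconds <= liveness_budget


# A 40 s boot against a 30 s liveness budget (10 s period x 3 failures).
print(survives_boot(40, liveness_budget=30))                      # False
# The same boot with a 300 s startup budget (10 s period x 30 failures).
print(survives_boot(40, liveness_budget=30, startup_budget=300))  # True
```

In this model the pod without a startup probe is restarted ten seconds before it would have become healthy, and it will hit the same wall on every restart.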
What to check: shallow, deep, and dependency tiers
Most teams expose one or more HTTP routes. Naming varies; semantics matter more than the path string.
Shallow (process-level) checks
Return 200 OK if the event loop is responsive and the process
has not deadlocked. Typical liveness implementation: a handler that returns
{"status":"ok"} without touching external systems. Cost should
be microseconds — these run every few seconds on every replica.
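A shallow liveness handler might look like the sketch below, written with Python's standard library. The helper names (liveness_response, make_server) and the loopback bind address are assumptions for illustration; the point is that the handler touches no external system.

```python
import json
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


def liveness_response():
    """Build the shallow response without touching any dependency."""
    return 200, json.dumps({"status": "ok"}).encode()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if self.path != "/healthz":
            self.send_error(404)
            return
        status, body = liveness_response()
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass  # keep probe traffic out of the request log


def make_server(port=8080):
    # Hypothetical helper: call make_server().serve_forever() to run it.
    return ThreadingHTTPServer(("127.0.0.1", port), HealthHandler)
```

If this handler stops answering, the process itself is stuck, which is exactly the condition a restart can fix.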
Deep (dependency-level) checks
Readiness endpoints often run a deep health check: ping the
primary database with SELECT 1, verify Redis responds to
PING, confirm an object store bucket is reachable. Keep each
dependency check fast (sub-100 ms with tight timeouts) and parallelize them.
Return structured JSON listing each dependency and its status so on-call
engineers see which link broke:
{"status":"degraded","checks":{"postgres":"ok","redis":"timeout"}}
Decide policy explicitly: does one degraded dependency fail readiness entirely, or do you serve partial functionality? Document the rule — "Redis down → fail readiness" vs "Redis down → serve stale cache" changes user impact.
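The deep check and the explicit policy described above can be sketched together. The check functions are placeholders, and the rule "Postgres is required, Redis is optional" is a hypothetical policy chosen only to show the shape of the decision.

```python
import concurrent.futures as cf
import json
import time


def check_postgres():
    # Placeholder: a real check would run SELECT 1 with a client-side timeout.
    return "ok"


def check_redis():
    # Placeholder that simulates a dependency slower than the probe budget.
    time.sleep(0.5)
    return "ok"


def run_checks(checks, timeout_s=0.2):
    """Run all checks concurrently under one shared deadline."""
    pool = cf.ThreadPoolExecutor(max_workers=len(checks))
    futures = {name: pool.submit(fn) for name, fn in checks.items()}
    deadline = time.monotonic() + timeout_s
    results = {}
    for name, future in futures.items():
        remaining = max(0.0, deadline - time.monotonic())
        try:
            results[name] = future.result(timeout=remaining)
        except cf.TimeoutError:
            results[name] = "timeout"
        except Exception:
            results[name] = "error"
    pool.shutdown(wait=False)  # do not hold the probe response for stragglers
    return results


def readiness(results, required=("postgres",)):
    """Hypothetical policy: only required dependencies fail readiness."""
    if any(results.get(name) != "ok" for name in required):
        return 503, {"status": "not_ready", "checks": results}
    if any(value != "ok" for value in results.values()):
        return 200, {"status": "degraded", "checks": results}
    return 200, {"status": "ok", "checks": results}


code, body = readiness(run_checks({"postgres": check_postgres, "redis": check_redis}))
print(code, json.dumps(body))
# 200 {"status": "degraded", "checks": {"postgres": "ok", "redis": "timeout"}}
```

The shared deadline bounds the probe response time, but it does not cancel a slow call in its worker thread, so real dependency clients still need their own socket timeouts.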
Separate metrics from health
Do not overload health endpoints with business metrics (queue depth, error rate). Those belong in Prometheus gauges and alerting rules tied to SLO error budgets. Health checks answer binary routing questions; metrics answer trend questions.
Kubernetes probe configuration
Probes support HTTP GET, TCP socket, gRPC, or exec commands. HTTP GET is the usual choice for web services. Key fields in the pod spec:
- initialDelaySeconds — wait before the first probe (avoid false failures during boot).
- periodSeconds — interval between probes (typically 5–10 s for liveness, 5 s for readiness).
- timeoutSeconds — per-probe deadline (1–3 s for shallow; up to 5 s for deep readiness if parallelized).
- failureThreshold — consecutive failures before action (3 failures × 10 s period = 30 s before restart).
- successThreshold — consecutive successes to flip ready (1 for liveness; sometimes 2 for readiness to avoid flapping).
Example timing for a typical API
Liveness: HTTP /healthz, shallow, periodSeconds: 10,
failureThreshold: 3 — three missed heartbeats before restart.
Readiness: HTTP /readyz, deep, periodSeconds: 5,
failureThreshold: 2 — removed from Service after ~10 s of
dependency failure. Startup: same path as liveness,
failureThreshold: 30 with periodSeconds: 10 — up
to five minutes for slow boot, then hand off.
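The timing above can be checked with a few lines of arithmetic. The sketch builds the three probes as dictionaries using the pod spec field names; the port (8080), the timeout values, and the helper names are assumptions, and the budget is approximate because it ignores timeoutSeconds.

```python
import json


def http_probe(path, period, failures, timeout=1, port=8080):
    """Build a probe fragment using Kubernetes field names."""
    return {
        "httpGet": {"path": path, "port": port},
        "periodSeconds": period,
        "timeoutSeconds": timeout,
        "failureThreshold": failures,
    }


probes = {
    "livenessProbe": http_probe("/healthz", period=10, failures=3),
    "readinessProbe": http_probe("/readyz", period=5, failures=2, timeout=3),
    "startupProbe": http_probe("/healthz", period=10, failures=30),
}


def failure_budget_seconds(probe):
    # Approximate time of sustained failure before the kubelet acts.
    return probe["periodSeconds"] * probe["failureThreshold"]


for name, probe in probes.items():
    print(name, failure_budget_seconds(probe))
# livenessProbe 30
# readinessProbe 10
# startupProbe 300

# Kubernetes accepts JSON manifests, so the fragment can be dumped directly.
print(json.dumps(probes, indent=2))
```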
HTTP status codes and bodies
Return 200 for healthy/ready, 503 Service Unavailable
for not ready. Some teams use 500 on liveness failure — either
works if consistent. Avoid redirects (301/302) on probe paths; kubelet follows
redirects and may hit the wrong handler.
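The reason 500 and 503 are interchangeable, and the reason redirects are risky, comes down to how an HTTP probe result is classified: Kubernetes treats any status from 200 up to but not including 400 as success. A minimal sketch (the function name is an assumption):

```python
def kubelet_counts_as_success(status_code):
    # HTTP probes succeed on any status >= 200 and < 400.
    return 200 <= status_code < 400


for code in (200, 302, 500, 503):
    print(code, kubelet_counts_as_success(code))
# 200 True
# 302 True
# 500 False
# 503 False
```

A 302 falls inside the success range, so a probe path that redirects can hide a failing handler.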
Load balancer and service mesh integration
Cloud load balancers (ALB, NLB, GCP HTTP LB) run their own health checks independent of Kubernetes. A pod can pass kubelet readiness while the ALB still marks the target unhealthy if paths, ports, or security groups differ.
- Align paths and ports — ALB health check path should match readiness (often /readyz or /health).
- Healthy threshold vs interval — ALB defaults (a 30 s interval and multiple consecutive successes) can delay traffic far longer than Kubernetes readiness. Tune for your deploy speed.
- Connection draining — when readiness fails, in-flight requests should complete. Pair probe failure with preStop hooks so the LB stops sending new connections before SIGTERM.
- Service mesh sidecars — Envoy may expose separate admin health; ensure app readiness reflects app dependencies, not just sidecar bootstrap.
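On the application side, connection draining usually means flipping readiness to failing as soon as shutdown begins, while in-flight requests finish. The sketch below shows one way to wire that up; the class and function names are assumptions, and a real service would also stop its server after a drain interval.

```python
import signal
import threading


class ReadinessGate:
    """Per-process flag that forces readiness to fail once draining starts."""

    def __init__(self):
        self._draining = threading.Event()

    def begin_drain(self):
        self._draining.set()

    def status_code(self, dependencies_ok=True):
        if self._draining.is_set() or not dependencies_ok:
            return 503
        return 200


def install_sigterm_handler(gate):
    # Must be called from the main thread; signal.signal enforces this.
    signal.signal(signal.SIGTERM, lambda signum, frame: gate.begin_drain())


gate = ReadinessGate()
print(gate.status_code())  # 200
gate.begin_drain()
print(gate.status_code())  # 503
```

A preStop hook that delays SIGTERM gives the load balancer time to notice the failing readiness check before the process exits.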
Failure modes and anti-patterns
Restart storms from dependency checks on liveness
Putting Postgres on liveness can kill an entire fleet during something as routine as a 60-second database failover. Every pod restarts, re-opens its connection pool, and amplifies load on the recovering primary. Move dependency checks to readiness only.
Probe flapping during GC pauses
A stop-the-world GC pause longer than timeoutSeconds fails a liveness probe. If the pauses span failureThreshold consecutive probes, the kubelet restarts the JVM right when it was about to finish collecting. Increase timeoutSeconds or failureThreshold slightly, add startup probe budget if the pauses happen during boot, or switch to TCP socket liveness that only verifies the port accepts connections.
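How failureThreshold absorbs short pauses can be shown with a small simulation. This is a simplified model of consecutive-failure counting, not the kubelet's implementation, and the function name is an assumption.

```python
def restarts_triggered(probe_results, failure_threshold=3):
    """Count restarts for an ordered list of probe outcomes (True = success)."""
    restarts = 0
    consecutive_failures = 0
    for ok in probe_results:
        if ok:
            consecutive_failures = 0
            continue
        consecutive_failures += 1
        if consecutive_failures >= failure_threshold:
            restarts += 1
            consecutive_failures = 0  # the container starts fresh
    return restarts


# A pause that swallows two probes is absorbed by a threshold of 3.
print(restarts_triggered([True, False, False, True]))     # 0
# The same pause restarts the container when the threshold is 2.
print(restarts_triggered([True, False, False, True], 2))  # 1
# Three missed probes in a row hit the default threshold.
print(restarts_triggered([True, False, False, False]))    # 1
```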
Health endpoint on the public internet
Exposing deep readiness JSON publicly leaks infrastructure details (hostnames, internal service names). Bind admin/health routes to a separate port or restrict by network policy; load balancers reach them on internal interfaces.
Shared health state across replicas
Writing probe results to a shared cache creates false positives — one replica marks all as healthy. Health must be evaluated per instance.
Ignoring startup ordering in init containers
If migrations run in an init container, readiness should not pass until the app version matches the schema version. Otherwise traffic hits code expecting columns that do not exist yet.
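One way to gate readiness on schema state is to compare the highest applied migration with the version the build expects. The sketch below uses an in-memory SQLite database so it runs anywhere; the schema_migrations table, the version constant, and the "at least" comparison are all assumptions about how migrations are tracked.

```python
import sqlite3

REQUIRED_SCHEMA_VERSION = 3  # hypothetical: the schema this build expects


def schema_is_ready(conn, required=REQUIRED_SCHEMA_VERSION):
    """Return True once the applied schema version reaches the required one."""
    row = conn.execute("SELECT MAX(version) FROM schema_migrations").fetchone()
    applied = row[0] if row and row[0] is not None else 0
    return applied >= required


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE schema_migrations (version INTEGER PRIMARY KEY)")
conn.executemany("INSERT INTO schema_migrations VALUES (?)", [(1,), (2,)])
print(schema_is_ready(conn))  # False: migration 3 has not run yet

conn.execute("INSERT INTO schema_migrations VALUES (3)")
print(schema_is_ready(conn))  # True
```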
Probe type decision table
| Scenario | Probe type | What to check | Failure action |
|---|---|---|---|
| Event loop deadlock | Liveness | Shallow HTTP / TCP port open | Restart container |
| Database unreachable | Readiness | SELECT 1 with 1 s timeout | Remove from load balancer |
| JVM cold start (2 min) | Startup | Same as liveness, high failureThreshold | Suppress liveness until pass |
| Rolling deploy warmup | Readiness | Cache warm + dependency OK | Hold traffic until green |
| Memory leak (slow death) | Liveness + metrics alert | Shallow + OOM watch via Prometheus | Restart or scale |
| Dependency degraded but optional | Readiness (policy) | Partial check JSON | Team-defined: serve or drain |
Implementation checklist
- Expose separate /healthz (liveness, shallow) and /readyz (readiness, deep) endpoints.
- Never put external dependency checks on liveness.
- Add a startup probe if boot takes more than 30 seconds.
- Parallelize dependency checks; enforce per-check timeouts under 2 s.
- Return structured JSON on readiness failures for faster incident triage.
- Align Kubernetes readiness with cloud load balancer health check paths.
- Pair readiness failure with connection draining and preStop hooks.
- Load-test probe overhead — thousands of replicas × 5 s interval adds up.
- Alert on readiness failure rate, not individual blips during deploys.
- Document degraded-mode policy for each dependency.
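The probe overhead item in the checklist is easy to estimate before load-testing. The replica count and fan-out below are illustrative assumptions, not measurements.

```python
def probe_requests_per_second(replicas, period_seconds, probes_per_replica=1):
    """Aggregate probe rate across the fleet."""
    return replicas * probes_per_replica / period_seconds


def dependency_pings_per_second(replicas, period_seconds, dependencies):
    """Each deep readiness probe fans out to every dependency it checks."""
    return probe_requests_per_second(replicas, period_seconds) * dependencies


# 2000 replicas probed every 5 s.
print(probe_requests_per_second(2000, 5))       # 400.0
# Each readiness probe pings 2 dependencies (for example Postgres and Redis).
print(dependency_pings_per_second(2000, 5, 2))  # 800.0
```

The second number is the one that matters for a shared database: deep readiness checks turn probe traffic into dependency traffic.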
Key takeaways
- Liveness restarts stuck processes; keep it shallow and dependency-free.
- Readiness gates traffic — use it for database, cache, and warmup checks.
- Startup probes protect slow-boot apps from premature liveness kills.
- Align orchestrator and load balancer health paths so deploys do not flap between systems.
- Health checks route traffic; metrics and SLOs tell you whether users are actually happy.