Autoscaling explained

Traffic is not flat. A product launch, a viral post, or a Black Friday sale can multiply request volume in minutes. Running enough servers for peak load 24/7 wastes money; running too few drops users during spikes. Autoscaling closes that gap by adding or removing compute automatically based on signals you define. This guide covers vertical versus horizontal scaling, reactive and predictive policies, which metrics to trust (CPU, requests per second, queue depth), how the Kubernetes Horizontal Pod Autoscaler and cluster autoscaler interact, scale-to-zero cold-start traps, hysteresis and cooldown windows, and cost guardrails. It closes with a Harbor Fleet order API worked example, a strategy decision table, common pitfalls, and a production checklist, and it pairs with our load balancing guide and Prometheus monitoring explainer.

Vertical vs horizontal scaling

Vertical scaling (scale up) gives each instance more CPU, RAM, or disk. It is simple — change an instance type, restart, done — but hits hard ceilings (largest VM size, single-node database limits) and usually requires downtime or brief disruption. Horizontal scaling (scale out) adds more identical instances behind a load balancer. Stateless web APIs and workers are the classic horizontal targets; monolithic databases often scale vertically first, then shard or replicate.

Modern autoscaling almost always means adjusting the horizontal replica count for stateless tiers, optionally combined with node scaling at the infrastructure layer. The two dimensions are complementary: HPA decides how many pods should run, the scheduler places them, and cluster autoscaler provisions more (or larger) nodes when those pods do not fit.

Reactive vs predictive autoscaling

Reactive (threshold-based) scaling watches live metrics and adds capacity when a signal crosses a target for a sustained window. It is easy to reason about but lags demand — by the time CPU hits 80%, users may already see latency spikes. Predictive scaling uses historical patterns (same hour yesterday, known campaign schedules) to pre-warm capacity before the spike. AWS Auto Scaling scheduled actions, Google Cloud predictive autoscaling, and custom cron-based replica bumps are common implementations. Production stacks often blend both: predictive baseline plus reactive headroom.
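The blend described above can be sketched as taking the larger of a scheduled baseline and the live reactive recommendation. This is a minimal illustration only: the function name, the hour-keyed schedule, and all numbers are hypothetical and do not correspond to any cloud provider's API.

```python
from typing import Mapping


def blended_replicas(hour: int, reactive_recommendation: int,
                     schedule: Mapping[int, int], default_baseline: int) -> int:
    """Return the larger of the scheduled baseline and the reactive recommendation."""
    # Predictive part: a baseline chosen ahead of time for this hour of day.
    baseline = schedule.get(hour, default_baseline)
    # Reactive part: live metrics may ask for more than the baseline, never less.
    return max(baseline, reactive_recommendation)


# Hypothetical schedule: pre-warm to 10 replicas for a known 09:00 campaign.
campaign_schedule = {9: 10, 10: 12}

print(blended_replicas(9, 4, campaign_schedule, default_baseline=3))   # 10
print(blended_replicas(9, 14, campaign_schedule, default_baseline=3))  # 14
print(blended_replicas(2, 1, campaign_schedule, default_baseline=3))   # 3
```

The baseline acts as a floor, so a forecast that turns out too low is still corrected by the reactive signal.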

What to scale on: metrics that actually work

The wrong metric causes either thrashing (rapid scale-up/down loops) or under-provisioning (latency rises before the scaler reacts). Pick signals tied to user-visible pain or backlog growth.

CPU and memory utilization

Average CPU utilization across pods is the most common Kubernetes HPA metric. Utilization is measured against each pod's CPU request, so it works when the work is CPU-bound and requests are set honestly. It fails when pods are I/O-bound (waiting on databases): CPU stays low while queues grow. Memory-based HPA is supported but risky: many runtimes do not release memory when load falls, so replicas may never scale back in, and a pod that reaches its memory limit during scale-up lag is OOM-killed before new replicas can help. Always set resource requests so the scheduler and autoscaler share the same picture of capacity.
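The arithmetic behind a utilization target is easier to see in code. The sketch below follows the replica formula in the Kubernetes HPA documentation (desired = ceil(current × currentMetric / targetMetric)) and its default 10% tolerance band. The function name and sample values are illustrative, and the real controller also handles missing metrics and not-yet-ready pods, which this omits.

```python
import math


def desired_replicas(current_replicas: int, current_value: float,
                     target_value: float, tolerance: float = 0.1) -> int:
    """One HPA-style evaluation for a single metric."""
    ratio = current_value / target_value
    # Inside the tolerance band the replica count is left alone,
    # which avoids reacting to small fluctuations around the target.
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas
    return math.ceil(current_replicas * ratio)


print(desired_replicas(3, 85, 65))   # 4: 3 pods at 85% CPU against a 65% target
print(desired_replicas(10, 68, 65))  # 10: within tolerance, no change
print(desired_replicas(4, 30, 60))   # 2: half the target, so half the pods
```

Because the ratio divides observed usage by the target, a missing or unrealistic CPU request distorts every result this formula produces.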

Requests per second and latency

For HTTP services, scaling on RPS or custom metrics like p95 latency tracks demand more directly than CPU. You need a metrics pipeline (Prometheus plus adapters, or cloud-native metrics) feeding the autoscaler. Target “requests per pod” rather than global RPS so the math stays stable as replica count changes.
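A per-pod target keeps the math stable because the same total load always maps to the same replica count, whatever the current count is. The sketch below assumes a hypothetical target of 100 requests per second per pod and hypothetical min/max bounds. It illustrates the arithmetic, not a metrics adapter.

```python
import math


def replicas_for_rps(total_rps: float, target_rps_per_pod: float,
                     min_replicas: int, max_replicas: int) -> int:
    """Replica count that keeps each pod at or below its RPS target."""
    if target_rps_per_pod <= 0:
        raise ValueError("target_rps_per_pod must be positive")
    wanted = math.ceil(total_rps / target_rps_per_pod)
    # Clamp to the configured bounds: min for availability, max for cost.
    return max(min_replicas, min(max_replicas, wanted))


print(replicas_for_rps(1200, 100, min_replicas=2, max_replicas=30))   # 12
print(replicas_for_rps(50, 100, min_replicas=2, max_replicas=30))     # 2 (floor)
print(replicas_for_rps(10000, 100, min_replicas=2, max_replicas=30))  # 30 (cap)
```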

Queue depth and consumer lag

Background workers should scale on queue depth, oldest-message age, or Kafka consumer lag — not CPU. A backlog of 10,000 jobs with idle CPU means you need more consumers, not bigger machines. Pair queue-based scaling with max concurrency per pod so new replicas actually drain work.
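Backlog-based sizing can be sketched by asking how many consumers are needed to drain the queue within a chosen time. The throughput per worker, the drain target, and the bounds below are hypothetical inputs you would measure or choose yourself. Event-driven scalers such as KEDA apply a similar "backlog per replica" idea, but this is not their implementation.

```python
import math


def workers_for_backlog(backlog: int, jobs_per_worker_per_sec: float,
                        drain_target_sec: float, max_workers: int,
                        min_workers: int = 0) -> int:
    """Workers needed to drain `backlog` jobs within `drain_target_sec`."""
    if backlog <= 0:
        return min_workers
    # Jobs a single worker can finish inside the drain window.
    capacity_per_worker = jobs_per_worker_per_sec * drain_target_sec
    wanted = math.ceil(backlog / capacity_per_worker)
    return max(min_workers, min(max_workers, wanted))


# 10,000 queued jobs, 5 jobs/s per worker, drain within 2 minutes.
print(workers_for_backlog(10_000, 5, 120, max_workers=50))  # 17
print(workers_for_backlog(0, 5, 120, max_workers=50))       # 0
```

CPU never appears in the calculation, which matches the point above: backlog and throughput drive the consumer count.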

External and custom metrics

Cloud load balancers expose active connection counts; serverless platforms expose concurrent executions. Custom metrics (GPU utilization, embedding batch size, cache hit ratio) reach an autoscaling/v2 HPA through the custom and external metrics APIs, which a metrics adapter must serve. Document the SLO each metric protects so on-call engineers know why replica count moved.

Kubernetes autoscaling in practice

On Kubernetes, three layers often run together:

Horizontal Pod Autoscaler (HPA) — adjusts Deployment replica count from metrics. If no metric is specified, Kubernetes falls back to 80% average CPU utilization; many teams set an explicit target nearer 70%. Supports min/max bounds, scale-down stabilization windows, and multiple metrics (take the highest recommended replica count).
Vertical Pod Autoscaler (VPA) — recommends or mutates CPU/memory requests. Rarely combined with HPA on the same workload without careful tuning; many teams use VPA in recommendation-only mode.
Cluster Autoscaler — adds or removes worker nodes when pods cannot schedule (pending due to insufficient CPU/memory) or when nodes sit underutilized. It does not replace HPA; it supplies the node capacity that HPA's replicas consume.
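The "highest recommendation wins" rule for multiple metrics can be sketched as follows. The metric names, values, and bounds are hypothetical, and tolerance handling is left out for brevity.

```python
import math
from typing import Mapping, Tuple


def recommend(current_replicas: int,
              metrics: Mapping[str, Tuple[float, float]],
              min_replicas: int, max_replicas: int) -> int:
    """Evaluate each (current, target) metric pair and keep the largest answer."""
    per_metric = {
        name: math.ceil(current_replicas * value / target)
        for name, (value, target) in metrics.items()
    }
    wanted = max(per_metric.values())
    # Bounds apply after the per-metric maximum is chosen.
    return max(min_replicas, min(max_replicas, wanted))


# Hypothetical snapshot: CPU is under target, but per-pod RPS is well over it.
snapshot = {
    "cpu_utilization_pct": (60, 65),  # alone, this would suggest 5 replicas
    "rps_per_pod": (180, 100),        # alone, this would suggest 9 replicas
}
print(recommend(5, snapshot, min_replicas=3, max_replicas=30))  # 9
```

Taking the maximum means the scaler follows whichever signal is under the most pressure, so a quiet CPU cannot mask a saturated request path.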

A typical scale-out path: traffic rises, per-pod CPU crosses the HPA target, HPA requests more replicas, new pods pend if the cluster lacks capacity, cluster autoscaler provisions a node, pods schedule, the service endpoints update. Scale-in reverses with delays — HPA waits before removing pods; cluster autoscaler waits longer before draining nodes to avoid flapping.
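The node-provisioning step in that path comes down to fitting pending pods onto new nodes. The sketch below is a deliberate simplification that considers CPU requests only. The real cluster autoscaler simulates scheduling and also accounts for memory, DaemonSets, affinity rules, and node group limits. The allocatable CPU figure is a hypothetical value.

```python
import math


def nodes_needed(pending_pods: int, pod_cpu_request_m: int,
                 node_allocatable_cpu_m: int) -> int:
    """Rough count of new nodes needed to fit pending pods, by CPU request alone."""
    pods_per_node = node_allocatable_cpu_m // pod_cpu_request_m
    if pods_per_node == 0:
        raise ValueError("a single pod does not fit on this node size")
    return math.ceil(pending_pods / pods_per_node)


# 4 pending pods requesting 500m each; assume ~1900m allocatable per new node.
print(nodes_needed(4, 500, 1900))  # 2 (three pods fit per node)
```

The gap between total and allocatable CPU matters here: system reservations can cost a node one whole pod slot.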

Scale to zero and cold starts

Knative, KEDA, and some serverless platforms allow scale to zero replicas when demand is absent. That saves money on idle services but introduces cold starts: container pull, JVM warmup, connection pool init, and JIT compilation can add seconds of latency on the first request. Use scale-to-zero for internal tools and batch workers; keep minimum replicas ≥ 1 (often 2 for HA) on user-facing APIs with tight latency SLOs. Pre-pull images onto nodes, gate traffic with readiness probes, and add startup probes for slow-starting apps so Kubernetes neither routes requests to nor restarts a container that is still warming up.

Hysteresis, cooldowns, and cost caps

Without damping, autoscalers oscillate: scale up at 70% CPU, load spreads thin, CPU drops to 30%, scale down, load concentrates, spike again. Fix this with:

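Asymmetric thresholds — scale up at 70% CPU, scale down only below 40%. Threshold-based cloud policies support this directly; Kubernetes HPA uses a single target, so it gets a similar effect from its tolerance band and scale-down behavior settings.
Stabilization windows — HPA behavior.scaleDown.stabilizationWindowSeconds (300 seconds by default) makes the controller act on the highest recommendation seen within the window, so brief dips are ignored.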
Cooldown periods — cloud ASGs enforce minimum time between activities; mirror this in custom scalers.
Max replica caps — hard ceiling prevents runaway bills during DDoS or retry storms.
Scale-down limits — remove at most N pods per minute so draining connections finish gracefully.
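The damping effect of a scale-down stabilization window can be sketched with a small simulation. The class below mimics the documented HPA behavior of acting on the highest recommendation seen within the window when scaling down. The class name, timestamps, and recommendations are hypothetical, and it is not the controller's actual code.

```python
from collections import deque


class ScaleDownStabilizer:
    """Scale up immediately; scale down only to the window's highest recommendation."""

    def __init__(self, window_seconds: float):
        self.window_seconds = window_seconds
        self._history = deque()  # (timestamp, raw recommendation)

    def stabilize(self, now: float, current: int, recommendation: int) -> int:
        self._history.append((now, recommendation))
        # Drop recommendations that have aged out of the window.
        while self._history and self._history[0][0] < now - self.window_seconds:
            self._history.popleft()
        if recommendation >= current:
            return recommendation
        return min(current, max(rec for _, rec in self._history))


stabilizer = ScaleDownStabilizer(window_seconds=300)
replicas = 9
for t, raw in [(0, 9), (60, 4), (120, 5), (310, 4), (430, 4)]:
    replicas = stabilizer.stabilize(t, replicas, raw)
    print(f"t={t:>3}s raw={raw} applied={replicas}")
# Applied counts: 9, 9, 9, 5, 4. The drop waits until the 9 ages out of the window.
```

A brief dip in the raw recommendation never reaches the Deployment, which is what breaks the oscillation loop described above.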

Tag autoscaling groups and namespaces with cost-center labels. Alert when replica count or node count exceeds budget thresholds for more than an hour — autoscaling should not be a silent budget leak.
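The "over budget for more than an hour" rule can be sketched as a check for a sustained breach over time-ordered samples. In practice this logic would usually live in your alerting system. The function name, sample format, and limits here are hypothetical.

```python
from typing import Sequence, Tuple


def sustained_breach(samples: Sequence[Tuple[float, int]], limit: int,
                     min_duration_sec: float) -> bool:
    """True if replica count stayed above `limit` for at least `min_duration_sec`.

    `samples` are (timestamp_seconds, replica_count) pairs in time order.
    """
    breach_start = None
    for timestamp, replicas in samples:
        if replicas > limit:
            if breach_start is None:
                breach_start = timestamp
            if timestamp - breach_start >= min_duration_sec:
                return True
        else:
            # Any sample back under the limit resets the clock.
            breach_start = None
    return False


steady = [(0, 25), (1200, 26), (2400, 27), (3600, 25)]
spiky = [(0, 25), (1200, 26), (2400, 12), (3600, 25)]
print(sustained_breach(steady, limit=20, min_duration_sec=3600))  # True
print(sustained_breach(spiky, limit=20, min_duration_sec=3600))   # False
```

Requiring a sustained breach keeps short, legitimate bursts from paging anyone while still catching slow budget leaks.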

Worked example: Harbor Fleet order API

Harbor Fleet runs a stateless order API on Kubernetes behind an AWS Application Load Balancer. Baseline: 3 replicas, 500m CPU request, 1 CPU limit, HPA min 3 / max 30, target 65% CPU, scale-down stabilization 300 seconds. Prometheus exports http_requests_per_second per pod via a custom metrics adapter; HPA also targets a p95 latency of 250 ms as a secondary metric (the larger of the two replica calculations wins).

During a flash sale, RPS jumps 8x in two minutes. CPU per pod hits 85%; HPA adds 6 replicas within 90 seconds. Four new pods pend — cluster autoscaler adds two m6i.large nodes in the node pool. p95 latency peaks at 310ms then falls to 120ms as endpoints register. After the sale, RPS drops but stabilization prevents scale-down for five minutes, avoiding thrash from checkout retries. Ops reviews the event in Grafana: replica count, ALB target health, and HPA decisions on one dashboard. Takeaway: dual metrics (CPU + latency) caught an I/O-heavy spike that CPU alone would have under-provisioned; min replicas = 3 preserved HA during node bootstrap.

Strategy decision table

Workload type	Scale dimension	Primary metric	Notes
Stateless REST/GraphQL API	Horizontal (HPA)	RPS per pod or p95 latency	Min replicas ≥ 2; avoid scale-to-zero on hot path
CPU-bound batch workers	Horizontal	Queue depth or job age	Scale on backlog, not CPU
WebSocket / long-poll	Horizontal + sticky sessions	Active connections per pod	Draining matters on scale-down
Monolithic JVM service	Vertical first, then horizontal	Heap pressure + GC pause time	Cold JVM warmup hurts fast scale-out
GPU inference	Horizontal on GPU pool	Queue wait time or GPU util	Expensive nodes — tight max cap
Nightly ETL	Scheduled + reactive	Time window + CPU	Predictive cron pre-warms; reactive handles overruns
Multi-tenant SaaS	Horizontal per service	Per-tenant rate limits + global CPU	Noisy neighbor isolation via separate HPAs

Common pitfalls

Missing or wrong resource requests — HPA divides usage by requests; unset requests make scaling math meaningless.
Scaling on CPU for I/O-bound apps — pods look idle while latency explodes; add queue or latency metrics.
No readiness probes — traffic hits starting pods; users see errors during scale-out.
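Too-aggressive scale-down — connection drops and cache loss; use stabilization windows and graceful termination, plus PodDisruptionBudgets (PDBs) to limit evictions when cluster autoscaler drains nodes.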
Cluster autoscaler without headroom — HPA requests pods that cannot schedule; pending pods are not serving traffic.
Ignoring scale-up lead time — new nodes take minutes; predictive scaling or over-provisioned min replicas cover known events.
Retry storms amplifying load — clients retry on 503s and multiply RPS; fix clients with exponential backoff and jitter, and use circuit breakers.
Unbounded max replicas — autoscaling during attacks or bugs can 10x cloud spend in an hour.

Production checklist

Define SLOs (latency, error rate, queue age) before choosing scale metrics.
Set CPU/memory requests and limits on every pod HPA manages.
Configure min replicas for HA; max replicas for cost control.
Add readiness and liveness probes; verify new pods pass readiness before receiving traffic.
Enable scale-down stabilization and asymmetric scale-up vs scale-down policies.
Ensure cluster autoscaler node pools have room for max HPA replica count.
Export HPA decisions and replica count to your metrics stack; chart them beside latency on the same dashboard.
Load-test scale-out path (including node provisioning time) quarterly.
Document runbooks for “stuck at max replicas” and “pending pods” alerts.
Review autoscaling bills monthly; adjust targets if average utilization is consistently low.

Key takeaways

Horizontal autoscaling adds replicas; vertical scaling grows each instance — most web tiers scale out.
Pick metrics tied to user pain or backlog, not CPU alone.
Kubernetes HPA and cluster autoscaler solve different problems and must be tuned together.
Hysteresis and cooldowns prevent thrashing; max caps prevent bill shocks.
Scale to zero saves money on idle workloads at the cost of cold-start latency on the first request.