Guide
Autoscaling explained
Your API handled yesterday's traffic on three containers. This morning a product launch triples requests and latency spikes before anyone opens the capacity dashboard. Autoscaling closes that gap: policies watch load signals and add or remove compute automatically so capacity tracks demand without a human in the loop. Done well, you pay for what you use and survive spikes. Done poorly, you scale too late (outages), too eagerly (runaway bills), or into flapping chaos where replicas appear and vanish every minute. This guide covers vertical vs horizontal scaling, which metrics actually predict overload, Kubernetes Horizontal Pod Autoscaler (HPA) and cluster autoscaler mechanics, cloud auto scaling groups, scale-to-zero trade-offs, hysteresis and cooldown design, a worked API example, a decision table, common pitfalls, and a production checklist — with links to load balancing, Kubernetes, and load testing for the surrounding infrastructure.
Vertical vs horizontal scaling
Vertical scaling (scale up) gives each instance more CPU, memory, or disk — resize a VM from 4 to 16 vCPUs, bump container limits. It is simple and avoids distributed-system problems (sticky sessions, shard rebalancing) but hits hardware ceilings and usually requires downtime or a brief restart. One giant node is still a single point of failure.
Horizontal scaling (scale out) runs more identical replicas behind a load balancer. That is what most web autoscaling targets: stateless API pods, worker processes, or serverless function concurrency. Each new replica adds throughput until shared bottlenecks — database connections, cache hot keys, license limits — dominate. Autoscaling in production almost always means horizontal replica count, sometimes combined with vertical tweaks via the Kubernetes Vertical Pod Autoscaler (VPA) for right-sizing requests and limits.
Reactive vs predictive autoscaling
Reactive (threshold-based) scaling is the default: when average CPU across pods exceeds 70% for five minutes, add two replicas; when it drops below 30%, remove one. Reactive policies are easy to reason about but inherently lag — by the time CPU rises, requests may already queue. Mitigate with faster metrics (request rate, queue depth), lower targets, and pre-warming before known events.
Predictive scaling uses historical patterns or ML forecasts to add capacity before the spike — AWS Predictive Scaling, scheduled rules ("scale to 20 at 08:55 UTC weekdays"), or calendar integrations for Black Friday. Predictive reduces cold-start pain but costs money on quiet days if forecasts are wrong. Most teams combine a scheduled floor for business hours with reactive headroom above it.
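The "scheduled floor plus reactive headroom" combination can be sketched in a few lines. This is a minimal illustration, not any platform's API: the floor values, the 19:00 UTC end time, and the function names are assumptions made for the example.

```python
from datetime import datetime, timezone

# Hypothetical schedule values, chosen only for illustration.
BUSINESS_FLOOR = 20   # replicas held from 08:55 UTC on weekdays
DEFAULT_FLOOR = 2     # replicas held at all other times

def scheduled_floor(now: datetime) -> int:
    """Return the scheduled minimum replica count for a UTC timestamp."""
    is_weekday = now.weekday() < 5                 # Monday=0 ... Friday=4
    minutes = now.hour * 60 + now.minute
    in_hours = 8 * 60 + 55 <= minutes < 19 * 60    # 08:55 to 19:00 UTC (end time assumed)
    return BUSINESS_FLOOR if is_weekday and in_hours else DEFAULT_FLOOR

def effective_replicas(now: datetime, reactive_desired: int) -> int:
    """Reactive scaling can only add capacity on top of the scheduled floor."""
    return max(scheduled_floor(now), reactive_desired)

monday_morning = datetime(2024, 1, 8, 9, 0, tzinfo=timezone.utc)
print(effective_replicas(monday_morning, 5))    # 20: the floor wins on a quiet morning
print(effective_replicas(monday_morning, 35))   # 35: reactive demand exceeds the floor
```

Taking the maximum of the two values means a wrong forecast costs idle capacity but never removes replicas that the reactive policy asked for.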
Signals that should drive scale decisions
CPU utilization is the most common HPA metric — and often the wrong sole signal. A Node.js service waiting on Postgres can show 15% CPU while every request times out. Pick metrics tied to user-visible saturation:
- Requests per second (RPS) or concurrent connections — direct capacity proxy for stateless HTTP.
- Queue depth / lag — messages waiting in SQS, Kafka consumer lag, or job backlog; scales workers before queues explode.
- Latency percentiles — p95 or p99 from your observability stack; scale when SLO breach is imminent (requires custom metrics adapters).
- Memory utilization — essential for JVM heaps, ML inference, or in-process caches; OOM kills do not recover with CPU headroom.
- Custom application metrics — active WebSocket sessions, GPU utilization, tokens-per-second for LLM serving.
Use multiple metrics with OR semantics where supported: scale up if CPU > 70% or queue depth > 1,000. Scale-down should be conservative (remove capacity only when every metric is below its threshold, and use longer stabilization windows) so you do not shed capacity while another signal is still hot.
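One way to get "any metric can scale up, all metrics must agree to scale down" is to compute a replica proposal per metric and keep the largest, which is how the Kubernetes HPA combines multiple metrics. The sketch below uses the proportional rule from the HPA documentation and leaves out its tolerance band and missing-metric handling; the metric names and numbers are made up for the example.

```python
import math

def desired_for_metric(current_replicas: int, current_value: float, target_value: float) -> int:
    """Proportional rule: desired = ceil(current_replicas * current_value / target_value)."""
    return math.ceil(current_replicas * current_value / target_value)

def desired_replicas(current_replicas: int, metrics: dict[str, tuple[float, float]]) -> int:
    """Evaluate every (current, target) pair and keep the largest proposal."""
    return max(
        desired_for_metric(current_replicas, current, target)
        for current, target in metrics.values()
    )

# Scale-up: CPU proposes 5 replicas, queue depth per pod proposes 7, so the larger wins.
print(desired_replicas(4, {"cpu": (85, 70), "queue_per_pod": (400, 250)}))  # 7

# Scale-down held back: CPU alone would allow 2, but the queue still needs 4.
print(desired_replicas(4, {"cpu": (35, 70), "queue_per_pod": (250, 250)}))  # 4
```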
Kubernetes: HPA, VPA, and cluster autoscaler
In Kubernetes, three layers cooperate:
Horizontal Pod Autoscaler (HPA)
Adjusts Deployment replica count. The metrics server (or Prometheus
adapter) supplies current utilization; the HPA controller compares against targets
and patches spec.replicas. Example policy: target 60% average CPU,
min 2, max 50 replicas, scale-up stabilization 0s (aggressive), scale-down
stabilization 300s (slow).
```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 2
  maxReplicas: 50
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 60
```
Cluster autoscaler
Adds or removes worker nodes when pods cannot schedule (pending due to insufficient CPU/memory) or nodes sit underutilized. HPA without cluster autoscaler hits a ceiling: replicas want to grow but no node has room. Conversely, cluster autoscaler without HPA leaves empty nodes billing while pods stay at minReplicas.
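The ceiling described above is simple arithmetic. The sketch below is a rough capacity estimate under loud assumptions: it packs pods by CPU request only and ignores memory, DaemonSets, and other workloads on the node; the 500m request and 3500m allocatable figures are hypothetical.

```python
import math

def pods_per_node(pod_cpu_request_m: int, node_allocatable_cpu_m: int) -> int:
    """How many pods fit on one node, counting CPU requests only (millicores)."""
    fit = node_allocatable_cpu_m // pod_cpu_request_m
    if fit == 0:
        raise ValueError("pod request exceeds node allocatable CPU")
    return fit

def nodes_needed(replicas: int, pod_cpu_request_m: int, node_allocatable_cpu_m: int) -> int:
    """Nodes required to schedule every replica."""
    return math.ceil(replicas / pods_per_node(pod_cpu_request_m, node_allocatable_cpu_m))

def pending_pods(replicas: int, pod_cpu_request_m: int,
                 node_allocatable_cpu_m: int, node_count: int) -> int:
    """Replicas that cannot schedule on a fixed-size node pool."""
    capacity = node_count * pods_per_node(pod_cpu_request_m, node_allocatable_cpu_m)
    return max(0, replicas - capacity)

# 7 pods fit per node, so 50 replicas need 8 nodes.
print(nodes_needed(50, 500, 3500))       # 8
# A fixed pool of 5 nodes holds 35 pods and leaves 15 pending.
print(pending_pods(50, 500, 3500, 5))    # 15
```

Comparing maxReplicas against the node pool's maximum size this way shows whether the HPA can ever reach its cap.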
Vertical Pod Autoscaler (VPA)
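Recommends or mutates container requests and limits based on historical usage. Avoid running VPA and HPA against the same CPU or memory metric on one workload: VPA changes the requests that HPA's utilization percentage is computed from, so the two controllers fight. VPA shines for batch jobs and services with slowly drifting memory needs; HPA handles replica count.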
Cloud auto scaling groups (VM and container)
Outside Kubernetes, the pattern is similar. AWS Auto Scaling Groups attach to an Application Load Balancer target group: launch template defines the AMI or container instance, desired capacity floats between min and max, health checks replace unhealthy instances. GCP Managed Instance Groups and Azure VM Scale Sets follow the same model. For ECS or Cloud Run, the platform scales task or instance count from CPU, RPS, or custom CloudWatch / Cloud Monitoring metrics.
Load balancers must register new instances only after they pass health checks — otherwise traffic hits half-started processes during scale-up. Connection draining on scale-down gives in-flight requests time to finish before the instance terminates.
Scale-to-zero and cold starts
Serverless platforms and Knative-style workloads scale to zero when idle, eliminating idle cost. The trade-off is cold start latency: JVM boot, language runtime init, or container image pull can add seconds before the first request succeeds. Scale-from-zero works for async jobs and low-traffic internal APIs; it is risky for user-facing latency-sensitive paths unless you keep minReplicas: 1 or use provisioned concurrency. Measure cold start p99 in load tests before relying on zero minimums.
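For the cold start measurement, a nearest-rank percentile over first-request latencies is usually enough. This is a small sketch with synthetic sample values; in practice the samples would come from whatever load-testing tool you use.

```python
def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = (pct * len(ordered) + 99) // 100    # integer ceiling of pct/100 * n, 1-based
    return ordered[max(rank, 1) - 1]

# Synthetic data: 95 warm requests at 0.2s and 5 cold starts at 3.0s.
latencies = [0.2] * 95 + [3.0] * 5
print(percentile(latencies, 95))   # 0.2: p95 hides the cold starts
print(percentile(latencies, 99))   # 3.0: p99 exposes them
```

The example shows why the text asks for p99 specifically: a small share of cold starts can be invisible at lower percentiles.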
Hysteresis, cooldowns, and flapping
Without separation between scale-up and scale-down thresholds, a service hovering at 70% CPU adds a replica, average drops to 45%, removes the replica, spikes again — flapping. Fix with:
- Asymmetric thresholds: scale up at 70% CPU, scale down only below 40%.
- Stabilization windows: the Kubernetes HPA v2 behavior field makes the controller act on the highest replica recommendation seen over the last N seconds, so a brief dip does not trigger scale-down.
- Cooldown periods: cloud ASGs pause further scale-in for a cooldown period (often 300s) after a scaling activity so new instances have time to take load.
- Step policies: add at most +4 pods per minute instead of jumping straight to maxReplicas.
Aggressive scale-up and conservative scale-down is the usual production default: outages are worse than a few extra dollars per hour of over-provisioned capacity.
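A toy simulation makes the effect of the threshold gap visible. It assumes a constant total load that spreads evenly across replicas and a controller that moves one replica per tick; the load figure of 213 "CPU-percent units" is arbitrary, picked so three replicas sit just above 70%.

```python
def simulate(load: float, replicas: int, up: float, down: float, ticks: int) -> list[int]:
    """Apply a threshold rule once per tick and record the replica count after each tick."""
    history = []
    for _ in range(ticks):
        utilization = load / replicas          # average CPU % per replica
        if utilization > up:
            replicas += 1
        elif utilization < down and replicas > 1:
            replicas -= 1
        history.append(replicas)
    return history

def changes(history: list[int]) -> int:
    """Count how many times the replica count changed between ticks."""
    return sum(1 for a, b in zip(history, history[1:]) if a != b)

symmetric = simulate(load=213, replicas=3, up=70, down=70, ticks=6)
asymmetric = simulate(load=213, replicas=3, up=70, down=40, ticks=6)
print(symmetric, changes(symmetric))     # [4, 3, 4, 3, 4, 3] 5
print(asymmetric, changes(asymmetric))   # [4, 4, 4, 4, 4, 4] 0
```

With a single threshold the fourth replica is removed as soon as it brings utilization down to about 53%; with a 70/40 gap that value sits inside the dead band and the count holds.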
Minimum replicas, maximum caps, and cost
minReplicas is your availability floor across zones and during deploys. Running min 2 in two AZs means the service still serves traffic after one node or zone failure. maxReplicas is a budget circuit breaker: without it, a retry storm or DDoS can spin up hundreds of expensive GPU instances. Review max against monthly spend alerts, and apply backpressure upstream (rate limits, queue shedding) so autoscaling is not your only defense.
Autoscaling saves money on diurnal traffic (nights, weekends) but never optimizes architecture: an N+1 query or missing cache burns CPU on every replica. Fix efficiency before widening maxReplicas.
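To tie maxReplicas to a budget, work backwards from the worst case of running at the cap for a whole month. The sketch below assumes a flat hourly cost per replica and 730 hours per month; the prices are hypothetical.

```python
def max_replicas_for_budget(monthly_budget: float, cost_per_replica_hour: float,
                            hours_per_month: int = 730) -> int:
    """Largest replica cap that stays within budget even if held for the full month."""
    return int(monthly_budget // (cost_per_replica_hour * hours_per_month))

def clamp(desired: int, min_replicas: int, max_replicas: int) -> int:
    """Keep the autoscaler's proposal between the availability floor and the budget cap."""
    return max(min_replicas, min(desired, max_replicas))

cap = max_replicas_for_budget(monthly_budget=10_000, cost_per_replica_hour=0.50)
print(cap)                  # 27
print(clamp(400, 2, cap))   # 27: a retry storm cannot exceed the cap
print(clamp(1, 2, cap))     # 2: quiet periods never drop below the floor
```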
Worked example: stateless REST API
A JSON API on Kubernetes serves 800 RPS at peak with p95 latency target 200ms. Each pod handles ~100 RPS at 50% CPU when healthy. Baseline plan:
- Set minReplicas: 3 (N+1 across three zones), maxReplicas: 30.
- HPA target: 55% CPU and custom metric 80 RPS per pod (whichever demands more replicas).
- Scale-up: +50% pods or +4 pods per 60s, whichever is greater; stabilization 0s.
- Scale-down: −10% pods per 120s; stabilization 300s; never below minReplicas.
- Cluster autoscaler on the node pool; nodes use scale-down-delay-after-add: 10m.
- ALB health check /health with 15s grace on new pods before receiving traffic.
- Scheduled rule: minReplicas 6 on weekdays 07:00–19:00 UTC for marketing email blasts.
Load test validates that scale-up completes within two minutes of a step traffic increase — faster than marketing can notice. If not, lower CPU target or pre-warm scheduled capacity.
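The numbers in this plan can be checked with a short calculation. The sketch assumes CPU grows linearly with RPS (so 100 RPS at 50% CPU implies 110 RPS at the 55% target) and counts only policy periods, not pod startup or node provisioning time; the function names are illustrative.

```python
import math

def replicas_for_peak(peak_rps: float, rps_target_per_pod: float,
                      rps_per_pod_at_cpu_target: float) -> int:
    """Replicas needed at peak: the larger of the RPS-based and CPU-based proposals."""
    by_rps = math.ceil(peak_rps / rps_target_per_pod)
    by_cpu = math.ceil(peak_rps / rps_per_pod_at_cpu_target)
    return max(by_rps, by_cpu)

def periods_to_scale(start: int, goal: int, pct: int = 50, pods: int = 4) -> int:
    """60-second policy periods needed when each period may add max(pct% of current, pods)."""
    periods, current = 0, start
    while current < goal:
        step = max(math.ceil(current * pct / 100), pods)
        current = min(current + step, goal)
        periods += 1
    return periods

rps_at_cpu_target = 100 * 55 / 50                      # 110 RPS per pod at 55% CPU (linear assumption)
goal = replicas_for_peak(800, 80, rps_at_cpu_target)
print(goal)                        # 10: the 80 RPS per pod metric dominates (CPU alone asks for 8)
print(periods_to_scale(3, goal))   # 2: 3 -> 7 -> 10, about two minutes from the overnight floor
print(periods_to_scale(6, goal))   # 1: 6 -> 10 from the scheduled weekday floor
```

Under these assumptions the policy alone fits the two-minute goal from minReplicas 3, leaving no slack for slow pod starts, which is what the load test has to confirm.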
Decision table: when autoscaling helps vs hurts
| Workload shape | Autoscaling fit | Notes |
|---|---|---|
| Stateless HTTP API | Excellent | CPU + RPS metrics; minReplicas for HA |
| Queue workers | Excellent | Scale on queue depth or consumer lag |
| WebSocket / game servers | Moderate | Session stickiness; scale on connection count |
| Single-leader databases | Poor (vertical only) | Use read replicas; primary does not scale out writes |
| Batch ETL with deadlines | Good | Scale workers for backlog; scale to zero after job |
| GPU inference | Good with caps | Expensive; maxReplicas tied to budget; queue first |
Common pitfalls
- CPU-only HPA on I/O-bound services: replicas stay flat while latency explodes; add queue or latency metrics.
- No cluster autoscaler: HPA creates pending pods that never schedule.
- Scaling before health checks pass: new replicas receive traffic while still starting; increase initialDelaySeconds.
- Shared database connection exhaustion: 50 pods each open 20 connections and exceed Postgres max_connections; pool at the app or PgBouncer layer.
- Symmetric up/down thresholds: flapping and alert noise; widen the gap.
- Unbounded maxReplicas: financial incident during an attack or bug; set caps and rate limits.
- Ignoring scale-down disruption: SIGTERM without graceful shutdown drops in-flight work; see graceful shutdown guides.
- Never load testing the autoscale path: policies stay untested until a real Black Friday.
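The connection exhaustion pitfall is easy to check ahead of time. The sketch below multiplies pool size by replica count and derives a per-pod pool size that stays under the database limit at maxReplicas; the max_connections value of 500 and the 10 reserved connections are hypothetical.

```python
def total_connections(replicas: int, pool_size_per_pod: int) -> int:
    """Worst-case connections opened against the database."""
    return replicas * pool_size_per_pod

def max_pool_size(max_connections: int, reserved: int, max_replicas: int) -> int:
    """Largest per-pod pool that fits under the limit when scaled to max_replicas."""
    return (max_connections - reserved) // max_replicas

print(total_connections(50, 20))                                          # 1000
print(max_pool_size(max_connections=500, reserved=10, max_replicas=50))   # 9
```

If the resulting pool size is too small for a single pod to work with, that is the signal to put a pooler such as PgBouncer in front of the database rather than raise maxReplicas.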
Production checklist
- Define SLO (latency, error rate) and pick metrics that predict SLO breach.
- Set minReplicas for zone redundancy and maxReplicas for budget safety.
- Configure asymmetric scale-up vs scale-down with stabilization windows.
- Ensure load balancer health checks gate traffic to new instances.
- Wire cluster autoscaler (or cloud ASG) so new pods can actually schedule.
- Validate connection pools and downstream limits scale with replica count.
- Run load tests that trigger scale-out and measure time-to-ready replicas.
- Dashboard replica count, pending pods, and $/hour per service.
- Document scheduled pre-warm for known traffic events.
- Revisit policies quarterly as code efficiency and traffic shape change.
Key takeaways
- Autoscaling matches replica count to demand — horizontal scaling is the default for web services.
- Metric choice matters more than algorithm — queue depth and RPS often beat CPU alone.
- HPA + cluster autoscaler + load balancer health checks form a complete Kubernetes stack.
- Aggressive scale-up, slow scale-down prevents flapping and protects users.
- maxReplicas and efficiency work are as important as enabling autoscaling in the first place.
Related reading
- Kubernetes fundamentals explained — pods, Deployments, Services, and built-in HPA hooks
- Load balancing explained — distributing traffic across autoscaled replicas
- Load testing explained — proving scale-out latency before production spikes
- Microservices architecture explained — per-service scaling and shared bottleneck risks