Disaster recovery and backup strategies explained
Disaster recovery (DR) is the set of policies, infrastructure, and runbooks that restore your systems after catastrophic failure — a data center outage, ransomware, operator error, or cloud region loss. Backups are the foundation; failover architecture and tested procedures turn backups into actual recovery. Two numbers define every DR plan: RPO (Recovery Point Objective) caps how much data you can afford to lose, and RTO (Recovery Time Objective) caps how long the business can stay offline. This guide covers backup types, the 3-2-1 rule, standby tiers, multi-region failover patterns, point-in-time recovery, DR testing, and how DR connects to database replication, deployment strategies, and infrastructure as code.
RPO and RTO: the two numbers that drive every decision
RPO answers: “If we fail right now, how far back in time can we roll?” An RPO of 15 minutes means you accept losing at most 15 minutes of writes. An RPO of zero requires synchronous replication: no committed transaction is lost, but write latency and cost rise.
RTO answers: “How long until users can use the product again?” An RTO of four hours might mean restoring a database from last night’s backup and redeploying app servers. An RTO of five minutes means automated DNS failover to a hot standby that is already running.
Tighter RPO and RTO cost more: more frequent backups, cross-region replication, always-on standby capacity, and regular DR drills. Product and finance teams should agree on tiered objectives per service — payment processing might need RPO < 1 min and RTO < 5 min; an internal analytics dashboard might tolerate RPO of 24 hours and RTO of 8 hours.
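As a rough illustration of tiered objectives, the sketch below records an RPO and RTO per service and checks a measured incident against them. The class and function names are hypothetical, and the comparison treats the targets as inclusive upper bounds, which is a simplification of the "less than" targets quoted above.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Objectives:
    rpo: timedelta  # maximum tolerated data loss
    rto: timedelta  # maximum tolerated downtime


def meets_objectives(target: Objectives, data_loss: timedelta, downtime: timedelta) -> bool:
    """Return True when both measured values are within the agreed targets."""
    return data_loss <= target.rpo and downtime <= target.rto


# Example targets taken from the paragraph above.
payments = Objectives(rpo=timedelta(minutes=1), rto=timedelta(minutes=5))
analytics = Objectives(rpo=timedelta(hours=24), rto=timedelta(hours=8))

print(meets_objectives(payments, timedelta(seconds=8), timedelta(minutes=2)))
print(meets_objectives(payments, timedelta(minutes=15), timedelta(minutes=2)))
```

The second call fails only on RPO, which shows that the two objectives have to be checked independently.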
Backup types and the 3-2-1 rule
A full backup copies the entire dataset. Restore is simple (one file set) but slow to create and transfer. Incremental backups capture only changes since the last backup of any type — fast and storage-efficient, but restore chains through every incremental since the last full. Differential backups capture changes since the last full — faster restore than a long incremental chain, larger than a single incremental.
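The difference in restore chains can be made concrete with a small sketch. It assumes a simple ordered backup history (oldest first) and a hypothetical `Backup` record; real backup tools track this in their own catalogs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Backup:
    name: str
    kind: str  # "full", "incremental", or "differential"


def restore_chain(history: list[Backup]) -> list[str]:
    """Return the backups needed to restore the newest state, oldest first."""
    chain: list[str] = []
    i = len(history) - 1
    while i >= 0:
        backup = history[i]
        chain.append(backup.name)
        if backup.kind == "full":
            return chain[::-1]
        if backup.kind == "differential":
            # A differential depends only on the most recent full backup.
            while i >= 0 and history[i].kind != "full":
                i -= 1
            continue
        # An incremental depends on whatever backup came right before it.
        i -= 1
    raise ValueError("history contains no full backup to restore from")


incrementals = [Backup("sun-full", "full"), Backup("mon-inc", "incremental"),
                Backup("tue-inc", "incremental"), Backup("wed-inc", "incremental")]
differentials = [Backup("sun-full", "full"), Backup("mon-diff", "differential"),
                 Backup("tue-diff", "differential"), Backup("wed-diff", "differential")]

print(restore_chain(incrementals))   # every incremental since the full
print(restore_chain(differentials))  # the full plus the newest differential
```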
Continuous backup / point-in-time recovery (PITR) streams transaction logs (WAL in PostgreSQL, binlog in MySQL) to durable storage. You can restore to any second within the retention window — the gold standard for database RPO measured in seconds. Managed cloud databases (RDS, Cloud SQL, Aurora) offer PITR with one-click restore.
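Conceptually, a point-in-time restore picks the newest base backup taken at or before the target time and replays logs from there up to the target. The sketch below models only that selection step; `plan_pitr` and its parameters are hypothetical and do not correspond to any database's API.

```python
from datetime import datetime, timedelta


def plan_pitr(base_backups: list[datetime], target: datetime,
              now: datetime, retention: timedelta) -> tuple[datetime, timedelta]:
    """Return (base backup to restore, span of log to replay) for a target time."""
    if target > now or target < now - retention:
        raise ValueError("target is outside the retention window")
    candidates = [b for b in base_backups if b <= target]
    if not candidates:
        raise ValueError("no base backup precedes the target")
    base = max(candidates)
    return base, target - base


now = datetime(2024, 1, 10, 12, 0)
nightly = [datetime(2024, 1, 8, 2, 0), datetime(2024, 1, 9, 2, 0), datetime(2024, 1, 10, 2, 0)]
base, replay = plan_pitr(nightly, datetime(2024, 1, 9, 15, 30), now, timedelta(days=7))
print(base, replay)
```

The replay span is a useful proxy for restore time: the longer the gap between base backups, the more log has to be replayed.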
The 3-2-1 rule is the minimum sane backup posture: three copies of data, on two different media types, with one copy offsite (different region or provider). Production data, local snapshots, and a cross-region object storage copy together satisfy this. Backups in the same account and region as production do not protect against account compromise or regional outages.
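A 3-2-1 audit can be expressed as a few set checks. This sketch assumes the list of copies includes the production copy, models "offsite" as a different region only, and adds an account-isolation check based on the warning above; the `Copy` fields and names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Copy:
    media: str    # e.g. "block-volume", "block-snapshot", "object-storage"
    region: str
    account: str


def check_3_2_1(copies: list[Copy], prod_region: str, prod_account: str) -> list[str]:
    """Return a list of problems; an empty list means the posture passes."""
    problems = []
    if len(copies) < 3:
        problems.append("fewer than three copies")
    if len({c.media for c in copies}) < 2:
        problems.append("fewer than two media types")
    if not any(c.region != prod_region for c in copies):
        problems.append("no offsite copy")
    # Extra check beyond the classic rule: isolation from account compromise.
    if not any(c.account != prod_account for c in copies):
        problems.append("no copy outside the production account")
    return problems


copies = [
    Copy("block-volume", "us-east-1", "prod"),
    Copy("block-snapshot", "us-east-1", "prod"),
    Copy("object-storage", "us-west-2", "backup"),
]
print(check_3_2_1(copies, prod_region="us-east-1", prod_account="prod"))
```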
Standby tiers: cold, warm, and hot
Cold standby keeps backups and possibly infrastructure templates but no running servers. RTO is hours to days — restore data, provision VMs, deploy. Cheapest; suitable for non-critical systems or as a last-resort tier.
Warm standby runs scaled-down replicas — a smaller database replica catching up via async replication, app servers at minimum instance count. Failover promotes the replica and scales up. RTO of minutes to an hour; ongoing cost is moderate.
Hot standby / active-passive maintains a full-capacity passive cluster ready to take traffic on DNS or load-balancer flip. RTO of seconds to minutes. Highest standby cost but required for tier-1 services.
Active-active multi-region runs production traffic in two or more regions simultaneously. RTO approaches zero for regional failure (traffic routes to surviving region), but data consistency across regions is hard — see CAP theorem and consistency models for the trade-offs. Conflict resolution, sticky sessions, and write routing add significant engineering complexity.
Failover mechanics: DNS, load balancers, and database promotion
Application failover typically routes traffic via DNS (update A/AAAA or CNAME to the standby region) or a global load balancer (Cloudflare, AWS Global Accelerator, GCP Cloud Load Balancing) that health-checks backends and removes failed regions. DNS TTL affects RTO: a 300-second TTL means resolvers that honor it can keep serving the old record for up to five minutes after the change. Low TTLs (60s or less) speed failover but increase DNS query load, and they only help if they were set before the incident, because clients that already cached the record keep it until the old TTL expires.
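The effect of TTL on RTO is simple arithmetic. The sketch below assumes failure is detected after a fixed number of consecutive failed health checks and that resolvers honor the TTL; the function name and the health-check parameters are illustrative.

```python
def worst_case_dns_failover_seconds(check_interval_s: int, failures_to_trip: int, ttl_s: int) -> int:
    """Upper bound on time until all TTL-honoring clients reach the standby."""
    detection = check_interval_s * failures_to_trip  # time to declare the region down
    return detection + ttl_s                         # plus the longest possible cache lifetime


# 10-second health checks, 3 failures to trip.
print(worst_case_dns_failover_seconds(10, 3, ttl_s=300))  # 330
print(worst_case_dns_failover_seconds(10, 3, ttl_s=60))   # 90
```

Under these assumptions, dropping the TTL from 300 to 60 seconds cuts the worst case from five and a half minutes to a minute and a half.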
Database failover is the hard part. Automatic promotion of a read replica to primary requires orchestration (Patroni, RDS Multi-AZ, Aurora Global Database) and careful handling of replication lag — promoting a lagging replica loses data and can cause split-brain if the old primary recovers. Manual promotion with a written runbook is slower but safer for teams without battle-tested automation.
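A promotion guard captures the two risks described above. This is a minimal sketch, assuming the orchestration layer can report replication lag and whether the old primary has been fenced (blocked from accepting writes); both inputs and the function name are hypothetical, and tools such as Patroni implement their own, more thorough checks.

```python
def promotion_decision(lag_seconds: float, rpo_seconds: float, old_primary_fenced: bool) -> str:
    """Decide whether it is safe to promote a replica to primary."""
    if not old_primary_fenced:
        # Two writable primaries would diverge if the old one recovers.
        return "block: fence the old primary first to avoid split-brain"
    if lag_seconds > rpo_seconds:
        # Everything not yet replicated is lost on promotion.
        return "escalate: promoting now would lose more data than the RPO allows"
    return "promote"


print(promotion_decision(lag_seconds=8, rpo_seconds=60, old_primary_fenced=True))
print(promotion_decision(lag_seconds=900, rpo_seconds=60, old_primary_fenced=True))
print(promotion_decision(lag_seconds=8, rpo_seconds=60, old_primary_fenced=False))
```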
Pair DR with graceful shutdown and connection draining so in-flight requests complete before cutover. Idempotent handlers and outbox patterns prevent duplicate side effects when traffic shifts mid-transaction.
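An idempotent handler can be sketched with an idempotency key supplied by the client. The in-memory dictionary here is a stand-in: during a real failover the key store would need to be durable and visible in the region that takes over. All names are hypothetical.

```python
class OrderHandler:
    """Creates an order at most once per idempotency key."""

    def __init__(self) -> None:
        self._results: dict[str, str] = {}  # idempotency key -> order id

    def handle(self, idempotency_key: str, payload: dict) -> str:
        # A retried request returns the original result instead of repeating the side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        order_id = f"order-{len(self._results) + 1}"
        self._results[idempotency_key] = order_id
        return order_id


handler = OrderHandler()
first = handler.handle("req-abc", {"sku": "A1"})
retry = handler.handle("req-abc", {"sku": "A1"})  # client retried after cutover
print(first == retry)  # True
```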
RPO/RTO tier decision table
| Service tier | Typical RPO | Typical RTO | Architecture pattern |
|---|---|---|---|
| Tier 1 (payments, auth) | < 1 minute | < 5 minutes | Sync or near-sync replication, hot standby, automated failover |
| Tier 2 (core product API) | 5–15 minutes | 15–60 minutes | Async replication, warm standby, scripted promotion |
| Tier 3 (internal tools) | 1–24 hours | 2–8 hours | Daily backups + PITR, cold standby, manual restore |
| Tier 4 (analytics, archives) | 24+ hours | 24+ hours | Weekly full backups, rebuild from warehouse |
Document tier assignments in your service catalog. When an incident strikes, everyone should know which runbook applies without debating whether the analytics pipeline deserves the same urgency as checkout.
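One lightweight way to keep tier assignments unambiguous is a machine-readable catalog. The sketch below copies the tier values from the table; the service names and the `tier_for` helper are hypothetical.

```python
# Tier definitions copied from the table above.
TIERS = {
    1: {"rpo": "< 1 minute", "rto": "< 5 minutes", "pattern": "hot standby, automated failover"},
    2: {"rpo": "5-15 minutes", "rto": "15-60 minutes", "pattern": "warm standby, scripted promotion"},
    3: {"rpo": "1-24 hours", "rto": "2-8 hours", "pattern": "cold standby, manual restore"},
    4: {"rpo": "24+ hours", "rto": "24+ hours", "pattern": "rebuild from warehouse"},
}

# Hypothetical service catalog entries.
SERVICE_TIER = {"checkout": 1, "core-api": 2, "admin-tools": 3, "analytics-pipeline": 4}


def tier_for(service: str) -> dict:
    """Look up a service's tier and objectives; unassigned services are an error."""
    if service not in SERVICE_TIER:
        raise LookupError(f"{service!r} has no tier assignment; add it to the catalog")
    tier = SERVICE_TIER[service]
    return {"tier": tier, **TIERS[tier]}


print(tier_for("checkout"))
print(tier_for("analytics-pipeline"))
```

Failing loudly on an unassigned service is deliberate: a missing entry should be found during review, not during an incident.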
DR testing: game days and restore drills
An untested backup is a wish, not a plan. Schedule quarterly restore drills: pick a random backup, restore to an isolated environment, verify data integrity and application boot. Measure actual RTO — it is almost always longer than the spreadsheet estimate.
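Measuring actual RTO during a drill can be as simple as timing each runbook step. In this sketch the steps are placeholder callables; in a real drill they would invoke your restore tooling, which is not modeled here.

```python
import time
from typing import Callable


def run_restore_drill(steps: list[tuple[str, Callable[[], None]]], rto_target_s: float) -> dict:
    """Run each step in order, timing it, and compare the total to the RTO target."""
    timings: dict[str, float] = {}
    start = time.monotonic()
    for name, step in steps:
        step_start = time.monotonic()
        step()  # an exception here aborts the drill, which is itself a finding
        timings[name] = time.monotonic() - step_start
    total = time.monotonic() - start
    return {"timings": timings, "total_s": total, "within_rto": total <= rto_target_s}


# Placeholder steps standing in for real restore actions.
drill = run_restore_drill(
    [
        ("restore database", lambda: time.sleep(0.02)),
        ("verify integrity", lambda: time.sleep(0.01)),
        ("boot application", lambda: time.sleep(0.01)),
    ],
    rto_target_s=60,
)
print(drill["within_rto"], sorted(drill["timings"]))
```

Recording per-step timings shows which part of the runbook consumes the RTO budget.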
Game days simulate full disasters: kill a region, corrupt a database, or revoke credentials. Teams execute the runbook under time pressure. Pair with chaos engineering for continuous smaller fault injections, but game days test the human and procedural layer — paging, communication, decision authority, customer messaging.
Track drill results in error budgets or internal reliability reviews. Failed drills are successes if they expose gaps before a real outage does.
Cloud-native DR patterns
Multi-AZ (single region, multiple availability zones) protects against single-host and single-AZ failures but not regional outages. RPO/RTO for AZ failure is typically minutes with managed services.
Cross-region replication copies data to a secondary region. Object storage (S3 cross-region replication, GCS dual-region buckets) handles static assets. Databases use async replicas or global database products. Application state in Redis or Memcached does not replicate automatically — plan for cache cold-start on failover.
Infrastructure as code lets you rebuild an entire region from version-controlled Terraform or Pulumi modules. Store IaC in git, not only in the cloud console. Secrets come from a vault at deploy time, not baked into AMIs.
Immutable backups with object-lock or WORM storage protect against ransomware that encrypts live systems and online backups. Air-gapped or write-once copies are increasingly expected in audits against frameworks such as SOC 2 and HIPAA, which call for demonstrable backup and recovery controls.
Worked example: regional outage during peak traffic
An e-commerce API runs in us-east-1 with Aurora PostgreSQL (async replica in us-west-2), Redis for sessions, and S3 for product images (replicated cross-region). Peak traffic: 10k orders/hour.
09:14 — AWS status page reports us-east-1 networking degradation. Global load balancer health checks fail on east endpoints within 30 seconds. Automated routing shifts 100% traffic to us-west-2 app tier (active-active frontends). RTO for HTTP: ~45 seconds.
09:18 — West app servers send writes to the replica, which is still read-only, so writes fail. On-call promotes the replica to primary via Aurora Global Database failover (~60 seconds). Replication lag at failure was 8 seconds. At peak, 8 seconds of writes ≈ 22 orders, which need reconciliation from application logs. RPO achieved: ~8 seconds.
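The reconciliation estimate in the 09:18 step follows directly from the peak order rate and the replication lag. A small sketch, with a hypothetical helper name:

```python
def orders_at_risk(orders_per_hour: float, lag_seconds: float) -> float:
    """Estimate how many orders were written during the unreplicated window."""
    orders_per_second = orders_per_hour / 3600
    return orders_per_second * lag_seconds


# 10k orders/hour at peak, 8 seconds of replication lag.
print(round(orders_at_risk(10_000, 8)))  # 22
```

This is an average-rate estimate; the actual count depends on how bursty orders were in those 8 seconds.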
09:25 — Redis in west was empty; users re-authenticate. All session state was lost, so sessions had no recovery point at all, which is acceptable per the tier-2 definition. Customer comms posts a status page update. Post-incident: add Redis cross-region replication for tier-1 auth if policy changes.
Common pitfalls
- Backups that never get restored — corrupt backups discovered only during a real disaster. Automate monthly restore tests.
- Replication mistaken for backup — a deleted table replicates to the replica instantly. You need point-in-time or snapshot backups independent of live replication.
- Single-region everything — multi-AZ is not multi-region. Regional outages are rare but catastrophic when they hit.
- DNS TTL too high during steady state — 24-hour TTL turns a five-minute failover into a day-long partial outage.
- Untested runbooks — steps reference decommissioned tools, wrong hostnames, or people who left the company.
- DR environment drift — standby runs older code or schema because deploys skip the passive region. Tie DR infra to the same CI/CD pipeline as production.
- Ignoring dependencies — your API fails over but the payment provider webhook URL still points at the dead region.
Practitioner checklist
- Document RPO and RTO per service tier with business sign-off.
- Implement 3-2-1 backups with at least one offsite, immutable copy.
- Enable PITR on production databases; verify retention meets max RPO.
- Automate backup monitoring — alert on missed jobs, not just failed jobs.
- Run quarterly restore drills; record actual RTO vs target.
- Keep failover runbooks in a system that survives the outage (not only in the prod wiki).
- Sync DR infrastructure with production via IaC and the same deploy pipeline.
- Test DNS failover TTL; pre-stage low TTL before known maintenance.
- Measure replication lag continuously; alert before lag exceeds RPO.
- Include third-party webhooks, API keys, and DNS in the dependency map.
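The checklist item on alerting for missed backup jobs can be sketched as a staleness check: a job that never ran produces no failure event, so the monitor compares the last successful run against the expected schedule. The job names, the grace period, and the function name are assumptions for illustration.

```python
from datetime import datetime, timedelta


def stale_backup_jobs(expected_jobs: list[str], last_success: dict[str, datetime],
                      expected_every: timedelta, now: datetime,
                      grace: timedelta = timedelta(minutes=30)) -> list[str]:
    """Return jobs with no successful run inside the expected window."""
    stale = []
    for job in expected_jobs:
        last = last_success.get(job)  # None means the job has never succeeded
        if last is None or now - last > expected_every + grace:
            stale.append(job)
    return sorted(stale)


now = datetime(2024, 1, 10, 12, 0)
last_success = {
    "orders-db": datetime(2024, 1, 10, 2, 0),  # ran last night
    "users-db": datetime(2024, 1, 8, 2, 0),    # silently stopped two days ago
}
print(stale_backup_jobs(["orders-db", "users-db", "billing-db"], last_success,
                        expected_every=timedelta(hours=24), now=now))
```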
Key takeaways
- RPO defines acceptable data loss; RTO defines acceptable downtime — both must be business-driven, not engineer-guessed.
- Backups (especially PITR) and replication solve different problems; you need both.
- Standby tier (cold/warm/hot/active-active) is the main cost lever for achieving RTO targets.
- Untested DR plans fail in real incidents — restore drills and game days are non-negotiable.
- Cross-region DR requires consistency trade-offs, planning for cache cold-start, and dependency updates, not just copying the database.
Related reading
- Database replication explained — read replicas, lag, and failover mechanics
- Blue-green and canary deployments explained — zero-downtime release patterns that pair with DR cutovers
- Infrastructure as code (Terraform) explained — reproducible environments for rapid region rebuild
- Observability explained — detect failures fast enough to meet RTO targets