High Availability & Disaster Recovery
Design systems for high availability and implement disaster recovery strategies — from backup/restore to active-active multi-region deployments.
50 min • By Priygop Team • Last updated: Feb 2026
Availability Tiers & Design
- 99.9% (about 8.76 hours/year downtime): Single-region, multi-AZ deployment with auto-scaling and load balancing. Sufficient for most applications
- 99.95% (about 4.38 hours/year): Multi-AZ with automated failover, health checks, and self-healing infrastructure. Requires robust monitoring
- 99.99% (about 52.6 minutes/year): Multi-region active-standby with automated DNS failover. Requires database replication and cross-region deployment
- 99.999% (about 5.26 minutes/year): Multi-region active-active with global load balancing. The most complex and expensive tier, typically reserved for critical financial and healthcare systems
- Key Principle: Each extra 9 cuts the allowed downtime by a factor of 10 and, as a rule of thumb, multiplies cost and complexity by roughly the same factor. Choose the tier your business actually needs, not the highest one
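The downtime figures in the tiers above follow from simple arithmetic: the yearly downtime budget is the unavailable fraction multiplied by the hours in a year. The following is a minimal sketch of that calculation, assuming a 365-day year; the function name `downtime_per_year_hours` is illustrative and not part of any library.

```python
HOURS_PER_YEAR = 365 * 24  # 8760 hours, assuming a non-leap year


def downtime_per_year_hours(availability_percent: float) -> float:
    """Return the yearly downtime budget, in hours, for an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1 - availability_percent / 100
    return unavailable_fraction * HOURS_PER_YEAR


# Print the budget for each tier discussed above.
for target in (99.9, 99.95, 99.99, 99.999):
    hours = downtime_per_year_hours(target)
    print(f"{target}% -> {hours:.2f} hours/year ({hours * 60:.1f} minutes/year)")
```

Each additional 9 divides the budget by ten, which is why the jump from 99.99% to 99.999% leaves only a few minutes per year for every incident, deployment mistake, and failover combined.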
Disaster Recovery Strategies
- Backup & Restore (RPO: hours, RTO: hours-days): Regular backups stored in another region. Cheapest but slowest recovery. Good for non-critical systems
- Pilot Light (RPO: minutes, RTO: hours): Core infrastructure (for example, a replicated database) is always running in the DR region, but compute resources are switched off. Provision and scale up compute on failover
- Warm Standby (RPO: seconds, RTO: minutes): Scaled-down copy of production running in DR region — scale up to full capacity on failover
- Active-Active (RPO: near-zero, RTO: near-zero): Full production in multiple regions simultaneously — traffic served from all regions, automatic failover. Most expensive
- RPO (Recovery Point Objective): Maximum acceptable data loss — how much data can you afford to lose? Determines backup/replication frequency
- RTO (Recovery Time Objective): Maximum acceptable downtime — how fast must you recover? Determines DR strategy complexity
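One way to read the list above is as a lookup: given RPO and RTO targets, pick the cheapest strategy whose typical recovery figures fit inside them. The sketch below illustrates that idea. The concrete thresholds (1 hour, 1 minute, and so on) are assumptions chosen only to match the "hours / minutes / seconds / near-zero" ranges in the list, and the names `Strategy` and `choose_strategy` are hypothetical. Real figures depend on your data volume, tooling, and how often you rehearse failover.

```python
from dataclasses import dataclass
from datetime import timedelta


@dataclass(frozen=True)
class Strategy:
    name: str
    achievable_rpo: timedelta  # assumed typical data-loss window
    achievable_rto: timedelta  # assumed typical time to recover


# Ordered from cheapest to most expensive.
# All numbers are illustrative assumptions, not measured values.
STRATEGIES = [
    Strategy("Backup & Restore", timedelta(hours=1), timedelta(hours=4)),
    Strategy("Pilot Light", timedelta(minutes=1), timedelta(hours=1)),
    Strategy("Warm Standby", timedelta(seconds=1), timedelta(minutes=1)),
    # "Near-zero" is modelled as zero so this entry always qualifies.
    Strategy("Active-Active", timedelta(0), timedelta(0)),
]


def choose_strategy(target_rpo: timedelta, target_rto: timedelta) -> Strategy:
    """Return the cheapest strategy that meets both the RPO and RTO targets."""
    if target_rpo < timedelta(0) or target_rto < timedelta(0):
        raise ValueError("RPO and RTO targets must be non-negative")
    for strategy in STRATEGIES:
        if (strategy.achievable_rpo <= target_rpo
                and strategy.achievable_rto <= target_rto):
            return strategy
    # Unreachable in practice: the last entry satisfies any non-negative target.
    return STRATEGIES[-1]


# Example: tolerate up to 5 minutes of data loss and 2 hours of downtime.
choice = choose_strategy(timedelta(minutes=5), timedelta(hours=2))
print(choice.name)
```

Because the list is ordered by cost, the first match is also the cheapest one, which mirrors the earlier advice to pick the tier the business needs rather than the highest available.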