Availability Patterns (Failover, Redundancy)
Scaling to a Distributed System
Availability is a measure of the percentage of time a system is operational and able to serve requests. It's often expressed in "nines":
- 99% ("two nines"): ~3.65 days of downtime per year
- 99.9% ("three nines"): ~8.77 hours of downtime per year
- 99.99% ("four nines"): ~52.6 minutes of downtime per year
- 99.999% ("five nines"): ~5.26 minutes of downtime per year
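The downtime figures above follow directly from the availability percentage. Here is a minimal sketch of that arithmetic; the function name `downtime_per_year` is hypothetical, and the 365.25-day year is an assumption chosen because it reproduces the numbers in the list (a 365-day year gives about 8.76 hours for three nines instead).

```python
from datetime import timedelta


def downtime_per_year(availability_percent: float,
                      days_per_year: float = 365.25) -> timedelta:
    """Return the downtime budget per year for an availability target.

    days_per_year defaults to 365.25, an assumption that matches the
    rounded figures in the list above.
    """
    if not 0.0 <= availability_percent <= 100.0:
        raise ValueError("availability_percent must be between 0 and 100")
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return timedelta(days=days_per_year * unavailable_fraction)


# Print the yearly downtime budget, in minutes, for each "nines" level.
for target in (99.0, 99.9, 99.99, 99.999):
    budget = downtime_per_year(target)
    minutes = budget.total_seconds() / 60
    print(f"{target}% -> {minutes:.2f} minutes per year")
```

Each additional nine cuts the downtime budget by a factor of ten, which is why every extra nine is much harder to reach than the previous one.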
In distributed systems, achieving high availability is not about preventing failures, because failures are inevitable. It's about building a system that can tolerate failures and continue to function. This is achieved through redundancy.
Redundancy: The Core of High Availability
Redundancy means having duplicate components, so that if one fails, another can take over its work. There are two main types of redundancy:
1. Active-Passive Redundancy
What it is: In this model, you have one active component and one or more passive (standby) components. The active component handles all the traffic, while the passive components are idle, waiting to take over if the active one fails.
- Analogy: A car's spare tire. It's there and ready to go, but it's not used during normal operation.
Failover: The process of the passive component taking over is called failover. This is orchestrated by a monitoring service that performs regular health checks on the active component. If the health check fails, the monitor triggers the failover process, which typically involves redirecting traffic to the passive component.
Example: A high-availability load balancer setup. One load balancer is active, and a second, identical one is passive. If the active one fails, the passive one takes its IP address and becomes the new active load balancer.
Pros:
- Relatively simple to implement and understand.
Cons:
- Resource Inefficiency: The passive components are idle resources that you are paying for but not using for most of the time.
- Failover Time: Failover is not always instantaneous. It can take time to detect the failure and switch over, during which the service may be unavailable.
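The failover flow described in this section can be sketched as a small monitor. This is an illustration under stated assumptions, not how any particular product works: `FailoverMonitor`, `check_active`, and `promote_standby` are hypothetical names, and requiring several consecutive failed checks before failing over is a common convention that the text above does not prescribe.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class FailoverMonitor:
    """Promotes the standby after several consecutive failed health checks."""
    check_active: Callable[[], bool]     # hypothetical probe: True if the active node is healthy
    promote_standby: Callable[[], None]  # hypothetical action: redirect traffic to the standby
    failure_threshold: int = 3           # assumed value; tune to taste
    consecutive_failures: int = 0
    failed_over: bool = False

    def tick(self) -> None:
        """Run one health check. Call this once per check interval."""
        if self.failed_over:
            return  # the standby is already serving traffic
        if self.check_active():
            self.consecutive_failures = 0  # a healthy check resets the count
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.promote_standby()
            self.failed_over = True


def worst_case_detection_seconds(interval_s: float, failure_threshold: int) -> float:
    """Upper bound on the time from failure to detection, ignoring probe timeouts."""
    return interval_s * failure_threshold
```

With a 5-second check interval and a threshold of 3, detection alone can take up to 15 seconds, before any time spent redirecting traffic. A lower threshold shortens that window but makes a false failover more likely when a single check fails for a transient reason.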
2. Active-Active Redundancy
What it is: In this model, all components are active and are handling a share of the workload simultaneously.
- Analogy: A team of rowers in a boat. If one rower stops, the others can continue rowing to keep the boat moving, albeit at a slightly slower pace.
Example: A pool of application servers behind a load balancer. All servers are actively handling requests. If one server fails, the load balancer stops sending traffic to it, and the remaining servers absorb the load, provided they have enough spare capacity to do so.
Pros:
- Resource Efficiency: All components are actively contributing, so you get the full value of your resources.
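- Near-Instant Failover: There is no "failover" event in the traditional sense, because no standby has to be promoted. Once the load balancer's health checks detect a failed component, it stops routing traffic to it. The system degrades gracefully as failures occur, without any single failure causing a complete outage.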
- Scalability: This model is inherently scalable. You can add more active components to handle more load.
Cons:
- Increased Complexity: This is a distributed system, which comes with all the associated challenges of load balancing, data consistency, and service discovery.
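To make the active-active model concrete, here is a minimal sketch of a round-robin pool that skips servers marked as failed. `RoundRobinPool` and the server names are hypothetical, and a real load balancer would mark servers up or down based on its own health checks rather than explicit calls.

```python
class RoundRobinPool:
    """Round-robin selection over a pool, skipping servers marked down."""

    def __init__(self, servers):
        self.servers = list(servers)
        self.healthy = set(self.servers)
        self._next = 0  # index of the next server to try

    def mark_down(self, server) -> None:
        """Stop routing to a server, e.g. after a failed health check."""
        self.healthy.discard(server)

    def mark_up(self, server) -> None:
        """Resume routing to a known server once it has recovered."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self):
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            server = self.servers[self._next % len(self.servers)]
            self._next += 1
            if server in self.healthy:
                return server
        raise RuntimeError("no healthy servers available")


pool = RoundRobinPool(["app-1", "app-2", "app-3"])
pool.mark_down("app-2")                 # simulate one server failing
print([pool.pick() for _ in range(4)])  # traffic now alternates between the survivors
```

Note the capacity implication: with three servers, losing one raises each survivor's share of traffic from one third to one half, so the pool needs that much headroom for the degradation to stay graceful.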
Designing for Availability Across the Stack
High availability is not just about one component; it's about building redundancy into every layer of your system.
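One way to see why every layer matters: if a request has to pass through several layers and each one must be up, the availabilities multiply. The sketch below assumes the layers fail independently, which is a simplification, and the function name `stack_availability` and the per-layer figures are made up for illustration.

```python
import math


def stack_availability(layer_availabilities) -> float:
    """Availability of a request path that needs every layer to be up.

    Assumes layers fail independently; each value is a fraction such as 0.999.
    """
    return math.prod(layer_availabilities)


# Hypothetical figures: load balancer, application tier, database,
# each at three nines on its own.
layers = [0.999, 0.999, 0.999]
print(f"end-to-end availability: {stack_availability(layers):.6f}")
```

Three layers at three nines each give roughly 99.7% end to end, which is worse than any single layer. The weakest layer sets the ceiling, so redundancy in only one place does not buy much.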
Datacenter Redundancy
A single datacenter is a single point of failure. A power outage, network failure, or natural disaster could take your entire application offline.
To achieve very high availability, you must run your application in multiple datacenters or availability zones (AZs). An AZ is one or more discrete data centers with redundant power, networking, and connectivity in a region.
- Multi-AZ Deployment: You run active-active or active-passive replicas of your services across multiple AZs within the same geographic region. This protects you from the failure of a single datacenter. If one AZ goes down, you can continue to serve traffic from the other(s). This is a standard practice for most cloud-based applications.
- Multi-Region Deployment: For even higher availability and disaster recovery, you can replicate your system across multiple geographic regions (e.g., US East and US West). This protects you from large-scale regional disasters. It can also reduce latency for users by serving them from their nearest region.
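The benefit of adding AZs or regions can be estimated with a simple formula: if any one replica is enough to serve traffic, the system is down only when all replicas are down at once. The sketch below assumes replicas fail independently, which is optimistic because shared dependencies and bad deployments can take out several locations together. The name `combined_availability` and the 99% figure are illustrative.

```python
def combined_availability(replica_availability: float, replicas: int) -> float:
    """Availability of N redundant replicas where any single one can serve traffic.

    Assumes independent failures: the system is down only if every replica is down.
    """
    if not 0.0 <= replica_availability <= 1.0:
        raise ValueError("replica_availability must be between 0 and 1")
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return 1.0 - (1.0 - replica_availability) ** replicas


# Hypothetical example: each AZ is available 99% of the time on its own.
for n in (1, 2, 3):
    print(f"{n} AZ(s): {combined_availability(0.99, n):.6f}")
```

Under these assumptions, two AZs at 99% each reach 99.99%, and three reach 99.9999%. Treat those numbers as an upper bound rather than a prediction.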
Database Redundancy
As discussed in the Database Replication chapter, you should never run a single database instance.
- Use Leader-Follower replication with automatic failover to a follower if the leader fails.
- For even higher availability, consider a Multi-Leader or Leaderless architecture, especially in a multi-region setup.
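As a rough illustration of one step in automatic leader failover, the sketch below picks which follower to promote: the healthy follower that has replicated the most data. The `Follower` type, its fields, and `choose_new_leader` are hypothetical, and this covers only the selection step. A real system also has to agree on the choice across nodes and prevent the old leader from accepting writes.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Follower:
    name: str
    healthy: bool
    replicated_offset: int  # hypothetical: how far into the leader's log this follower has applied


def choose_new_leader(followers: Sequence[Follower]) -> Optional[Follower]:
    """Pick the healthy follower with the most replicated data, or None if none is healthy."""
    candidates = [f for f in followers if f.healthy]
    if not candidates:
        return None
    return max(candidates, key=lambda f: f.replicated_offset)


followers = [
    Follower("db-2", healthy=True, replicated_offset=1040),
    Follower("db-3", healthy=True, replicated_offset=1052),
    Follower("db-4", healthy=False, replicated_offset=1060),
]
new_leader = choose_new_leader(followers)
print(new_leader.name if new_leader else "no healthy follower")
```

Preferring the most up-to-date follower limits how many recent writes are lost when replication is asynchronous, since any writes the chosen follower had not yet received are gone after promotion.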
Application Server Redundancy
Your application servers should be stateless and run in an active-active configuration behind a load balancer. This makes them resilient to individual server failures.
In a system design interview, you should always be thinking about redundancy. When you introduce a component into your design (a load balancer, a database, a cache), you should immediately ask yourself: "What happens if this component fails?" The answer should almost always involve redundancy: either a standby component ready to take over, or additional active instances that can absorb the load. This demonstrates that you are designing for resilience and high availability.