Scaling to a Distributed System
Availability is the percentage of time a system is operational and able to serve requests. It's often expressed in "nines", meaning the number of 9s in the uptime percentage: 99.9% is "three nines" and 99.999% is "five nines".
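To make the "nines" concrete, the following minimal sketch converts an availability percentage into the downtime it allows per year. It assumes a 365-day year; the function names are illustrative and not taken from any library.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumption: a non-leap year


def allowed_downtime_seconds(availability_percent: float) -> float:
    """Return the maximum downtime per year, in seconds, for an availability target."""
    return SECONDS_PER_YEAR * (1 - availability_percent / 100)


def format_duration(seconds: float) -> str:
    """Render a duration in the largest unit that keeps the value at or above 1."""
    if seconds >= 86400:
        return f"{seconds / 86400:.2f} days"
    if seconds >= 3600:
        return f"{seconds / 3600:.2f} hours"
    if seconds >= 60:
        return f"{seconds / 60:.2f} minutes"
    return f"{seconds:.2f} seconds"


for percent in (99.0, 99.9, 99.99, 99.999):
    downtime = format_duration(allowed_downtime_seconds(percent))
    print(f"{percent}% uptime allows about {downtime} of downtime per year")
```

Each additional nine cuts the allowed downtime by a factor of ten, from a few days per year at 99% to a few minutes per year at 99.999%.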
In distributed systems, achieving high availability is not about preventing failures, because failures are inevitable. It's about building a system that can tolerate failures and continue to function. This is achieved through redundancy.
Redundancy means having duplicate components so that the system keeps working when one of them fails. There are two main types of redundancy: active-passive and active-active.
What it is: In the active-passive model, you have one active component and one or more passive (standby) components. The active component handles all the traffic, while the passive components are idle, waiting to take over if the active one fails.
Failover: The process of the passive component taking over is called failover. This is orchestrated by a monitoring service that performs regular health checks on the active component. If the health check fails, the monitor triggers the failover process, which typically involves redirecting traffic to the passive component.
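The failover loop described above can be sketched as follows. This is an illustration under stated assumptions, not a production monitor: the Node and FailoverMonitor names are hypothetical, the health check is an injected function rather than a real network probe, and the failure_threshold parameter is an added assumption (requiring several consecutive failures avoids failing over on a single dropped check; setting it to 1 gives the simpler behavior described in the text).

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Node:
    """Hypothetical component with a flag that the health check reads."""
    name: str
    healthy: bool = True


class FailoverMonitor:
    """Checks the active node and promotes a standby after repeated failures."""

    def __init__(self, active: Node, standbys: List[Node],
                 health_check: Callable[[Node], bool],
                 failure_threshold: int = 3) -> None:
        self.active = active
        self.standbys = list(standbys)
        self.health_check = health_check
        self.failure_threshold = failure_threshold  # assumption, see lead-in
        self._consecutive_failures = 0

    def tick(self) -> Node:
        """Run one health check and return the node that should receive traffic."""
        if self.health_check(self.active):
            self._consecutive_failures = 0
            return self.active
        self._consecutive_failures += 1
        if self._consecutive_failures >= self.failure_threshold and self.standbys:
            # Failover: the first standby becomes the new active node.
            self.active = self.standbys.pop(0)
            self._consecutive_failures = 0
        return self.active


primary, standby = Node("lb-1"), Node("lb-2")
monitor = FailoverMonitor(primary, [standby], health_check=lambda n: n.healthy)

primary.healthy = False      # simulate the active node failing
for _ in range(3):           # three failed checks reach the threshold
    monitor.tick()
print(monitor.active.name)   # lb-2
```

In a real deployment, tick would run on a timer, and the promotion step would redirect traffic, for example by moving an IP address to the standby.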
Example: A high-availability load balancer setup. One load balancer is active, and a second, identical one is passive. If the active one fails, the passive one takes its IP address and becomes the new active load balancer.
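Pros: Simple to reason about, because only one component serves traffic at any time.
Cons: The standby capacity sits idle during normal operation, and there is a brief interruption while the failure is detected and failover completes.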
What it is: In the active-active model, all components are active and handle a share of the workload simultaneously.
Example: A pool of application servers behind a load balancer. All servers are actively handling requests. If one server fails, the load balancer simply stops sending traffic to it, and the remaining servers absorb the load.
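As a rough sketch of the active-active example, the pool below hands out servers in round-robin order and skips any server that has been marked down. The ServerPool class and the server names are hypothetical; a real load balancer would also run its own health checks to decide when to mark servers up or down.

```python
from typing import Iterable, List, Set


class ServerPool:
    """Round-robin pool that routes only to servers currently marked healthy."""

    def __init__(self, servers: Iterable[str]) -> None:
        self.servers: List[str] = list(servers)
        self.healthy: Set[str] = set(self.servers)
        self._next = 0  # index of the next server to consider

    def mark_down(self, server: str) -> None:
        """Stop sending traffic to a failed server."""
        self.healthy.discard(server)

    def mark_up(self, server: str) -> None:
        """Resume sending traffic to a recovered server."""
        if server in self.servers:
            self.healthy.add(server)

    def pick(self) -> str:
        """Return the next healthy server, trying each server at most once."""
        for _ in range(len(self.servers)):
            candidate = self.servers[self._next]
            self._next = (self._next + 1) % len(self.servers)
            if candidate in self.healthy:
                return candidate
        raise RuntimeError("no healthy servers available")


pool = ServerPool(["app-1", "app-2", "app-3"])
pool.mark_down("app-2")                  # simulate one server failing
print([pool.pick() for _ in range(4)])   # ['app-1', 'app-3', 'app-1', 'app-3']
```

No failover step is needed here: the failed server is simply skipped, and the remaining servers each take a larger share of the requests.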
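Pros: No capacity sits idle, and a single failure removes only a share of total capacity instead of triggering a failover.
Cons: The remaining components need enough spare capacity to absorb the load of a failed one.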
High availability is not just about one component; it's about building redundancy into every layer of your system.
A single datacenter is a single point of failure. A power outage, network failure, or natural disaster could take your entire application offline.
To achieve very high availability, you must run your application in multiple datacenters or availability zones (AZs). An AZ is one or more discrete datacenters within a region, each with redundant power, networking, and connectivity.
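A back-of-the-envelope calculation shows why multiple AZs help. The sketch below assumes that AZs fail independently and that the application stays up as long as at least one AZ is up. Both are simplifying assumptions, and the 99% figure is illustrative, not a measured value.

```python
def parallel_availability(single: float, copies: int) -> float:
    """Availability of `copies` redundant units when any one of them suffices.

    Assumes failures are independent: the system is down only when
    every copy is down at the same time.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1 - (1 - single) ** copies


for az_count in (1, 2, 3):
    combined = parallel_availability(0.99, az_count)
    print(f"{az_count} AZ(s) at 99% each: {combined * 100:.4f}% combined")
```

Under these assumptions, two AZs at 99% each give about 99.99% combined. Correlated failures, such as a region-wide outage or a bad deployment pushed to every AZ, make the real figure lower.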
As discussed in the Database Replication chapter, you should never run a single database instance.
Your application servers should be stateless and run in an active-active configuration behind a load balancer. This makes them resilient to individual server failures.
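The reason redundancy is needed in every layer can be seen with a short calculation. When a request must pass through every layer, the layers are in series and their availabilities multiply. The figures below are hypothetical, and the sketch assumes the layers fail independently.

```python
import math
from typing import Dict


def series_availability(layers: Dict[str, float]) -> float:
    """End-to-end availability when a request needs every layer to be up.

    Assumes independent failures, so the availabilities multiply.
    """
    return math.prod(layers.values())


# Hypothetical per-layer availabilities, for illustration only.
layers = {
    "load balancer": 0.9999,
    "application servers": 0.9999,
    "database": 0.999,
}

end_to_end = series_availability(layers)
print(f"End-to-end availability: {end_to_end * 100:.3f}%")
```

The end-to-end figure is always lower than the weakest layer's, so one non-redundant component limits the availability of the whole system.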
In a system design interview, you should always be thinking about redundancy. When you introduce a component into your design (a load balancer, a database, a cache), you should immediately ask yourself: "What happens if this component fails?" The answer should always involve redundancy: either a standby component ready to take over, or active peers that can absorb the load. This demonstrates that you are designing for resilience and high availability.