Database Replication

Data in a Distributed World

A single database server is a single point of failure. If it crashes, your entire application goes down, and you could lose data. Database replication is the process of copying data from a primary database server to one or more secondary servers, known as replicas.

This is a fundamental technique for achieving high availability, fault tolerance, and read scalability.

Why Replicate Your Database?

  1. High Availability & Fault Tolerance: If the primary server fails, a replica can be promoted to become the new primary, allowing the application to continue functioning with minimal downtime. This process is called failover.
  2. Read Scalability: For read-heavy workloads, you can direct all write operations to the primary server and distribute read operations across the replicas. This allows you to scale out your read capacity by simply adding more replicas.
  3. Disaster Recovery: You can maintain a replica in a different geographic location. If a natural disaster takes out your primary datacenter, you can fail over to the replica in another region.
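The failover process mentioned above can be sketched in a few lines. This is a deliberately naive illustration with hypothetical `cluster` and node objects; production tooling adds fencing, consensus, and health-check hysteresis on top of this core idea.

```python
def promote_on_failure(cluster):
    """Naive failover: if the primary is unreachable, promote the most
    up-to-date replica.

    `cluster` is a hypothetical dict: {"primary": node, "replicas": [nodes]}.
    Each node is assumed to expose is_alive() and a replication_position
    counter (how much of the primary's log it has applied).
    """
    if cluster["primary"].is_alive():
        return cluster["primary"]
    # Promote the replica that has applied the most of the primary's log,
    # to minimize data loss on promotion.
    new_primary = max(cluster["replicas"], key=lambda r: r.replication_position)
    cluster["replicas"].remove(new_primary)
    cluster["primary"] = new_primary
    return new_primary
```

Picking the most caught-up replica matters: promoting a lagging one would permanently discard any writes it never received.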

Common Replication Architectures

There are several ways to configure replication, each with its own trade-offs.

1. Leader-Follower (or Master-Slave) Replication

This is the most common replication model.

How it works:

  • One server is designated as the Leader (or Master). It is responsible for handling all write operations.
  • One or more other servers are designated as Followers (or Slaves). They receive a copy of all data changes from the leader and apply them to their own local copy of the data.
  • Clients can read from either the leader or the followers, but can only write to the leader.
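The routing rule above (writes to the leader, reads spread across followers) is often implemented as a thin wrapper in the application or a proxy layer. A minimal sketch, assuming hypothetical connection objects that expose an `execute(sql)` method:

```python
import random

class LeaderFollowerRouter:
    """Route writes to the leader and spread reads across followers.

    `leader` and `followers` are hypothetical connection objects with an
    execute(sql) method; any real database driver fits the same shape.
    """

    def __init__(self, leader, followers):
        self.leader = leader
        self.followers = followers

    def write(self, sql):
        # All writes must go to the single leader.
        return self.leader.execute(sql)

    def read(self, sql):
        # Reads can be served by any follower; fall back to the leader
        # if no followers are configured.
        node = random.choice(self.followers) if self.followers else self.leader
        return node.execute(sql)
```

Adding read capacity then amounts to appending another connection to `followers`.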

Replication Lag: A critical concept in this model is replication lag. This is the delay between when a write is committed on the leader and when it becomes visible on a follower. This lag can be a source of inconsistency. For example, a user might write a post, then immediately reload their profile page (which reads from a follower), and not see their new post because the change hasn't been replicated yet. This is a form of eventual consistency.
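A common mitigation for the stale-read scenario above is read-your-writes routing: after a user writes, pin that user's reads to the leader until the lag window has safely passed. The sketch below is illustrative; the fixed time window and the per-user bookkeeping are assumptions, not a standard API.

```python
import time

class ReadYourWritesRouter:
    """After a user writes, serve that same user's reads from the leader
    until the assumed replication-lag window has elapsed; everyone else
    reads from the follower."""

    def __init__(self, leader, follower, lag_window_secs=5.0):
        self.leader = leader
        self.follower = follower
        self.lag_window = lag_window_secs
        self.last_write_at = {}  # user_id -> monotonic timestamp of last write

    def write(self, user_id, sql):
        self.last_write_at[user_id] = time.monotonic()
        return self.leader.execute(sql)

    def read(self, user_id, sql):
        last = self.last_write_at.get(user_id)
        if last is not None and time.monotonic() - last < self.lag_window:
            # Recent writer: the follower may not have caught up yet.
            return self.leader.execute(sql)
        return self.follower.execute(sql)
```

This trades a little extra leader load for the guarantee that users always see their own writes.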

Pros:

  • Simple to understand and implement.
  • Excellent for scaling reads.
  • Clear separation of concerns (writes go to one place, reads can go to many).

Cons:

  • The leader is a single point of failure for writes. If it goes down, you cannot accept any new writes until a failover is complete.
  • Replication lag can cause stale reads.
  • Failover is not always instantaneous and can be complex to automate correctly.

2. Multi-Leader (or Master-Master) Replication

How it works:

  • Two or more servers are designated as leaders. Each leader can accept write operations.
  • Each leader replicates its writes to the other leaders.

When to use it: This model is often used in multi-datacenter deployments. You can have a leader in each datacenter, allowing for low-latency writes for users in that region.

Pros:

  • Low Write Latency: Users can write to their nearest datacenter.
  • High Availability for Writes: If one leader goes down, the other leaders can still accept writes.

Cons:

  • Write Conflicts: This is the biggest challenge. What happens if two users try to modify the same piece of data on two different leaders at the same time? This requires a conflict resolution strategy, which can be very complex to get right.
  • The replication topology is much more complex than leader-follower.
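One widely used (and lossy) answer to the write-conflict problem above is last-write-wins: pick a deterministic winner by timestamp. A minimal sketch, assuming each version carries a timestamp and a leader ID used as a tiebreaker (both illustrative field names):

```python
def last_write_wins(local, remote):
    """Resolve a conflict between two versions of the same row.

    Each version is a dict with 'value', 'ts' (a wall-clock timestamp),
    and 'leader_id' (a deterministic tiebreaker when timestamps tie).
    Note that LWW silently discards the losing write, and wall clocks
    can skew between datacenters -- two reasons it is considered lossy.
    """
    if (local["ts"], local["leader_id"]) >= (remote["ts"], remote["leader_id"]):
        return local
    return remote
```

The key property is that every leader, given the same two versions, converges on the same winner; more careful strategies (version vectors, CRDTs, application-level merges) preserve both writes instead of dropping one.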

3. Leaderless (or Peer-to-Peer) Replication

How it works:

  • There are no designated leaders. All replicas are peers and can accept both reads and writes for any piece of data.
  • This model is common in NoSQL databases like Cassandra and DynamoDB.

Quorum for Reads and Writes: To maintain consistency, these systems use a concept called a quorum.

  • Let N be the number of replicas for a piece of data.
  • Let W be the write quorum. A write must be acknowledged by W replicas to be considered successful.
  • Let R be the read quorum. A read must query R replicas and return the most recent version among their responses.

The system is configured such that W + R > N; this condition is known as quorum overlap. Because the W replicas that acknowledged a write and the R replicas queried by a read must intersect in at least one node, every read is guaranteed to see at least one replica holding the latest write, preventing stale reads.

Example: N = 3. You could set W = 2 and R = 2.

  • A write must succeed on 2 out of 3 replicas.
  • A read must query 2 out of 3 replicas.
  • Since 2 + 2 > 3, the read quorum is guaranteed to overlap with the write quorum by at least one node, ensuring it gets the latest data.
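The worked example above can be made concrete with a toy versioned store. The sketch below is illustrative (function names and the per-key version counter are assumptions): writes land on W replicas, reads query R replicas and keep the highest version seen.

```python
N, W, R = 3, 2, 2
assert W + R > N  # the quorum overlap condition

# Each replica maps key -> (version, value); version 0 means "no data".
replicas = [{} for _ in range(N)]

def quorum_write(key, value, version, targets):
    """Apply the write on the first W replicas in `targets`."""
    for i in targets[:W]:
        replicas[i][key] = (version, value)

def quorum_read(key, targets):
    """Query R replicas and return the (version, value) with the
    highest version among the responses."""
    responses = [replicas[i].get(key, (0, None)) for i in targets[:R]]
    return max(responses)  # tuples compare by version first

# The write lands on replicas 0 and 1; a later read queries replicas 1 and 2.
# Since 2 + 2 > 3, at least one queried replica (here, replica 1) must
# hold the latest write, so the read cannot return stale data.
quorum_write("user:42", "alice", version=1, targets=[0, 1])
```

Real systems such as Cassandra replace the integer version with per-write timestamps or vector clocks, but the overlap arithmetic is the same.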

Pros:

  • Extreme Fault Tolerance and Availability: The system can tolerate the failure of multiple nodes and can still accept reads and writes.
  • Low Latency: Clients can read and write from any replica, ideally the closest one.

Cons:

  • Consistency Trade-offs: Achieving strong consistency can be more complex and may require sacrificing some availability.
  • Conflict Resolution: Similar to multi-leader, write conflicts can occur and require strategies like "last write wins" to resolve them.

In an interview, discussing the trade-offs between these replication strategies—especially the consistency guarantees vs. performance and availability—is a key way to demonstrate your depth of knowledge.