Consistent Hashing

Data in a Distributed World

In the previous chapter on sharding, we saw that the simple hash(key) % N approach to distributing data across N servers has a major flaw: when you add or remove a server, N changes, and nearly all of your keys map to a different server. This requires a massive reshuffling of data, which is impractical for large-scale systems.

Consistent Hashing is a specialized hashing technique that solves this problem and is a cornerstone of many modern distributed systems, including CDNs, load balancers, and NoSQL databases like Cassandra and DynamoDB.

How Consistent Hashing Works

Instead of a linear mapping of hash values to servers, consistent hashing maps both the servers and the keys onto a circular space, often called a hash ring.

Here's the process:

  1. Map Servers to the Ring: Each server is assigned a position on the ring based on a hash of its identifier (e.g., its IP address or name).
  2. Map Keys to the Ring: When a piece of data needs to be stored, you hash its key to get a position on the same ring.
  3. Find the Next Server: To determine which server should store the key, you travel clockwise around the ring from the key's position until you find the first server. That server is responsible for that key.
A diagram showing keys and servers mapped to a hash ring. Each key is assigned to the next server found clockwise on the ring.Hash RingS1S2S3K1K2K3K4

The Magic of Adding and Removing Servers

The real power of consistent hashing becomes apparent when you change the number of servers.

Adding a Server

When you add a new server (e.g., S4), it gets hashed to a new position on the ring.

  • The only keys that are affected are the ones that now fall between the new server (S4) and the server immediately counter-clockwise to it (S3 in the diagram).
  • Only the keys that S3 was previously responsible for now need to be redistributed. A portion of them will now be assigned to S4.
  • The keys on all other servers (S1 and S2) are completely unaffected.
A diagram showing a new server (S4) being added to the hash ring, and how only a subset of keys need to be remapped.Hash RingKeys in this range areremapped from S1 to S4S1S2S3S4(New)K1K2K3K4

Removing a Server

When a server fails or is removed (e.g., S1), its keys need to be reassigned.

  • Under the consistent hashing scheme, all of S1's keys are simply assigned to the next server clockwise on the ring (S2).
  • Again, the keys on all other servers (S3 and S4) are completely unaffected.

The key takeaway: With consistent hashing, adding or removing a server only affects its immediate neighbors on the ring. This minimizes the amount of data that needs to be moved, making the system much more scalable and resilient.

Improving the Basic Model: Virtual Nodes

The basic consistent hashing algorithm has one potential problem: if servers are not distributed evenly on the ring, it can lead to an uneven distribution of data. One server might end up with a much larger share of the keys than others, creating a hotspot.

To solve this, we introduce the concept of virtual nodes (or replicas).

  • Instead of mapping each server to a single point on the ring, we map it to multiple virtual nodes, each at a different position.
  • For example, if we have 3 servers and use 3 virtual nodes for each, we would place 9 virtual nodes on the ring (S1a, S1b, S1c, S2a, S2b, S2c, etc.).
  • When a key is assigned, it's assigned to the next virtual node on the ring, which then maps back to a physical server.
A diagram showing how keys and virtual nodes are placed on a hash ring. Arrows show the clockwise lookup from a key to its assigned virtual node.Hash RingS1aS1bS1cS2aS2bS2cS3aS3bS3cK1K2K3K4K5K6

Benefits of Virtual Nodes:

  1. More Uniform Data Distribution: By increasing the number of points on the ring, the keys are spread out much more evenly across the physical servers, reducing the risk of hotspots.
  2. Smoother Rebalancing: When a server is added or removed, the load is distributed more evenly among the remaining servers, as the removed server's virtual nodes were likely spread all around the ring.
  3. Heterogeneity: It allows you to account for servers with different capacities. A more powerful server can be assigned more virtual nodes, so it will receive a proportionally larger share of the data.

In a system design interview, you don't need to implement the consistent hashing algorithm from scratch. However, you should be able to explain what it is, what problem it solves, and why it's superior to a simple modulo-based hashing approach for any system that requires dynamic scaling. Mentioning virtual nodes as a refinement to the basic algorithm is a great way to show a deeper level of understanding.