Data Partitioning (Sharding)

Data in a Distributed World

In the Scalability chapter, we discussed that while application servers are easy to scale horizontally, databases are often scaled vertically. However, there is a limit to vertical scaling. Eventually, a single database server will not be able to handle the amount of data or the volume of read/write traffic.

Data partitioning, also known as sharding, is the primary technique for scaling a database horizontally. It involves breaking up a large database into smaller, more manageable pieces called shards and spreading them across multiple database servers.

Each shard is an independent database, holding a subset of the total data.

Why Shard Your Database?

  1. Overcome Storage Limits: A single server can only store a finite amount of data. Sharding allows you to store a virtually unlimited amount of data by adding more servers.
  2. Improve Performance: By distributing the data, you also distribute the query load. If your queries are directed to the correct shard, each server only has to deal with a fraction of the total traffic, leading to faster response times.
  3. Increase Write Throughput: In a single-leader replication setup, all writes must go to one server. Sharding allows for parallel writes across multiple shards, dramatically increasing the system's overall write throughput.
  4. Improve Availability: If one shard goes down, it only affects the data on that shard. The rest of the database remains available.

Common Sharding Strategies

The key to a successful sharding implementation is the sharding key (or partition key). This is a piece of data from your records (e.g., user_id, zip_code, timestamp) that is used to determine which shard a particular record belongs to.

1. Algorithmic (or Hashed) Sharding

How it works: You apply a hash function to the sharding key and then use the output of that hash to determine the shard. A naive approach is shard_id = hash(sharding_key) % number_of_shards.

  • Example: If you have 4 shards and you are sharding by user_id, you would calculate hash(user_id) % 4. The result (0, 1, 2, or 3) would be the ID of the shard where that user's data is stored.
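The routing rule above can be sketched in a few lines. One detail matters in practice: the hash must be stable across processes and restarts (Python's built-in `hash()` is randomized per process), so this sketch uses `md5` purely as an illustrative stable hash:

```python
import hashlib

NUM_SHARDS = 4  # assumed fixed shard count for this sketch

def shard_for(sharding_key: str) -> int:
    """Map a key to a shard id with a stable hash, modulo the shard count."""
    digest = hashlib.md5(sharding_key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# The same key always routes to the same shard (0, 1, 2, or 3).
print(shard_for("user:12345"))
```

Any stable, well-mixing hash works here; the important property is that the mapping is deterministic and roughly uniform.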

Pros:

  • Uniform Distribution: A good hash function will distribute the data evenly across the shards, preventing "hotspots" (shards that have more data or traffic than others).
  • Simple to Implement: The logic is straightforward.

Cons:

  • Resharding is Difficult: The biggest problem with this approach is what happens when you need to add or remove shards. If you change the number of shards, the result of the modulo operation changes for nearly every key. This means you would have to re-distribute and move almost all of your data, which is a massive and complex operation. This problem is often solved by using Consistent Hashing.
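To illustrate why consistent hashing avoids this, here is a minimal ring sketch (the `md5` hash, virtual-node count, and node names are all assumptions for the example). Each key maps to the first node point clockwise from its hash, so adding a node only takes over the arcs next to that node's points, and only those keys move:

```python
import bisect
import hashlib

class ConsistentHashRing:
    """Minimal consistent-hash ring with virtual nodes (illustrative sketch)."""

    def __init__(self, nodes, vnodes=100):
        self._points = []   # sorted hash points on the ring
        self._owners = {}   # hash point -> node name
        for node in nodes:
            self.add(node, vnodes)

    @staticmethod
    def _hash(key: str) -> int:
        return int(hashlib.md5(key.encode()).hexdigest(), 16)

    def add(self, node: str, vnodes=100):
        # Place several virtual points per node for smoother distribution.
        for i in range(vnodes):
            point = self._hash(f"{node}#{i}")
            bisect.insort(self._points, point)
            self._owners[point] = node

    def shard_for(self, key: str) -> str:
        # First ring point clockwise from the key's hash (wrapping past the end).
        i = bisect.bisect(self._points, self._hash(key)) % len(self._points)
        return self._owners[self._points[i]]

ring = ConsistentHashRing(["db0", "db1", "db2", "db3"])
before = {k: ring.shard_for(k) for k in (f"user:{i}" for i in range(1000))}
ring.add("db4")  # with 5 nodes, only roughly 1/5 of keys should remap
moved = sum(1 for k, owner in before.items() if ring.shard_for(k) != owner)
print(f"{moved} of {len(before)} keys moved")
```

Contrast this with the modulo approach, where adding a fifth shard remaps almost every key.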

2. Dynamic (or Range-Based) Sharding

How it works: You partition the data based on a range of values in the sharding key.

  • Example: You could have one shard for users with names starting A-F, another for G-M, and so on. Or one shard for users in the USA, another for users in Europe.
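A lookup from value to shard is typically a sorted list of range boundaries plus a binary search. A small sketch, assuming the name-based ranges above (shard names and boundaries are illustrative):

```python
import bisect

# Upper bound (inclusive) of each shard's range; the last shard takes the rest.
RANGE_BOUNDS = ["F", "M", "S"]
SHARDS = ["shard-A-F", "shard-G-M", "shard-N-S", "shard-T-Z"]

def shard_for(name: str) -> str:
    """Route a name to the shard whose letter range contains its first letter."""
    first_letter = name[0].upper()
    return SHARDS[bisect.bisect_left(RANGE_BOUNDS, first_letter)]

print(shard_for("Alice"))  # shard-A-F
print(shard_for("Grace"))  # shard-G-M
```

In a real system this boundary table lives in a metadata service, and boundaries are split or merged as shards grow; the lookup itself stays this simple.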

Pros:

  • Good for Range Queries: This approach is very efficient for queries that need to access a contiguous range of data. For example, "find all users with a zip code between 90210 and 90215." All of this data would likely live on the same shard.

Cons:

  • Prone to Hotspots: It can easily lead to an uneven distribution of data. If you have many more users in the USA than in Europe, the USA shard will be much larger and busier. This requires careful selection of the sharding key and ranges.
  • Requires a Lookup Table: You need to maintain a lookup table or service that maps ranges to shards.

3. Entity-Based (or Directory-Based) Sharding

How it works: You group related data together on the same shard. This is sometimes described as creating sharded tenants: each entity (such as a user or customer account) is treated as a tenant whose data is co-located on a single shard.

  • Example: In a social media application, you might decide that all of a user's data—their profile, posts, comments, and followers—should live on the same shard. The user_id would be the sharding key.

Pros:

  • Strong Data Locality: All the data needed for a specific entity is on one server. This makes queries for that entity very fast, as no cross-shard joins are needed.
  • Good Isolation: The activity of one user (e.g., a celebrity with millions of followers) will only impact their own shard, not the entire system.

Cons:

  • Can Still Lead to Hotspots: If one user is extremely active, their shard can become a hotspot.
  • Cross-Entity Queries are Difficult: Queries that involve multiple users (e.g., "find all users who liked posts from these 10 different users") can be very inefficient, as they may require querying multiple shards and combining the results at the application layer.
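Such cross-entity queries are usually handled with a scatter-gather pattern: send the per-shard part of the query to every shard in parallel, then merge the partial results in the application. A minimal sketch with in-memory dicts standing in for shards (the data and query are purely illustrative):

```python
from concurrent.futures import ThreadPoolExecutor

# Hypothetical per-shard data: user_id -> ids of posts that user liked.
SHARD_DATA = [
    {1: ["p1", "p9"]},
    {2: ["p3"]},
    {3: ["p1", "p7"]},
]

def query_shard(shard, user_ids):
    """Run the per-shard part of the query locally on one shard."""
    return [post for uid in user_ids for post in shard.get(uid, [])]

def liked_posts(user_ids):
    """Scatter the query to every shard in parallel, then gather and merge."""
    with ThreadPoolExecutor() as pool:
        partials = pool.map(lambda shard: query_shard(shard, user_ids), SHARD_DATA)
    # De-duplicate and merge at the application layer.
    return sorted({post for partial in partials for post in partial})

print(liked_posts([1, 2, 3]))
```

Note the costs this sketch makes visible: every shard is contacted even if it holds no relevant data, and the merge (de-duplication, sorting, pagination) moves from the database into application code.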

Common Problems with Sharding

Sharding is a powerful technique, but it introduces significant complexity.

  • Cross-Shard Joins: Performing joins across different shards is very expensive and often not supported by the database. This is why it's so important to choose a sharding key that keeps related data together.
  • Transactions: ACID-compliant transactions across multiple shards are extremely difficult to implement.
  • Resharding: As mentioned, adding or removing shards can be a major challenge, often requiring significant downtime or a very complex migration strategy.
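The scale of the resharding problem is easy to demonstrate for naive modulo sharding. This small experiment (with an assumed `md5`-based hash) counts how many keys change shards when growing from 4 to 5 shards:

```python
import hashlib

def shard_for(key: str, num_shards: int) -> int:
    """Naive modulo sharding with a stable hash."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % num_shards

keys = [f"user:{i}" for i in range(10_000)]
moved = sum(1 for k in keys if shard_for(k, 4) != shard_for(k, 5))
print(f"{moved / len(keys):.0%} of keys move when going from 4 to 5 shards")
```

With a uniform hash, roughly 4 out of 5 keys end up on a different shard, which is exactly why consistent hashing (where only about 1/5 of keys would move) is the usual remedy.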

In a system design interview, when your data or traffic volume grows beyond what a single server can handle, you should propose sharding. Be prepared to discuss which sharding strategy you would choose and, most importantly, why you would choose it based on the specific requirements of the application.