Rate Limiting
Operating a Production System
A rate limiter is a mechanism that controls the rate of traffic sent or received over a network. In the context of system design, it's a tool used to cap the number of requests a user or client can make to an API within a given time window.
Rate limiting is a critical defense for your services. It's used to prevent resource starvation and ensure the availability and reliability of your system.
Why Do You Need Rate Limiting?
- Prevent Denial-of-Service (DoS) Attacks: A malicious actor could try to overwhelm your service by sending a huge number of requests in a short period of time. A rate limiter can block such an attack at the source.
- Prevent Resource Starvation from Bugs: It's not always malicious. A bug in a client application could cause it to send requests in an infinite loop, accidentally overwhelming your service.
- Ensure Fair Usage and Quality of Service: For public APIs, rate limiting ensures that no single user can monopolize the system's resources, which could degrade the service for other users.
- Manage Costs: In a cloud environment where you pay per request or per CPU cycle, rate limiting can prevent your costs from spiraling out of control due to unexpected traffic spikes.
- Control Downstream Service Load: A service might impose a rate limit to ensure it doesn't overload a downstream service or database that it depends on.
Where to Implement a Rate Limiter
Rate limiting can be implemented at various levels:
- On the Client: The client can be designed to self-impose a limit on how often it calls an API. This is good practice, but you can't trust clients to behave correctly.
- In an API Gateway: This is the most common and effective place to implement rate limiting. The API Gateway is the single entry point to your system, so it can enforce limits for all incoming traffic before it even reaches your services.
- On the Service Itself: A service can have its own rate limiter to protect itself and its downstream dependencies.
Common Rate Limiting Algorithms
The core of a rate limiter is an algorithm that tracks the number of requests and the time they were made. Here are some of the most common algorithms:
1. Fixed Window Counter
How it works:
- The timeline is divided into fixed-size time windows (e.g., one minute).
- A counter is maintained for each client for each window.
- If the counter exceeds the limit within the window, further requests are rejected.
- At the end of the window, the counter is reset.
Pros:
- Very simple to implement.
Cons:
- Vulnerable to Bursts at the Edges: A client could send a burst of requests at the end of one window and another burst at the beginning of the next window, effectively sending twice the allowed number of requests in a short period. For example, if the limit is 100 requests/minute, a client could send 100 requests at 0:59 and another 100 at 1:01.
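The fixed window counter can be sketched in a few lines. This is a minimal in-memory illustration (the class and method names are not from any particular library); the `now` parameter is exposed so the behavior is easy to follow:

```python
from collections import defaultdict

class FixedWindowLimiter:
    """Allow at most `limit` requests per `window_seconds` per client."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        # (client_id, window_index) -> request count in that window
        self.counters = defaultdict(int)

    def allow(self, client_id, now):
        # Requests falling in the same fixed window share one counter.
        window_index = int(now // self.window)
        key = (client_id, window_index)
        if self.counters[key] < self.limit:
            self.counters[key] += 1
            return True
        return False
```

Note how a new window index implicitly "resets" the counter, which is exactly what enables the edge-burst problem: a client can exhaust the limit at the end of one window and again at the start of the next.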
2. Sliding Window Log
How it works:
- The rate limiter stores a timestamp for each request from a client in a list (or log).
- When a new request comes in, the algorithm removes all timestamps from the list that are older than the current time window.
- It then counts the number of remaining timestamps in the list. If the count is less than the limit, the new request is accepted and its timestamp is added to the list. Otherwise, it's rejected.
Pros:
- It provides a very accurate rate limit and solves the edge burst problem of the fixed window counter.
Cons:
- It can be very memory-intensive, as you have to store a timestamp for every single request.
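The sliding window log maps naturally onto a per-client deque of timestamps. A minimal sketch (illustrative names, not a library API):

```python
from collections import defaultdict, deque

class SlidingWindowLogLimiter:
    """Allow at most `limit` requests in any rolling `window_seconds` span."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.logs = defaultdict(deque)  # client_id -> request timestamps

    def allow(self, client_id, now):
        log = self.logs[client_id]
        # Evict timestamps that have fallen out of the rolling window.
        while log and log[0] <= now - self.window:
            log.popleft()
        if len(log) < self.limit:
            log.append(now)
            return True
        return False
```

The memory cost is visible here: one stored timestamp per accepted request, which is what the sliding window counter below avoids.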
3. Sliding Window Counter
This is a hybrid approach that combines the low memory usage of the fixed window counter with the better accuracy of the sliding window log.
How it works:
- It maintains a counter for the current window and also considers a weighted value of the counter from the previous window.
- For example, if a request arrives 30% of the way through the current window, the algorithm might estimate the current rate as `(70% * counter_in_previous_window) + counter_in_current_window`.
- This rolling estimate smooths out the edge-burst issue of the fixed window counter.
Pros:
- Good balance between performance, memory usage, and accuracy.
Cons:
- It's an approximation: it assumes requests in the previous window were evenly distributed, which isn't always true. In practice, it's accurate enough for most use cases.
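The weighted-estimate formula above translates directly into code. A minimal sketch (illustrative names; only two counters per client are kept, rather than a timestamp per request):

```python
class SlidingWindowCounterLimiter:
    """Approximate rolling limit using current + weighted previous window."""

    def __init__(self, limit, window_seconds):
        self.limit = limit
        self.window = window_seconds
        self.counts = {}  # (client_id, window_index) -> count

    def allow(self, client_id, now):
        idx = int(now // self.window)
        elapsed_fraction = (now % self.window) / self.window
        prev = self.counts.get((client_id, idx - 1), 0)
        curr = self.counts.get((client_id, idx), 0)
        # e.g., 30% into the window: estimate = 70% * prev + curr
        estimated = prev * (1 - elapsed_fraction) + curr
        if estimated < self.limit:
            self.counts[(client_id, idx)] = curr + 1
            return True
        return False
```

With a limit of 100/minute and a full previous window, a request arriving 30% into the new window sees an estimated rate of 70 + current count, so only about 30 more requests are admitted early in the window.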
4. Token Bucket
This is a very popular and effective algorithm.
How it works:
- A "bucket" is created for each client. The bucket has a certain capacity (the burst limit).
- Tokens are added to the bucket at a fixed, regular rate (the replenish rate).
- When a request arrives, it must take one token from the bucket to be processed.
- If the bucket is empty, the request is rejected.
Pros:
- Smooths out traffic: It naturally handles bursts of traffic (up to the bucket's capacity) while enforcing an average rate over time.
- Straightforward to adapt to a distributed setting (for example, with a centralized store holding each bucket's state).
- Conceptually simple.
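A common trick is to refill the bucket lazily on each request instead of running a background timer, so only a token count and a last-refill timestamp are stored. A minimal single-bucket sketch (illustrative names; `now` is injected for clarity):

```python
class TokenBucket:
    """Bucket with `capacity` tokens, refilled at `refill_rate` tokens/sec."""

    def __init__(self, capacity, refill_rate, now):
        self.capacity = capacity
        self.refill_rate = refill_rate
        self.tokens = capacity  # start full: an initial burst is allowed
        self.last_refill = now

    def allow(self, now):
        # Lazily add the tokens accrued since the last request.
        elapsed = now - self.last_refill
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now
        if self.tokens >= 1:
            self.tokens -= 1
            return True
        return False
```

The capacity sets the maximum burst size, while the refill rate enforces the long-run average; in production, `now` would typically come from a monotonic clock such as `time.monotonic()`.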
Distributed Rate Limiting
In a distributed system, you can't store the counters or token buckets in the memory of a single server. If you have multiple API Gateway instances, they need to coordinate to enforce a global rate limit for a specific client.
The standard solution is to use a centralized, high-speed data store like Redis.
- Each gateway instance, before processing a request, calls Redis to check and update the client's counter or token count.
- Using an atomic Redis operation (such as `INCR`) to update the value prevents race conditions between gateway instances.
- Because Redis is extremely fast, this adds minimal latency while ensuring that the rate limit is applied consistently across all gateways.
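The gateway-side logic can be sketched as a fixed-window counter keyed by client and window index. To keep the example self-contained, `FakeRedis` below is an in-memory stand-in that mimics the `INCR`/`EXPIRE` semantics the pattern relies on; a real deployment would issue the same commands through a Redis client, and the key format is an assumption for illustration:

```python
class FakeRedis:
    """In-memory stand-in mimicking Redis INCR/EXPIRE for illustration."""

    def __init__(self):
        self.data = {}  # key -> (value, expiry timestamp or None)

    def incr(self, key, now):
        value, expiry = self.data.get(key, (0, None))
        if expiry is not None and now >= expiry:
            value, expiry = 0, None  # expired key starts over
        value += 1
        self.data[key] = (value, expiry)
        return value

    def expire(self, key, seconds, now):
        value, _ = self.data.get(key, (0, None))
        self.data[key] = (value, now + seconds)

def allow_request(store, client_id, limit, window, now):
    # All gateway instances share `store`, so the limit is global.
    key = f"ratelimit:{client_id}:{int(now // window)}"
    count = store.incr(key, now)
    if count == 1:
        store.expire(key, window, now)  # let stale windows clean themselves up
    return count <= limit
```

Because `INCR` is atomic, two gateway instances incrementing the same key concurrently cannot both observe a count under the limit when only one slot remains; more complex policies (like a shared token bucket) are often implemented as a Lua script so the read-modify-write happens atomically server-side.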
In a system design interview, when you design an API, you should always consider rate limiting. Mentioning that you would place a rate limiter at the API Gateway and use a token bucket algorithm with a centralized Redis store is a very strong and complete answer.