Median from Stream
Heaps (Priority Queues)
Life rarely stops to let us think. Like a running median, we have to stay centered as the data flows.
Imagine you're building a system to monitor real-time transaction amounts and you need to display the running median price. As thousands of new transactions pour in every second, re-sorting the entire list to find the middle value is not an option. You need a way to instantly access the median as the data grows.
Problem: Find Median from Data Stream (LeetCode Link): Design a data structure that supports adding new numbers and finding the median of all elements seen so far. 💡 HintHow can you divide the numbers into two groups, a "small half" and a "large half," and always have access to the numbers right in the middle?
The Challenge of the Median
The median is the value that separates the higher half of a dataset from the lower half.
- If the dataset has an odd number of elements, the median is the single middle element.
- If it has an even number of elements, the median is the average of the two middle elements.
A naive approach of sorting the list every time a new number is added would be for each addition, which is far too slow for a data stream.
The Two-Heap Solution
The key insight is to maintain two heaps that represent the two halves of the dataset:
- A Max-Heap (
small_half
): To store the smaller half of the numbers. The largest of these numbers will always be at the top, ready to be accessed. - A Min-Heap (
large_half
): To store the larger half of the numbers. The smallest of these numbers will always be at the top.
By keeping these two heaps balanced, the median will always be available right at their peaks.
How it Works
We maintain an invariant: the heaps are always balanced, or the small_half
(max-heap) is allowed to have just one more element than the large_half
(min-heap).
For each new number num
:
- Add to a Heap: We can start by adding
num
to thesmall_half
(max-heap). - Balance: The new number might violate our heap property (e.g., a large number was added to the
small_half
). To fix this, we immediately take the largest element fromsmall_half
(its root) and move it to thelarge_half
(min-heap). - Rebalance: Now, the heaps might be unbalanced in size. If the
large_half
has more elements than thesmall_half
, we take the smallest element fromlarge_half
(its root) and move it back tosmall_half
.
After these steps, the heaps are balanced, and the median can be found in time.
Finding the Median
- If the total number of elements is odd, the
small_half
will have one more element than thelarge_half
. The median is simply the root of thesmall_half
. - If the total number of elements is even, both heaps will have the same size. The median is the average of the roots of both heaps.
This process gives us time for each number added and time for finding the median.
Walkthrough: Implementing the MedianFinder
import heapq
class MedianFinder:
def __init__(self):
# max-heap for the smaller half (negate values for min-heap)
self.small_half = []
# min-heap for the larger half
self.large_half = []
def addNum(self, num: int) -> None:
# 1. Add to small_half (max-heap)
heapq.heappush(self.small_half, -1 * num)
# 2. Balance: move largest from small_half to large_half
if self.small_half and self.large_half and (-1 * self.small_half[0]) > self.large_half[0]:
val = -1 * heapq.heappop(self.small_half)
heapq.heappush(self.large_half, val)
# 3. Rebalance size
if len(self.small_half) > len(self.large_half) + 1:
val = -1 * heapq.heappop(self.small_half)
heapq.heappush(self.large_half, val)
if len(self.large_half) > len(self.small_half) + 1:
val = heapq.heappop(self.large_half)
heapq.heappush(self.small_half, -1 * val)
def findMedian(self) -> float:
if len(self.small_half) > len(self.large_half):
return -1 * self.small_half[0]
if len(self.large_half) > len(self.small_half):
return self.large_half[0]
return (-1 * self.small_half[0] + self.large_half[0]) / 2.0
import java.util.PriorityQueue;
import java.util.Collections;
class MedianFinder {
private PriorityQueue<Integer> small_half; // Max-heap
private PriorityQueue<Integer> large_half; // Min-heap
public MedianFinder() {
small_half = new PriorityQueue<>(Collections.reverseOrder());
large_half = new PriorityQueue<>();
}
public void addNum(int num) {
small_half.offer(num);
large_half.offer(small_half.poll());
if (large_half.size() > small_half.size()) {
small_half.offer(large_half.poll());
}
}
public double findMedian() {
if (small_half.size() > large_half.size()) {
return small_half.peek();
} else {
return (small_half.peek() + large_half.peek()) / 2.0;
}
}
}
#include <queue>
#include <vector>
class MedianFinder {
std::priority_queue<int> small_half; // Max-heap
std::priority_queue<int, std::vector<int>, std::greater<int>> large_half; // Min-heap
public:
void addNum(int num) {
small_half.push(num);
large_half.push(small_half.top());
small_half.pop();
if (large_half.size() > small_half.size()) {
small_half.push(large_half.top());
large_half.pop();
}
}
double findMedian() {
if (small_half.size() > large_half.size()) {
return small_half.top();
} else {
return (small_half.top() + large_half.top()) / 2.0;
}
}
};
More Problems & Variations
- Sliding Window Median (LeetCode Link): This is a significantly harder variation. You need to find the median of only the elements within a moving window. 💡 HintThe two-heap approach is still the way to go, but you now need an efficient way to remove elements from the heaps as they slide out of the window. Standard heaps don't support removal, so this requires a more advanced technique, like using a hash map to mark elements for lazy deletion.
Bonus Point: When Not to Use This
While the two-heap strategy is excellent for a running median, it's not always the best tool. If you have a static, unchanging dataset, you don't need the complexity of heaps. Simply sorting the array and picking the middle element () is often sufficient and easier to implement. For finding the median of a static array in time on average, you can also use the Quickselect algorithm (the same one mentioned in the "Top K" chapter).