Heaps (Priority Queues)
Life rarely stops to let us think. Like a running median, we have to stay centered as the data flows.
Imagine you're building a system to monitor real-time transaction amounts and you need to display the running median price. As thousands of new transactions pour in every second, re-sorting the entire list to find the middle value is not an option. You need a way to instantly access the median as the data grows.
Problem: Find Median from Data Stream (LeetCode Link): Design a data structure that supports adding new numbers and finding the median of all elements seen so far. 💡 HintHow can you divide the numbers into two groups, a "small half" and a "large half," and always have access to the numbers right in the middle?
The median is the value that separates the higher half of a dataset from the lower half.
A naive approach of sorting the list every time a new number is added would be for each addition, which is far too slow for a data stream.
The key insight is to maintain two heaps that represent the two halves of the dataset:
small_half): To store the smaller half of the numbers. The largest of these numbers will always be at the top, ready to be accessed.large_half): To store the larger half of the numbers. The smallest of these numbers will always be at the top.By keeping these two heaps balanced, the median will always be available right at their peaks.
We maintain an invariant: the heaps are always balanced, or the small_half (max-heap) is allowed to have just one more element than the large_half (min-heap).
For each new number num:
num to the small_half (max-heap).small_half). To fix this, we immediately take the largest element from small_half (its root) and move it to the large_half (min-heap).large_half has more elements than the small_half, we take the smallest element from large_half (its root) and move it back to small_half.After these steps, the heaps are balanced, and the median can be found in time.
small_half will have one more element than the large_half. The median is simply the root of the small_half.This process gives us time for each number added and time for finding the median.
import heapq
class MedianFinder:
def __init__(self):
# max-heap for the smaller half (negate values for min-heap)
self.small_half = []
# min-heap for the larger half
self.large_half = []
def addNum(self, num: int) -> None:
# 1. Add to small_half (max-heap)
heapq.heappush(self.small_half, -1 * num)
# 2. Balance: move largest from small_half to large_half
if self.small_half and self.large_half and (-1 * self.small_half[0]) > self.large_half[0]:
val = -1 * heapq.heappop(self.small_half)
heapq.heappush(self.large_half, val)
# 3. Rebalance size
if len(self.small_half) > len(self.large_half) + 1:
val = -1 * heapq.heappop(self.small_half)
heapq.heappush(self.large_half, val)
if len(self.large_half) > len(self.small_half) + 1:
val = heapq.heappop(self.large_half)
heapq.heappush(self.small_half, -1 * val)
def findMedian(self) -> float:
if len(self.small_half) > len(self.large_half):
return -1 * self.small_half[0]
if len(self.large_half) > len(self.small_half):
return self.large_half[0]
return (-1 * self.small_half[0] + self.large_half[0]) / 2.0
import java.util.PriorityQueue;
import java.util.Collections;
class MedianFinder {
private PriorityQueue<Integer> small_half; // Max-heap
private PriorityQueue<Integer> large_half; // Min-heap
public MedianFinder() {
small_half = new PriorityQueue<>(Collections.reverseOrder());
large_half = new PriorityQueue<>();
}
public void addNum(int num) {
small_half.offer(num);
large_half.offer(small_half.poll());
if (large_half.size() > small_half.size()) {
small_half.offer(large_half.poll());
}
}
public double findMedian() {
if (small_half.size() > large_half.size()) {
return small_half.peek();
} else {
return (small_half.peek() + large_half.peek()) / 2.0;
}
}
}
#include <queue>
#include <vector>
class MedianFinder {
std::priority_queue<int> small_half; // Max-heap
std::priority_queue<int, std::vector<int>, std::greater<int>> large_half; // Min-heap
public:
void addNum(int num) {
small_half.push(num);
large_half.push(small_half.top());
small_half.pop();
if (large_half.size() > small_half.size()) {
small_half.push(large_half.top());
large_half.pop();
}
}
double findMedian() {
if (small_half.size() > large_half.size()) {
return small_half.top();
} else {
return (small_half.top() + large_half.top()) / 2.0;
}
}
};
While the two-heap strategy is excellent for a running median, it's not always the best tool. If you have a static, unchanging dataset, you don't need the complexity of heaps. Simply sorting the array and picking the middle element () is often sufficient and easier to implement. For finding the median of a static array in time on average, you can also use the Quickselect algorithm (the same one mentioned in the "Top K" chapter).