Median from Stream

Life rarely stops to let us think. Like a running median, we have to stay centered as the data flows.

Imagine you're building a system to monitor real-time transaction amounts and you need to display the running median price. As thousands of new transactions pour in every second, re-sorting the entire list to find the middle value is not an option. You need a way to instantly access the median as the data grows.

Problem: Find Median from Data Stream (LeetCode Link): Design a data structure that supports adding new numbers and finding the median of all elements seen so far. 💡 Hint

The Challenge of the Median

The median is the value that separates the higher half of a dataset from the lower half.

If the dataset has an odd number of elements, the median is the single middle element.
If it has an even number of elements, the median is the average of the two middle elements.

A naive approach of sorting the list every time a new number is added would be $O(N log N)$ for each addition, which is far too slow for a data stream.

The Two-Heap Solution

The key insight is to maintain two heaps that represent the two halves of the dataset:

A Max-Heap (small_half): To store the smaller half of the numbers. The largest of these numbers will always be at the top, ready to be accessed.
A Min-Heap (large_half): To store the larger half of the numbers. The smallest of these numbers will always be at the top.

By keeping these two heaps balanced, the median will always be available right at their peaks.

How it Works

We maintain an invariant: the heaps are always balanced, or the small_half (max-heap) is allowed to have just one more element than the large_half (min-heap).

For each new number num:

Add to a Heap: We can start by adding num to the small_half (max-heap).
Balance: The new number might violate our heap property (e.g., a large number was added to the small_half). To fix this, we immediately take the largest element from small_half (its root) and move it to the large_half (min-heap).
Rebalance: Now, the heaps might be unbalanced in size. If the large_half has more elements than the small_half, we take the smallest element from large_half (its root) and move it back to small_half.

After these steps, the heaps are balanced, and the median can be found in $O(1)$ time.

Finding the Median

If the total number of elements is odd, the small_half will have one more element than the large_half. The median is simply the root of the small_half.
If the total number of elements is even, both heaps will have the same size. The median is the average of the roots of both heaps.

This process gives us $O(log N)$ time for each number added and $O(1)$ time for finding the median.

Walkthrough: Implementing the MedianFinder

import heapq

class MedianFinder:
    def __init__(self):
        # max-heap for the smaller half (negate values for min-heap)
        self.small_half = []  
        # min-heap for the larger half
        self.large_half = []

    def addNum(self, num: int) -> None:
        # 1. Add to small_half (max-heap)
        heapq.heappush(self.small_half, -1 * num)

        # 2. Balance: move largest from small_half to large_half
        if self.small_half and self.large_half and (-1 * self.small_half[0]) > self.large_half[0]:
            val = -1 * heapq.heappop(self.small_half)
            heapq.heappush(self.large_half, val)

        # 3. Rebalance size
        if len(self.small_half) > len(self.large_half) + 1:
            val = -1 * heapq.heappop(self.small_half)
            heapq.heappush(self.large_half, val)
        if len(self.large_half) > len(self.small_half) + 1:
            val = heapq.heappop(self.large_half)
            heapq.heappush(self.small_half, -1 * val)

    def findMedian(self) -> float:
        if len(self.small_half) > len(self.large_half):
            return -1 * self.small_half[0]
        if len(self.large_half) > len(self.small_half):
            return self.large_half[0]
        
        return (-1 * self.small_half[0] + self.large_half[0]) / 2.0

import java.util.PriorityQueue;
import java.util.Collections;

class MedianFinder {
    private PriorityQueue<Integer> small_half; // Max-heap
    private PriorityQueue<Integer> large_half; // Min-heap

    public MedianFinder() {
        small_half = new PriorityQueue<>(Collections.reverseOrder());
        large_half = new PriorityQueue<>();
    }

    public void addNum(int num) {
        small_half.offer(num);
        large_half.offer(small_half.poll());
        if (large_half.size() > small_half.size()) {
            small_half.offer(large_half.poll());
        }
    }

    public double findMedian() {
        if (small_half.size() > large_half.size()) {
            return small_half.peek();
        } else {
            return (small_half.peek() + large_half.peek()) / 2.0;
        }
    }
}

#include <queue>
#include <vector>

class MedianFinder {
    std::priority_queue<int> small_half; // Max-heap
    std::priority_queue<int, std::vector<int>, std::greater<int>> large_half; // Min-heap

public:
    void addNum(int num) {
        small_half.push(num);
        large_half.push(small_half.top());
        small_half.pop();
        if (large_half.size() > small_half.size()) {
            small_half.push(large_half.top());
            large_half.pop();
        }
    }

    double findMedian() {
        if (small_half.size() > large_half.size()) {
            return small_half.top();
        } else {
            return (small_half.top() + large_half.top()) / 2.0;
        }
    }
};

Bonus Point: When Not to Use This

While the two-heap strategy is excellent for a running median, it's not always the best tool. If you have a static, unchanging dataset, you don't need the complexity of heaps. Simply sorting the array and picking the middle element ( $O(N log N)$ ) is often sufficient and easier to implement. For finding the median of a static array in $O(N)$ time on average, you can also use the Quickselect algorithm (the same one mentioned in the "Top K" chapter).