While researching the Wikipedia article on median selection algorithms (now a Good Article) I discovered that Python’s implementation, heapq.nsmallest, changed its algorithm to one whose worst-case performance is worse than that of its previous implementation, based on an empirical analysis that appears to consider only the new algorithm’s best-case performance.

The current implementation of heapq can be found on GitHub, with its version history. The nsmallest and nlargest routines were added in 2004, using two different algorithms! For nsmallest, the algorithm was heapselect: create a single \(n\)-item binary heap and then pop \(k\) items from it. Its worst-case running time is \(O(n+k\log n)\). For nlargest, the original commit instead used an algorithm for maintaining the running \(k\) largest items:

  • Initialize a heap with the first \(k\) items
  • For each remaining item \(x\), if \(x\) is larger than the heap minimum, pop the heap and add \(x\) in its place.

Because the heap only stores \(k\) items, each heap operation takes time \(O(\log k)\). In the worst case of sorted data, each item triggers a heap operation, and the total time is \(O(n\log k)\). This is always worse than heapselect. It only improves on the running time of sorting when \(k\) is a polynomial fraction of \(n\), \(k=O(n^c)\), for some constant exponent \(c<1\). The precise choice of this exponent depends on the tradeoff between the smaller argument of the logarithm versus the worse constant factor of heapsort relative to other comparison sorting algorithms.
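
As a concrete illustration, here is a minimal sketch of the two approaches, written against the public heapq API; the function names are mine, and this is not the library’s actual code (which also handles key functions and various edge cases that I omit).

```python
import heapq
from itertools import islice

def nsmallest_heapselect(k, iterable):
    """Heapselect: build one n-item heap, then pop k items from it.
    Worst case O(n + k log n) comparisons."""
    heap = list(iterable)
    heapq.heapify(heap)                      # roughly 2n comparisons
    return [heapq.heappop(heap) for _ in range(min(k, len(heap)))]

def nlargest_running(k, iterable):
    """Running selection: keep a min-heap of the k largest items seen so far.
    Worst case O(n log k) comparisons."""
    it = iter(iterable)
    heap = list(islice(it, k))               # the first k items
    heapq.heapify(heap)
    for x in it:
        if heap and x > heap[0]:             # larger than the heap minimum?
            heapq.heapreplace(heap, x)       # pop the minimum and push x: O(log k)
    return sorted(heap, reverse=True)
```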

A 2013 update to the Python implementation removed the old heapselect nsmallest implementation (with the better worst-case runtime), replacing it with the same algorithm as nlargest (with the slower worst case). Subsequent edits made minor optimizations and added the extended comment that can be seen in the current version, starting near line 400, pointing to algorithmic experiments performed in 2011 by Raymond Hettinger that appear to show that the \(O(n\log k)\) algorithm is faster than the \(O(n+k\log n)\) algorithm!

What’s going on here?

First, Hettinger’s experiments are bad. They run only on lists of integers, in sorted order. (Actually the list items are objects that act like integers but count their comparisons, but that makes no difference to this argument.) When using running selection for nsmallest, this is the best case, because none of the items will trigger a heap update. The algorithm performs only one comparison per item after the initial \(k\)-item heapify. In contrast, heapselect initializes a heap with \(n\) items, which takes \(\approx 2n\) comparisons in the worst case (Hettinger quotes a better constant factor of \(1.66n\) that appears to be drawn from his experiments on this special case rather than from any general analysis). Of course the running selection algorithm wins on this best-case data. But that doesn’t tell you anything about real-world usage. If you knew your data were already sorted, you would just select the first \(k\) items directly rather than using this algorithm.
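
To make the dependence on input order concrete, here is a small experiment of my own (not one of Hettinger’s benchmarks) counting heap updates in running selection for the \(k\) smallest, with values negated so that Python’s min-heap acts as a max-heap:

```python
import heapq

def count_heap_updates(k, data):
    """Run running selection for the k smallest items of data and
    return how many items displaced the current k-th smallest."""
    heap = [-x for x in data[:k]]            # negate so the min-heap tracks the max
    heapq.heapify(heap)
    updates = 0
    for x in data[k:]:
        if -x > heap[0]:                     # x is smaller than the current k-th smallest
            heapq.heapreplace(heap, -x)
            updates += 1
    return updates

n, k = 100_000, 100
print(count_heap_updates(k, list(range(n))))         # sorted input: 0 updates
print(count_heap_updates(k, list(range(n, 0, -1))))  # reverse-sorted input: n - k updates
```

On the sorted input the loop never touches the heap, which is exactly the case the benchmarks measured.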

Second, it may well be the case that the running selection algorithm is actually the right algorithm to use, at least for small \(k\), because in many cases it uses few heap operations (but not zero!). In the worst case each item triggers a heap update, but for other orderings of the data these updates are unlikely. A heap update happens for the \(i\)th item of the input only when that item is one of the \(k\) smallest among the first \(i\) items. For a uniformly random permutation of the data, the \(i\)th item is equally likely to hold any rank among the first \(i\) items, so it is one of the \(k\) smallest with probability \(k/i\). By linearity of expectation, the expected number of heap updates, on random or randomly-ordered data, is the sum of the update probabilities for the individual items,

\[\sum_{i=k+1}^n \frac{k}{i}\approx k\ln\frac{n}{k}.\]

Each heap update takes time \(O(\log k)\), based on the size of the heap. So the expected total number of comparisons for running selection, in this average case, is \(n+O(k\log(n/k)\log k)\). (There is also an \(O(k\log k)\) term coming from the fact that nsmallest sorts its output, but that is swallowed by the \(O\)-notation.) There is some tradeoff between heapselect and running selection: for very small \(k\), running selection wins, because it has a better constant on the leading \(n\) term. For large enough \(k\), the other terms in the running time dominate, and heapselect wins because it has only one logarithm in its time bound while running selection has two. For even larger \(k\), you might as well just sort the input. But Python’s nsmallest is aimed at small values of \(k\), for which running selection wins.
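
As a quick check on this approximation (my own illustration, not anything from the Python sources), one can evaluate the sum directly and compare it with \(k\ln(n/k)\):

```python
import math

n, k = 100_000, 100
expected_updates = sum(k / i for i in range(k + 1, n + 1))  # exact expected number of updates
approximation = k * math.log(n / k)                         # the k ln(n/k) estimate
print(expected_updates, approximation)                      # both are about 690 here
```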

Unfortunately I couldn’t find any sources stating all this, beyond the original source code. There are books that discuss Python’s nsmallest routine, but they either say that it can do selection in time \(O(n\log k)\) (true, but not the whole story) or that it should be avoided unless \(k\) is small relative to \(n\). One can probably find other sources analyzing running selection in the average case (for instance as an implementation of MinHash), but they’re not going to connect it to Python’s implementation.

Finally, suppose you want a simple and implementable deterministic comparison-based selection algorithm for the \(k\) smallest items that has the same \(n+f(k)\) average-case number of comparisons and \(O(k)\) space that you get from running selection. Is the Python algorithm the best choice? Does it have the best \(f(k)\)? Here’s an alternative:

  • Initialize two buffers \(A\) and \(B\) of size \(k\), where \(A\) contains a sorted list of the first \(k\) items, and \(B\) starts empty. Maintain a number \(a=\max A\).
  • For each remaining item that is smaller than \(a\), add it to \(B\).
  • If \(B\) becomes full, or at the end of the algorithm, sort \(B\) and then merge it with \(A\), keeping only the \(k\) smallest items in the merged sorted list, and recompute \(a\).

The weaker test for whether to add an item to \(B\) (based on \(A\) rather than on the current \(k\) smallest) increases the expected number of added items by less than \(k\), not enough to affect the analysis significantly. Each added item participates in a sorting step, so we get the same \(n+O(k\log(n/k)\log k)\) comparison bound as Python’s implementation. We use twice the space because of the extra buffer, but in exchange the sorting step uses your favorite fast comparison sorting algorithm in place of two binary heap operations per item, so the constant factor on the second term should be smaller. We could also replace the sorting-based buffer decimation by a method based on linear time selection, resulting in a method that is truly linear time in the worst case and still \(n+O(k\log(n/k))\) in the average case, but making that both practical and deterministic appears to be difficult.
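
Here is a sketch of that alternative, under naming of my own and with Python’s built-in sort and heapq.merge standing in for the sort-and-merge step; it is meant to illustrate the idea rather than to be a tuned replacement for heapq.nsmallest.

```python
import heapq
from itertools import islice

def nsmallest_two_buffer(k, iterable):
    """Two-buffer running selection: A is a sorted list of k candidate
    smallest items, B collects later items smaller than max(A); whenever
    B fills up, sort it, merge with A, and keep only the k smallest."""
    it = iter(iterable)
    A = sorted(islice(it, k))                # the first k items, sorted
    if len(A) < k:                           # fewer than k items in total
        return A
    a = A[-1]                                # threshold: current max(A)
    B = []

    def decimate(A, B):
        B.sort()                             # any fast comparison sort works here
        return list(islice(heapq.merge(A, B), k))

    for x in it:
        if x < a:                            # weaker test, against max(A)
            B.append(x)
            if len(B) == k:                  # buffer full: merge and shrink
                A = decimate(A, B)
                a = A[-1]
                B = []
    if B:                                    # leftover partial buffer at the end
        A = decimate(A, B)
    return A
```

For example, nsmallest_two_buffer(3, [5, 1, 4, 2, 8, 0, 7]) returns [0, 1, 2].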

(Discuss on Mastodon)