What follows are some half-baked ideas for how the quantity called “entropy”, used in the analysis of some algorithms for problems including sorting and convex hulls, can justify being called entropy.

The simplest of these is the run-based entropy used to analyze some comparison-based sorting algorithms. Here, a run in an unsorted input sequence is a subsequence of the input that is already sorted (let’s say, in the same ascending order that you are trying to produce for the whole sequence). If one partitions a length-\(n\) input into \(r\) maximal runs of sizes \(n_i\), then the entropy of the input is defined to be

\[\mathrm{H}=-\sum_{i=1}^r \frac{n_i}{n}\log_2 \frac{n_i}{n}.\]

(That \(\mathrm{H}\) is intended as a Greek capital letter eta, and I’m using \(\log_2\) to measure entropy in bits rather than nats or dits or whatever.) One can sort in time \(O\bigl(n(\mathrm{H}+1)\bigr)\), by repeatedly merging the two shortest remaining runs (analogous to the construction of an optimal Huffman code), or by other methods that try to merge short runs but not necessarily in an optimal order, like Timsort. Here the \(+1\) is there to account for the fact that, even when \(\mathrm{H}<1\), you have to spend constant time per input element just to look at it and output it.
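To make this concrete, here is a minimal Python sketch (not from any particular library; the function names are mine, and I take a run to be a maximal contiguous block that is already in ascending order) of the run decomposition, its entropy, and the Huffman-like strategy of always merging the two shortest remaining runs:

```python
import heapq
from math import log2

def runs(a):
    """Split a into maximal runs: contiguous blocks that are already in ascending order."""
    out, start = [], 0
    for i in range(1, len(a)):
        if a[i] < a[i - 1]:
            out.append(a[start:i])
            start = i
    out.append(a[start:])
    return out

def run_entropy(a):
    """The run-based entropy H of a, measured in bits."""
    n = len(a)
    return -sum(len(r) / n * log2(len(r) / n) for r in runs(a))

def merge(x, y):
    """Standard merge of two sorted lists."""
    out, i, j = [], 0, 0
    while i < len(x) and j < len(y):
        if x[i] <= y[j]:
            out.append(x[i])
            i += 1
        else:
            out.append(y[j])
            j += 1
    return out + x[i:] + y[j:]

def entropy_sensitive_sort(a):
    """Sort by repeatedly merging the two shortest remaining runs, Huffman-style."""
    if len(a) <= 1:
        return list(a)
    heap = [(len(r), t, r) for t, r in enumerate(runs(a))]  # t breaks ties between equal lengths
    heapq.heapify(heap)
    count = len(heap)
    while len(heap) > 1:
        _, _, x = heapq.heappop(heap)
        _, _, y = heapq.heappop(heap)
        heapq.heappush(heap, (len(x) + len(y), count, merge(x, y)))
        count += 1
    return heap[0][2]
```

On an input such as [1, 3, 5, 2, 4, 6], this finds the two runs [1, 3, 5] and [2, 4, 6] and reports \(\mathrm{H}=1\) bit.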

This entropy formula bears an obvious syntactic resemblance to the entropy of a discrete probability distribution, whose \(r\) outcomes have probabilities \(p_i\) summing to one,

\[\mathrm{H}=-\sum_{i=1}^r p_i\log_2 p_i.\]

It’s what you would get if you picked a random element of your input sequence and asked for the probability that it belongs to each run. But is this a meaningful thing to do for a sorting algorithm? For a better answer to what entropy means here, I think we need to shift from probability theory to information theory. If I told you that an input had a special structure, \(r\) runs of sizes \(n_i\), how many bits of information would it take to determine the correct sorted output among all the other possible outputs? This is just the binary logarithm of the number of inputs that have this special structure. The entropy is then the average number of bits that remain to be determined per input value.

An input having this special structure can be generated by starting with the sorted output and then unsorting it. Once we specify which element of the sorted output belongs to which run (the first run, the second, the third, etc.), we know everything of relevance about the input: we can permute any sorted output into an input matching this specification by writing down the elements of each run in the same order that they appear in the output.

Unsorting a sorted sequence of items, labeled by which run each item belongs to, to produce an input to the sorting problem in which the runs of consecutive sorted elements have predetermined numbers of items
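Here is a tiny Python sketch of this unsorting construction (the function name and the list-of-labels encoding are my own, purely for illustration):

```python
def unsort(sorted_output, labels):
    """Rebuild an input from its sorted output and a run label (0, 1, ..., r-1)
    for each position: each run consists of the elements labeled with its index,
    kept in the same relative order in which they appear in the sorted output."""
    r = max(labels) + 1
    runs = [[] for _ in range(r)]
    for x, label in zip(sorted_output, labels):
        runs[label].append(x)
    return [x for run in runs for x in run]

# For example, unsort([1, 2, 3, 4, 5, 6], [0, 1, 0, 1, 0, 1])
# produces [1, 3, 5, 2, 4, 6], whose two runs are [1, 3, 5] and [2, 4, 6].
```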

So then the question becomes, how many ways are there of specifying membership in runs, so that we get the given number of elements \(n_i\) in each run? This count is a multinomial coefficient, and a little manipulation with Stirling’s formula shows that it is roughly \(2^{n\mathrm{H}}\), where the “roughly” means to within low-order terms in the exponent. Because it takes roughly \(n\mathrm{H}\) bits to specify a correct output, and each comparison reveals at most one bit, any algorithm for sorting by comparisons must make roughly \(n\mathrm{H}\) comparisons, or it will not have enough distinct outputs, causing it to produce the same output for two inputs that should have distinct outputs. Because of this lower bound, and the trivial \(\Omega(n)\) lower bound, algorithms that can sort in time \(O\bigl(n(\mathrm{H}+1)\bigr)\) are optimal (to within constant factors) for this input assumption.
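For completeness, the manipulation is just Stirling’s formula applied to the multinomial coefficient (my own filling-in of a step left implicit above):

\[\log_2\binom{n}{n_1,\dots,n_r}=\log_2\frac{n!}{n_1!\cdots n_r!}=\sum_{i=1}^r n_i\log_2\frac{n}{n_i}+O(r\log n)=n\mathrm{H}+O(r\log n),\]

using \(\log_2 m!=m\log_2 m-m\log_2 e+O(\log m)\); the \(m\log_2 e\) terms cancel because the \(n_i\) sum to \(n\).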

Now let’s move on to convex hulls. To get to something like entropy, we need to find an output specification whose number of bits correlates with algorithm complexity, and the obvious specification doesn’t work. If the convex hull has \(h\) vertices, and the output is a list of these vertices in consecutive order around the hull, then it takes \(O(h\log n)\) bits to write the output. But in the dubiously-titled “The ultimate planar convex hull algorithm?”, David Kirkpatrick and Raimund Seidel proved that any algorithm based on comparing polynomials of input coordinates requires a much larger number of comparisons, \(\Omega(n\log h)\). They and several later even-more-ultimate works also proved that this larger number is achievable.

This issue is related to the fact that the output to a sorting algorithm is easily checked: if you know that the output is a permutation of the input, you just have to check that consecutive outputs are in the right order, something that can be done much more quickly than the sorting algorithm itself. But the convex hull, specified as its sorted list of vertices, is not as easy to check. Some of the points are likely to be missing, and they need some reason to be missing. A convex hull algorithm is going to find such a reason, but to make the output checkable this reason needs to be communicated. To keep things simpler, let’s only consider the upper convex hull (the vertices from which an upward vertical ray would miss the hull). An input point \(p\) is not an upper hull vertex if and only if \(p\) belongs to the trapezoid below the segment connecting some two other input points. So let’s specify an output to a checkable upper convex hull algorithm to be the list of hull vertices in left-to-right order, together with a collection of trapezoids defined by pairs of input points, and an assignment of all non-hull points to trapezoids that contain them. One way of making such an assignment would be to partition the region below the upper hull into the trapezoids below its edges, but other partitions may be better.

A set of points, with the four vertices of the upper convex hull colored gray, and with each remaining point colored the same as a trapezoid that covers it. Two trapezoids, colored yellow and blue, lie below the leftmost and rightmost convex hull edges; there is one yellow point in the yellow trapezoid and one blue point in the blue trapezoid. A third red trapezoid lies below a line segment with yellow and blue endpoints; 13 red points lie in it.
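A minimal Python sketch of the coverage part of such a check, under my own encoding of points as coordinate tuples and of trapezoids as the pair of input points defining them (verifying that the claimed hull chain itself is \(x\)-sorted and concave is similarly easy and omitted here):

```python
def in_trapezoid_below(p, a, b):
    """True if point p lies in the trapezoid below the segment from a to b:
    p's x-coordinate lies between theirs and p is on or below the line through them."""
    if a[0] > b[0]:
        a, b = b, a          # ensure a is the left endpoint
    (ax, ay), (bx, by), (px, py) = a, b, p
    if not (ax <= px <= bx):
        return False
    # cross product <= 0 means p is on or below the line from a to b
    return (bx - ax) * (py - ay) - (by - ay) * (px - ax) <= 0

def check_coverage(points, hull_vertices, assignment):
    """Check the coverage part of the certificate: every input point that is not a
    claimed hull vertex lies inside the trapezoid assigned to it."""
    hull_set = set(hull_vertices)
    return all(
        in_trapezoid_below(p, *assignment[p])
        for p in points
        if p not in hull_set
    )
```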

Suppose we know that a given input can be described by an output of this form with \(t\) trapezoids, \(h\) convex hull vertices, and a certain number of points per trapezoid. Let \(r=t+h\) be the total number of parts, let \(n_i=1\) for each index \(i\) corresponding to a convex hull vertex, and let \(n_i\) be the number of covered points for each index corresponding to a trapezoid. Then we can define the “trapezoid entropy” to be the quantity \(\mathrm{H}\) given by the same formula as above, and specify the output with \(n\mathrm{H}+2t\log_2 n\) bits of information. The \(n\mathrm{H}\) part of this formula is how much information it takes to pick out the hull vertices and assign the remaining points to trapezoids; we also need \(2t\log_2 n\) more bits to specify the identities of the two input points defining the segment above each trapezoid. If we ask for time \(O\bigl(n(\mathrm{H}+1)\bigr)\) as before, the extra \(+1\) will cover this part of the formula and we can omit it.
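As a sanity check, here is how I would compute this quantity in Python for the partition in the figure above (4 hull vertices, plus trapezoids covering 1, 1, and 13 points; the function name is mine):

```python
from math import log2

def trapezoid_entropy(hull_count, trapezoid_sizes):
    """Entropy of the partition into hull-vertex singletons and trapezoid classes."""
    sizes = [1] * hull_count + list(trapezoid_sizes)
    n = sum(sizes)
    return -sum(s / n * log2(s / n) for s in sizes)

# The 19-point example from the figure: about 1.7 bits per point.
print(trapezoid_entropy(4, [1, 1, 13]))
```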

This time bound is obtained by “Instance-optimal geometric algorithms” [Afshani, Barbay, and Chan, JACM 2017], only they don’t use trapezoid entropy. Instead, they ask to assign the input points to triangles that are entirely below the convex hull, and define the “structural entropy” in the same way from these triangles. They show that for the structural entropy, \(O\bigl(n(\mathrm{H}+1)\bigr)\) time is possible, and \(\Omega\bigl(n(\mathrm{H}+1)\bigr)\) time is necessary in a multilinear (not fully algebraic) decision tree model of computing. Their structural entropy has an entropy-like formula but I don’t know how to interpret it information-theoretically as a number of outputs for some reasonable output model. The assignment of points to triangles is not a reasonable output because it doesn’t immediately tell you why the unlisted points are unlisted; you still have to do too much work to verify that each triangle really is below the convex hull. Fortunately, though, their algorithm does produce a trapezoid cover, and their structural entropy is within a constant factor of the trapezoid entropy. You can divide any trapezoid into two triangles, and you can cover any triangle below the convex hull with at most four trapezoids: the ones from the hull edges covering the leftmost and rightmost corners of the triangle, and two more trapezoids covering the parts between them.

It’s also possible to use run-based entropy for convex hulls. The Graham scan convex hull algorithm runs in linear time on inputs sorted by their \(x\)-coordinates, so if an input can be partitioned into a small number of sorted runs of points, using an entropy-sensitive sorting algorithm followed by Graham scan will be fast. This kind of entropy is quite different from trapezoid-based or structural entropy, though: it depends only on the \(x\)-coordinates of the points and the order in which they are given to the algorithm, whereas trapezoid-based entropy is independent of ordering and depends on the two-dimensional geometry of the input. Both methods share the idea of partitioning the input into a small number of easy-to-handle subsets (runs and trapezoids), but the partitions are very different.
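Here is a sketch of the second stage of this approach, assuming the points (as coordinate tuples) have already been sorted, for instance by the run-merging sketch above with tuples compared lexicographically:

```python
def upper_hull(points):
    """Graham scan for the upper hull. The points must already be sorted
    by x-coordinate; the scan then takes linear time."""
    hull = []
    for p in points:
        # pop the last hull point while it fails to make a strict right turn toward p
        while len(hull) >= 2:
            (ax, ay), (bx, by), (px, py) = hull[-2], hull[-1], p
            if (bx - ax) * (py - ay) - (by - ay) * (px - ax) >= 0:
                hull.pop()
            else:
                break
        hull.append(p)
    return hull
```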

This naturally suggests the idea of defining a more general kind of entropy that would allow both kinds of easy-to-handle subsets to be mixed in a single partition. This is the main idea of my paper “Instance-optimal computational geometry made easier and sensitive to sortedness” with Mike Goodrich, Abraham Illickan, and Claire To, from CCCG last summer. We define something that we call “range-partition entropy” for geometric problems more generally, but for 2d convex hulls it boils down to the following: we partition the input into subsets that either lie in triangles below the convex hull (like the structural entropy of Afshani et al.) or form sorted subsequences of the input (like the runs of a sorting algorithm used in preparation for Graham scan), compute the entropy from this partition, and provide convex hull algorithms whose time is \(O\bigl(n(\mathrm{H}+1)\bigr)\) for this entropy. But there is an annoying side condition: the runs also have to be contained in triangles, and these triangles must be disjoint from the triangles of other runs and from the triangles below the hull. (The below-the-hull triangles are allowed to cross each other, as before.) It would be easy to substitute into our algorithms other definitions of what it means for a subset to be easy to solve, but it seems more difficult to get rid of this disjointness condition. And although this disjointness condition is a natural fit to the quickhull-like divide-and-conquer design of our algorithms, I don’t know of a good way to justify it through probabilistic or information-theoretic entropy. So for now, the entropy from our paper is an entropy only in the form of its formula.

(Discuss on Mastodon)