In the 1990s I published a series of papers on data structures for closest pairs. As long as you already know how to maintain dynamic sets of objects of some type, and answer nearest-neighbor queries among them, you can also keep track of the closest pair, and this can be used as a subroutine in many other computational geometry algorithms. But it turns out that many of those algorithms can now be simplified and sped up by using mutual nearest neighbors (pairs of objects that are each other’s nearest neighbors) instead of closest pairs.

My original motivation for studying these problems was maintaining minimum spanning trees of dynamic point sets, using closest red-blue pairs of Euclidean points [1,2], and I later found more applications in hierarchical clustering, greedy matching, traveling salesperson heuristics [3,4], and (with Jeff Erickson) motorcycle graphs and straight skeletons [5]. But to use these closest pair data structures, you have to pay two logarithmic factors in time complexity over the time for the underlying nearest-neighbor data structure. So they’re not competitive with (uncolored) Euclidean closest pair data structures, which take only logarithmic time in any fixed dimension. Instead they make more sense with distances other than Euclidean, with objects more complicated than single points, or with variations like the red-blue closest pair for which the logarithmic-time solution doesn’t work.

For several variations of hierarchical clustering, an alternative and simpler technique has been known for quite a bit longer, based on finding mutual nearest neighbors (pairs of objects that are nearer to each other than to anything else) rather than closest pairs [6,7]. It’s called the nearest neighbor chain algorithm, but it’s really a data structure rather than an algorithm, one that lets you maintain a dynamic point set and find pairs of mutual nearest neighbors, again based on calls to an underlying nearest neighbor data structure. The idea is to maintain a stack of nearest-neighbor pairs at shorter and shorter distances, until the two objects whose distance is on top of the stack have nothing nearer – they are mutual nearest neighbors. Whenever you want a pair, you look at the top of the stack, an object \(x\) and its nearest neighbor \(y\), and ask whether \(y\)’s nearest neighbor is \(x\). If so, you have found a mutual nearest neighbor pair; if not, you have a new shorter distance to push onto the stack.
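To make this concrete, here is a minimal Python sketch of the chain-following step. The `nearest` oracle is a brute-force stand-in for a real nearest-neighbor data structure, the function names are illustrative rather than taken from any of these papers, and all pairwise distances are assumed distinct so that the chain of strictly decreasing distances must terminate.

```python
import math

def nearest(x, points):
    # Brute-force stand-in for a nearest-neighbor data structure:
    # the point of `points` closest to x, other than x itself.
    return min((p for p in points if p != x),
               key=lambda p: math.dist(x, p))

def mutual_nearest_pair(points):
    # Follow a chain x1, x2, ... in which each point is the nearest
    # neighbor of its predecessor, so distances strictly decrease
    # (assuming distinct distances); the chain must therefore end at
    # two points that are each other's nearest neighbor.
    points = list(points)
    chain = [points[0]]            # start the chain anywhere
    while True:
        x = chain[-1]
        y = nearest(x, points)
        if len(chain) >= 2 and y == chain[-2]:
            return x, y            # mutual nearest neighbors
        chain.append(y)            # strictly shorter distance: extend

print(mutual_nearest_pair([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0), (5.5, 0.0)]))
# -> ((1.0, 0.0), (0.0, 0.0))
```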

One can use this in a hierarchical clustering algorithm that repeatedly finds and merges the nearest two clusters, whenever the distance between clusters has a special property: a merged cluster is never closer to other clusters than the closer of the two clusters it was formed from. This property implies both that the stack of distances remains valid after the merge, and that mutual nearest neighbors are always safe to merge. If two clusters are mutual nearest neighbors, then the closest-pair clustering algorithm will eventually merge them, because none of its actions can cause them to stop being mutual nearest neighbors. So we might as well merge them immediately, as soon as we discover them to be mutual nearest neighbors. (One way to formulate this mathematically is that the set of mutual nearest neighbor pairs merged by the clustering algorithm forms an antimatroid.) When this works, you get a clustering algorithm that uses a linear number of nearest neighbor queries, instead of the \(O(n\log^2 n)\) queries that you would get using my closest-pair data structures.
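As an illustration, here is a sketch of that merging loop in Python, with complete linkage (maximum pairwise point distance) standing in as one cluster distance that has the special property above. Everything here is brute force rather than the data structures from these papers, and ties in distance are assumed to behave (only the tie with the chain predecessor is handled explicitly).

```python
import math

def complete_linkage(a, b):
    # Cluster distance = largest pairwise point distance; a merged
    # cluster is never closer to others than the closer of its parts.
    return max(math.dist(p, q) for p in a for q in b)

def nn_chain_clustering(points, dist=complete_linkage):
    clusters = {frozenset([p]) for p in points}
    chain, merges = [], []
    while len(clusters) > 1:
        if not chain:
            chain.append(next(iter(clusters)))  # restart anywhere
        x = chain[-1]
        y = min((c for c in clusters if c != x), key=lambda c: dist(x, c))
        if len(chain) >= 2 and dist(x, chain[-2]) <= dist(x, y):
            y = chain[-2]                       # prefer predecessor on ties
        if len(chain) >= 2 and y == chain[-2]:
            # x and y are mutual nearest neighbors: merge immediately.
            chain.pop(); chain.pop()
            clusters -= {x, y}
            clusters.add(x | y)
            merges.append((x, y))
            # The special property keeps the rest of the chain valid.
        else:
            chain.append(y)                     # strictly nearer: extend
    return merges

print(nn_chain_clustering([(0, 0), (1, 0), (4, 0), (10, 0)]))
```

Each iteration either pushes one cluster onto the chain or merges two, clusters leave the chain only by being merged, and there are at most \(n-1\) merges, so the total number of iterations (and hence of nearest neighbor queries) is linear.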

In more recent research with UCI student Nil Mamano (finishing his doctorate this year; hire him for a postdoc, he’s good!) we noticed that the nearest neighbor chain algorithm can also be applied to certain stable marriage problems whose preferences come from geometric distances [8]. Our latest preprint, “Euclidean TSP, Motorcycle Graphs, and Other New Applications of Nearest-Neighbor Chains” (with Efrat, Frishberg, Goodrich, Kobourov, Mamano, Matias, and Polishchuk, arXiv:1902.06875), extends this to a much broader set of applications. As well as simplifying and speeding up my previous work on motorcycle graphs and TSP heuristics, we use nearest neighbor chains in a bigger class of stable matching problems and in an approximate geometric set cover problem. In each case, we need to show either that the problem has an antimatroid-like property (so using mutual nearest neighbors produces the same solution as closest pairs) or that, even when it produces a different solution, that solution has the same quality. It’s not quite true that anything closest pairs can do, mutual nearest neighbors can do better, but it’s close.

Another idea in the paper is that, to find (exact!) mutual nearest neighbor pairs, one can sometimes get away with using approximate near neighbor structures. This is important if you’re using Euclidean distance, because the time bounds for exact nearest neighbors have the form \(n^{1-\varepsilon_d}\) for constants \(\varepsilon_d\) that get very small as the dimension \(d\) gets large, while approximate nearest neighbor queries take logarithmic time in any fixed dimension. The idea is to build the stack of shorter distances by asking for a constant number of approximate near neighbors, the \(i\)th of which is within a constant factor of the distance to the actual \(i\)th nearest neighbor. By a packing argument for points in Euclidean space, either some two of these points are closer to each other than the distance on the current stack top (in which case you can build the stack one more level) or the approximate neighbors are guaranteed to contain the actual nearest neighbor (in which case you can either detect a mutual nearest neighbor pair or again build the stack). This idea leads, for instance, to an algorithm for the multi-fragment TSP heuristic that takes time \(O(n\log n)\) in Euclidean spaces of any bounded dimension; the best previous time appears to be an \(O(n^2)\)-time algorithm (valid in any metric space) from one of my previous papers [3].
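The fragment below sketches that case analysis for a single stack-building step. The oracle `approx_neighbors(x, k)` and every name here are hypothetical stand-ins for illustration: the oracle is assumed to return \(k\) points, the \(i\)th of which is within a constant factor of the distance to \(x\)’s true \(i\)th nearest neighbor, and the choice of \(k\) (which in the paper comes out of the packing argument) is taken as given.

```python
import math
from itertools import combinations

def chain_step(x, d_top, approx_neighbors, k):
    # One step of building the stack, given the current stack-top
    # distance d_top. `approx_neighbors` is a hypothetical oracle
    # standing in for an approximate near-neighbor structure.
    pts = list(approx_neighbors(x, k))
    # Case 1: some two of the returned points (here including x itself)
    # are closer than d_top, so the stack grows by one level.
    for p, q in combinations([x] + pts, 2):
        if math.dist(p, q) < d_top:
            return ('push', (p, q))
    # Case 2: by the packing argument, returned points that are pairwise
    # far apart must include x's exact nearest neighbor, so we can test
    # whether x is part of a mutual nearest neighbor pair.
    return ('candidates', pts)

# Toy demo with a brute-force oracle (exact, hence trivially within
# any constant factor of the true i-th nearest neighbor):
def brute_oracle(points):
    def oracle(x, k):
        return sorted((p for p in points if p != x),
                      key=lambda p: math.dist(x, p))[:k]
    return oracle

pts = [(0, 0), (1, 0), (5, 0), (5.5, 0)]
print(chain_step((5, 0), 2.0, brute_oracle(pts), 2))
# -> ('push', ((5, 0), (5.5, 0)))
```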

  1. Agarwal, P. K., Eppstein, D., and Matoušek, J., “Dynamic algorithms for half-space reporting, proximity problems, and geometric minimum spanning trees”, FOCS, 1992, pp. 80–89. 

  2. Eppstein, D., “Dynamic Euclidean minimum spanning trees and extrema of binary functions”, Discrete Comput. Geom. 13: 111–122, 1995. 

  3. Eppstein, D., “Fast hierarchical clustering and other applications of dynamic closest pairs”, SODA, 1998, pp. 619–628, arXiv:cs.DS/9912014, J. Experimental Algorithmics 5 (1): 1–23, 2000. 

  4. Cardinal, J., and Eppstein, D., “Lazy algorithms for dynamic closest pair with arbitrary distance measures”, ALENEX, 2004, pp. 112–119. 

  5. Eppstein, D., and Erickson, J., “Raising roofs, crashing cycles, and playing pool: applications of a data structure for finding pairwise interactions”, SoCG, 1998, pp. 58–67, Discrete Comput. Geom. 22 (4): 569–592, 1999. 

  6. Benzécri, J.-P., “Construction d’une classification ascendante hiérarchique par la recherche en chaîne des voisins réciproques”, Les Cahiers de l’Analyse des Données 7 (2): 209–218, 1982. 

  7. Juan, J., “Programme de classification hiérarchique par l’algorithme de la recherche en chaîne des voisins réciproques”, Les Cahiers de l’Analyse des Données 7 (2): 219–225, 1982. 

  8. Eppstein, D., Goodrich, M. T., and Mamano, N., “Algorithms for stable matching and clustering in a grid”, arXiv:1704.02303, IWCIA 2017, LNCS 10256 (2017), pp. 117–131. 

(Discuss on Mastodon)