Someone recently added this new Wikipedia article on the fixed-radius near-neighbor search problem. I cleaned it up a little (it obviously needs a lot more work); in the process of the cleanup I changed the definition of the problem to better match the one in the Bentley reference. But before I got to it, the definition was that one should report all pairs with distance at most Δ in a given point set.
It occurred to me that the all-pairs fixed radius problem of the original version of the article has a simple solution with time linear in the input and output size (for any fixed dimension, assuming constant time integer truncation and hash table operations): use a hash table to group the points into cubical buckets with side length Δ, and for each point look at all potential neighbors in adjacent buckets. If some bucket contains many points, you might spend a lot of time examining pairs of points that are too far apart, but in that case there are many nearby close pairs against which the time can be charged: both the time and the number of reported pairs are proportional to the sum of the square of the bucket sizes.
This must have been known, probably in the 1960s or 1970s; it's a close relative of several linear-time randomized bucketing closest-pair algorithms, but even simpler. Bentley's 1975 survey mentions this sort of approach to nearest neighbor problems under the heading of "cell techniques", but without the hashing, without any analysis, and with a claim that these techniques require uniformly distributed points. Anyone know a reference for the linear time analysis?
In contrast, by the way, there can be no linear-time bucketing algorithm for finding the nearest neighbor of every point unless one is using a model of computation that can sort in linear time. The reason is that, if one can find all nearest neighbors, one can use something like Borůvka's algorithm to sort in one dimension: the nearest neighbor graph forms a set of non-overlapping paths, and one may replace each path by a single representative point and continue recursively.
but David Mount might know, since he talks about this in the second of his CG lectures (http://www.cs.umd.edu/~mount/754/Lects/754lects.pdf)...
Thanks! It may not be the reference, but at least it's a reference. Before posting this, I did find a reference the 1977 IPL paper he mentions, but I didn't find an online copy to look at; maybe it's in there.
Posts like this make me very happy I'm reading your blog. I have exactly this problem in a graphics program I'm writing. I had briefly considered the cell-based approach, gone "oh, but then _obviously_ there's no lower bound if they are not uniformly distributed", and instead started coding an elaborate Kd-tree based scheme. Which I can now throw away -- yay!
kD-trees can handle a wider class of problems quickly, though. So your code could still very likely be useful.
This problem is related to LSH. In particular, an approximate NN query can be reduced to logarithmic number of queries in such a data structure with different radiuses.
...and one can store enough different versions of this data structure to handle the approximate NN queries, implicitly in linear total space, as a compressed quadtree.