van Emde Boas priority queue implementation
Just in time for my departure for ALENEX and SODA, a post about algorithm implementation!
This year in my graduate data structures class I've moved my coverage of van Emde Boas trees earlier, to part of a group of lectures on priority queues rather than later when I cover the successor problem (binary search trees). vEB trees are a little simpler when they only have to be used for prioritization, but I also think they're more likely to be useful in that application, simply because priority queues and dictionaries are more important than successors.
Anyway, in order to test my own understanding of the subject, I wrote a Python implementation which I'm including in my PADS Python algorithms library.
In Python, built-in types such as sets and dictionaries can be much faster than user-defined types because they run in native code rather than interpreted (insert rant about Snow Leopard breaking psyco here), so it turns out that for 32-bit keys my implementation needs ~256 items in the heap before it beats a naive priority queue that maintains everything in a set object and uses a linear time min(set) every time it wants to find the minimum key. My vEB tree implementation switches to a bitvector based set data structure at its lower levels, for keys that are very small integers; my testing found that using 256-bit bitvectors for 8-bit keys is a slight win over using 16-bit bitvectors for 4-bit keys.
While I was adding stuff to PADS I also fixed a deficiency in SortedSet that Mark Lawrence pointed out: the Python set object can be initialized with any iterable collection of objects, but SortedSet couldn't. Now it can. This is an incompatible change to the API as it means that the comparison function (if any, and if not passed as a named parameter) has to come after the iterable, but it didn't break any of my other code that uses this module and I'd be somewhat surprised if it broke anyone else's.
Comments:
2010-01-15T15:05:30Z
I also think they're more likely to be useful in that application, simply because priority queues and dictionaries are more important than successors.
Ahem. Predecessor search is the problem that must be solved for every packet arriving at an Internet router. So overall, it's used far more often than priority queues (and there is considerable research on it in the networking community.)
Of course, from the standpoint of a Python library, you are probably right that the vEB implementation will not get much use.
2010-01-15T15:05:53Z
I forgot to sign. -mihai
2010-01-15T15:52:03Z
[citation needed].
It is true that every packet needs to be looked up in a data structure by every router it goes through to determine how to route it. And that the data to be looked up is a small integer (a 32-bit destination IP address). I wasn't aware that the sata structural problem to be solved in those lookups was predecessor search, though. I thought it was more of an interval search, the sort of thing that one could solve by a segment tree or interval tree (though one wants faster structures that use the integer assumption).