Do you have a software project in which you need a fast and space-efficient approximate set data structure, like a Bloom filter? Then probably what you want is actually a cuckoo filter, a plug-in replacement for Bloom filters that is faster, more space-efficient, and more versatile (because it allows elements to be deleted as well as inserted).

Michael Mitzenmacher has described cuckoo filters in an earlier blog post (as well as of course in the published paper about them) but the basic idea is to use a cuckoo hash table cut down in size by storing only a short fingerprint of each key rather than a whole key-value pair. As in a normal cuckoo hash table, keys (or rather their fingerprints) get moved around to make room for other keys, and that leads to a small complication: when you're moving a fingerprint, you don't know which key it came from, so the location to move it to needs to be computable based only on where it is now and on its value. More specifically, the other location for any fingerprint ends up being the xor of its current location with a hash of its value.

Although cuckoo filters have been implemented (see first link) and work well in practice, one drawback is that we didn't know whether they also work well in theory. Conversely, an earlier data structure of Pagh, Pagh, and Rao (SODA 2005) has all the same advantages of cuckoo filters over Bloom filters, but for it, as far as I am aware, there was no implementation, only theory. In contrast, Bloom filters work both actually and theoretically: there is no major gap between theory and practice.

So my new paper, "Cuckoo filter: simplification and analysis" (arXiv:1604.06067, to appear at SWAT) is aimed at closing this gap, although it does not fully do so. What it does is to show that, if you omit the hash of the fingerprint and instead move each fingerprint to the xor of its current location and its value, then cuckoo filtering works well in theory. This simplification causes the filter to be partitioned into many small sub-filters, which operate independently of each other, with each key being assigned randomly to one of them. The main ideas of the paper are that this assigment of keys to sub-filters is very unlikely to be unbalanced and that, within each sub-filter, the data structure behaves just like a cuckoo hash, without any restriction on which pairs of cells the keys can be mapped to. It uses Chernoff bounds to prove the balancing part (as you do), and then just plugs in the existing analysis of cuckoo hashing for the rest.

Of course, it would be better to prove that the actual cuckoo filter works well than this simplification, since I don't think the simplification is likely to be a practical improvement. The graph whose edges connect pairs of cells where a single fingerprint can go has a lot of nice structure and symmetry (usually either a Cayley graph or a disjoint union of hypercubes, compared to a disjoint union of cliques for the simplification), suggesting that spectral graph theory might be helpful, but that's not my forté. Also, my analysis only holds for random hash functions, so it would be good to extend it to realistic methods such as tabulation hashing. Tabulation hashing is known to work for cuckoo hashing, but to get my analysis to work it needs to be extended to blocked cuckoo hashing, a variation that allows multiple keys per cell. Ideally, an analysis of the full cuckoo filter algorithm with a realistic hash function would be best.

But you don't need to wait for that analysis to happen to go start using these things in your code: they already work now. We just don't completely understand why they work.


Out of curiosity, are there similar sorts of things that do work well in practice but not in theory?
Good question. I can't think of any offhand.
Levit's algorithm for shortest paths in a graph