Invertible Bloom antimatroids

I’ve written here many times about ready lists, the familiar programming pattern in which you collect actions to be performed and pull them off the list one at a time to perform them, and about antimatroids, the algebraic structure generated by the orderings of events that you can get in this way. Here’s another example that I didn’t notice until recently, despite it being prominent in my own publications: the peeling sequences of invertible Bloom filters and invertible Bloom lookup tables.

An invertible Bloom filter maintains a set or multiset of keys, subject to dynamic insertions and removals, in a fixed amount of memory proportional to its capacity, the number of keys it can store. You can insert more keys than its capacity, and then delete them again; as long as the deletions bring it back below the capacity, you can recover all the remaining keys in the set (with high probability). An invertible Bloom lookup table differs from the filter version of the data structure only in that each key is associated with a value (which must be known both when it is inserted and when it is deleted) and these values can be recovered with the keys. These data structures come from the following publications:

D. Eppstein, and M. T. Goodrich (2011), “Space-efficient straggler identification in round-trip data streams via Newton’s identities and invertible Bloom filters”, IEEE Trans. Knowledge and Data Engineering 23 (2): 297–306, arXiv:0704.3313. (A preliminary version of this was presented at WADS 2007.)
D. Eppstein, M. T. Goodrich, F. Uyeda, and G. Varghese (2011), “What’s the difference? Efficient set reconciliation without prior context”, Proc. ACM SIGCOMM, pp. 218–229.
M. T. Goodrich and M. Mitzenmacher (2011), “Invertible Bloom lookup tables”, Proc. 49th Allerton Conf., pp. 792–799, arXiv:1101.2245.

The data structure consists of a collection of cells (a constant factor more of them than its capacity). Each key or key–value pair is stored in a constant number of cells; a hash function maps each key to its cells. Each cell stores between two and four pieces of information: in the most basic version, these are the number of keys stored in that cell and the sum of the keys stored in it. Other versions add a sum of checksums of the keys stored in it, and a sum of values stored in it, all maintained using modular arithmetic so that they will remain accurate and not overflow even for a cell storing many keys. To insert a key we simply add it to each of the cells the hash function maps it to. Deleting a key is the same, but subtracting rather than adding.

A cell is called “pure” if it holds only one key (possibly multiple times). For sets this can be detected by checking that the count value of the cell equals one. Variants that allow multisets and use checksums can instead divide the cell’s sum of keys by its count (using arithmetic modulo a large prime) to give the key that would be stored there if it were indeed pure, and then check whether this matches the cell’s sum of checksums. To recover all keys, repeatedly find a pure cell and delete its key, until all cells are empty. When the number of keys is below the capacity, this peeling process works with high probability.

Here’s where the ready list / antimatroid idea comes in. A pure cell stays pure until its key is peeled. So we can implement the peeling process by maintaining a ready list of keys in pure cells, and then repeatedly peeling one of those keys. When we peel a key, we check whether this creates any new pure cells, and if so add their keys to the ready list (if they were not already there). Because a key is added to the ready list based only on which other keys have already been peeled (and not on the order in which they were peeled), and stays there until it is peeled itself, the orders in which keys can be peeled define an antimatroid.

An immediate consequence, easy to prove directly for this process but true for all antimatroids, is that it doesn’t matter what order you peel the keys in. If there is a peeling order that works to peel all keys, then every peeling order works. If you get stuck with some subset of keys peeled, then every peeling order will get stuck after exactly the same subset.

Not every antimatroid can be generated in this way. The invertible Bloom antimatroids have the following special property: the condition for a key to become ready for peeling is a disjunction (a boolean and) of a constant number of conjunctions (boolean ors). In terms of the data structure, a key is ready to peel when one of the constant number of cells that it is mapped to becomes pure, and this happens when all of the other keys in the same cell have already been peeled. For the same reason, each key can free up at most a constant number of other keys when it is peeled. Another special property of an invertible Bloom antimatroid is that, when it is completely decodable, you can choose the last two keys in either order. They both map to the same number of cells, so they both have the same number of pure cells and the same number of cells that they share with the other key. If no other keys are left, and one of the two has a pure cell, so does the other one.

(Discuss on Mastodon)