Jekyll2017-05-21T22:02:33+00:00https://11011110.github.io/blog/11011110Geometry, graphs, algorithms, and moreDavid EppsteinBit tricks with cuckoo filters2017-05-21T14:48:00+00:002017-05-21T14:48:00+00:00https://11011110.github.io/blog/2017/05/21/bit-tricks-cuckoo<p>I was pleased to discover that the proceedings of <a href="http://sigmod2017.org/sigmod-accepted-papers/">SIGMOD</a> and <a href="http://sigmod2017.org/pods-accepted-papers/">PODS</a> have been made open-access through their conference websites and the ACM “authorize” feature. It’s not as good as making the official publisher page for the papers open access — you can’t link to individual papers directly this way — but still pretty good.</p>
<p>One of the PODS papers is mine, my first at PODS. Go to the PODS link above and look for the link to “2-3 Cuckoo Filters for Faster Triangle Listing and Set Intersection” (with Mike Goodrich, Michael Mitzenmacher, and Manny Torres). We intend to make an arXiv version eventually, but as a merger with some other stuff including a related brief announcement at SPAA with Goodrich, “Using Multi-Level Parallelism and 2-3 Cuckoo Filters for Faster Set Intersection Queries and Sparse Boolean Matrix Multiplication”, so for now only the PODS version is available.</p>
<p>As the title(s) imply, this is another application of <a href="/blog/2016/04/20/cuckoo-filters-and.html">cuckoo filters, the subject of my paper from SWAT last year</a>. As you may already know, a cuckoo hash table has a constant number of locations for each of its key-value pairs, and moves keys from one location to another if necessary to make room for new keys. In one particularly space-efficient version of this, each key has exactly two locations it might be found at, but each location can hold more than one key (but a constant number of keys). The more locations per key there are, the closer to 100% full the data structure can be made. A cuckoo filter is the same, but stores short fingerprints of each key instead of the whole key or a key-value pair. It provides an approximate set membership data structure (like a Bloom filter), where you can test membership by checking whether the fingerprint for a key is in one of the locations where it should be. No-answers are always correct, but a yes-answer might be incorrect (some other key may have left the same fingerprint in the same place) with a small and controllable false positive rate.</p>
<p>The main insight in the new paper is that, when the fingerprints are small enough, you can pack several of them into a binary word, and this packing allows you to compare two cuckoo filters and find the positions where they have the same fingerprint in the same place. With the two-location cuckoo filters, this wouldn’t give any useful information (the same key might be stored in both filters, but in different locations, so you wouldn’t notice the overlap). But a variant of cuckoo hashing called 2-3 cuckoo hashing, where each key has three possible locations and is stored in two of them, can be used to ensure that the two cuckoo filters we’re comparing do always have overlapping locations for each shared key. (This variant for cuckoo hashing, not filtering, was introduced as the “batmap” by Amossen and Pagh, “<a href="https://arxiv.org/abs/1102.1003">A new data layout for set intersection on GPUs</a>”, IPDPS 2011.)</p>
<p>Based on this idea, we can find the intersection of two sets of cardinality <script type="math/tex">d</script>, represented as cuckoo hash tables with corresponding cuckoo hash filters, on a machine of word size <script type="math/tex">w</script> and with intersection size <script type="math/tex">k</script>, in time <script type="math/tex">O(k+(d\log w)/w)</script>, giving a bit-parallel speedup by a factor of <script type="math/tex">w/\log w</script>.
The same speedup applies to other related problems such as finding or listing triangles in <script type="math/tex">d</script>-degenerate graphs. The idea is to use bit-parallel programming techniques to find matching fingerprints in the cuckoo filter, and then to check those positions for whether they are an actual match using the cuckoo hash table. The <script type="math/tex">\log w</script> factor in the runtime comes from choosing the fingerprint size appropriately, to make the false positive rate low enough so that the contribution to the runtime from checking false positives is dominated by the other parts of the algorithm. This result shaves a <script type="math/tex">\log w</script> time factor over a previous bit-parallel set intersection algorithm of Kopelowitz, Pettie, and Porat, “<a href="https://arxiv.org/abs/1407.6755">Dynamic set intersection</a>”, WADS 2015, which used other ideas (bit-parallel sorting of lists of fingerprints rather than cuckoo filters). Since <script type="math/tex">w</script> is itself likely to be logarithmic in the input size, really this is shaving a <script type="math/tex">\log\log</script> factor from the running time. Nevertheless, we also ran some experiments showing that this technique can lead to practical improvements on triangle-finding for some (but not all) graphs.</p>
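<p>To make the bit-parallel comparison step concrete, here is a minimal Python sketch of one standard way to find equal fields in two packed words. It is not taken from the paper: the 8-bit fingerprint width, the 64-bit word size, the convention that fingerprint 0 marks an empty slot, and all function names are assumptions made for illustration.</p>

```python
def field_masks(word_bits, fp_bits):
    """Masks selecting the low fp_bits-1 bits and the top bit of every field."""
    low = high = 0
    for i in range(word_bits // fp_bits):
        low |= ((1 << (fp_bits - 1)) - 1) << (i * fp_bits)
        high |= 1 << (i * fp_bits + fp_bits - 1)
    return low, high


def pack(fingerprints, fp_bits):
    """Pack a list of small fingerprints into one integer, slot 0 lowest."""
    word = 0
    for i, fp in enumerate(fingerprints):
        word |= fp << (i * fp_bits)
    return word


def nonzero_fields(x, low, high):
    """Top bit of each field is set exactly when that field of x is nonzero.

    Adding `low` to the low bits of a field carries into its top bit iff
    those low bits are nonzero, and the carry never leaves the field.
    """
    return (((x & low) + low) | x) & high


def matching_slots(a, b, word_bits=64, fp_bits=8):
    """Slots where a and b hold the same nonempty fingerprint.

    Assumption (hypothetical convention): fingerprint 0 means "empty".
    """
    low, high = field_masks(word_bits, fp_bits)
    equal = ~nonzero_fields(a ^ b, low, high) & high   # fields where a == b
    hits = equal & nonzero_fields(a, low, high)        # ... and a is nonempty
    slots = []
    while hits:                                        # one step per match
        lowest = hits & -hits
        slots.append((lowest.bit_length() - 1) // fp_bits)
        hits ^= lowest
    return slots
```

<p>The comparison itself uses a constant number of word operations regardless of how many fingerprints fit in a word, and the final loop takes one step per reported match. Each reported slot would then still need to be checked against the underlying hash tables, since equal fingerprints may be false positives.</p>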
<p>After this paper was published, I heard from Mike Rosulek at Oregon State (where I was visiting last week) of a paper extending the 2-3 cuckoo hashing idea in a different but related direction, for private set intersection. I think the reference is “Linear size circuit-based PSI via two-dimensional cuckoo hashing” by Benny Pinkas, Thomas Schneider, Christian Weinert, and Udi Wieder, a not-yet-published manuscript that Pinkas described in <a href="http://www.cs.bris.ac.uk/Research/CryptographySecurity/TPMPC/Slides2017/BennyPinkas.pdf">a talk last April at TPMPC 2017</a>. The idea is to store each fingerprint or key in two out of four locations, instead of two out of three. Suppose set <script type="math/tex">X</script> stores its fingerprints either in north and east or in south and west, while set <script type="math/tex">Y</script> stores its fingerprints either in north and west or in south and east. Then, when <script type="math/tex">X</script> and <script type="math/tex">Y</script> have a key in common, it is guaranteed to appear exactly once in a shared location, whereas 2-3 cuckoo hashing would allow a common key to appear either once or twice in shared locations. The fact that each key in the intersection of the two sets is found only once in the intersection algorithm helps simplify the cryptographic parts of the private set intersection.</p>
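<p>The overlap guarantees of the three layouts can be checked by brute force. The following Python sketch enumerates the possible placements of a single shared key in two tables and reports how many locations the two copies can have in common; the location labels and the function name are made up for this illustration.</p>

```python
from itertools import combinations


def shared_counts(options_x, options_y):
    """Possible numbers of common locations for one key stored in two tables.

    Each option is the set of locations a table might use for the key.
    """
    return {len(set(a) & set(b)) for a in options_x for b in options_y}


# Ordinary two-location cuckoo filter: the key sits in one of two locations.
one_of_two = [(0,), (1,)]

# 2-3 cuckoo hashing: the key sits in two of its three locations.
two_of_three = list(combinations(range(3), 2))

# Two-out-of-four variant as described above, with compass-point labels.
x_options = [("N", "E"), ("S", "W")]
y_options = [("N", "W"), ("S", "E")]

print(shared_counts(one_of_two, one_of_two))      # overlap not guaranteed
print(shared_counts(two_of_three, two_of_three))  # always at least one
print(shared_counts(x_options, y_options))        # always exactly one
```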
<p>(<a href="https://plus.google.com/100003628603413742554/posts/7CB7KYvRsxj">Google+ posting and discussion thread</a>)</p>David EppsteinLinkage2017-05-15T22:18:00+00:002017-05-15T22:18:00+00:00https://11011110.github.io/blog/2017/05/15/linkage<ul>
<li>
<p><a href="http://www.scotusblog.com/2017/04/argument-analysis-concerns-prosecutorial-discretion-likely-lead-ruling-bosnian-serb-immigration-case/">Trump administration claims the right to de-citizen people over minor mis-statements in their naturalization papers</a> (<a href="https://plus.google.com/100003628603413742554/posts/NcUwn6zi9YH">G+</a>). As a naturalized citizen myself, this sort of thing makes me nervous.</p>
</li>
<li>
<p><a href="http://boingboing.net/2017/05/02/french-far-right-leader-le-pen.html">Marine Le Pen is a plagiarist</a> (<a href="https://plus.google.com/100003628603413742554/posts/iGomVWj9BMN">G+</a>). Meanwhile our own plagiarist-in-high-places and actual Nazi, Sebastian Gorka, maintains his office as one of Trump’s top advisors.</p>
</li>
<li>
<p><a href="https://thmatters.wordpress.com/2017/05/02/tcs-wikipedia-project/">The TCS Wikipedia project</a> (<a href="https://plus.google.com/100003628603413742554/posts/SJ4kL8wj8sy">G+</a>), aiming to identify and fix shortcomings in Wikipedia’s coverage of theoretical computer science topics.</p>
</li>
<li>
<p><a href="http://www.nasonline.org/news-and-multimedia/news/may-2-2017-NAS-Election.html">This year’s new National Academy of Sciences members</a> (<a href="https://plus.google.com/100003628603413742554/posts/J2FqMXec1rc">G+</a>) include theoretical computer scientists Dan Spielman and Madhu Sudan. Congratulations!</p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=B5p2A5mazEs">Creating the never-ending bloom</a> (<a href="https://plus.google.com/100003628603413742554/posts/hMR2tp5Agqx">G+</a>), a making-of video for John Edmark’s 3d mathematical sculptures which, when rotated at the right speed and stroboscopically illuminated, appear to grow and bloom much like the tip of a plant stem.</p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=x9bGM9Xke8g">How to tie water in a knot</a> (<a href="https://plus.google.com/100003628603413742554/posts/1rTqNjs1JnB">G+</a>). 3d-printed hydrofoils create knotted vortices, which then twist around themselves and uncross.</p>
</li>
<li>
<p><a href="http://starcage.org/fibonacci_puzzle/fibonacci_puzzle.html">Fibonacci jigsaw puzzle</a> (<a href="https://plus.google.com/100003628603413742554/posts/GhUFT6QhZ3k">G+</a>) based on the spiral patterns of sunflowers and other plants, can also be reassembled to have a missing piece or an extra piece.</p>
</li>
<li>
<p><a href="https://math.stackexchange.com/questions/2273108/checking-if-a-polygon-is-contained-in-another-polygon">Checking whether one polygon is contained within another</a> (<a href="https://plus.google.com/100003628603413742554/posts/Banjm5gGuBN">G+</a>) is trickier than it looks. It doesn’t work to simply test containment at the vertices; you have to look for the edge crossings. But a (complicated) linear time algorithm is possible.</p>
</li>
<li>
<p><a href="http://www.metafilter.com/166853/Purdue-to-Kaplan-Id-buy-that-for-a-dollar">Roundup of links on the Purdue–Kaplan merger of public and corporate education</a> (<a href="https://plus.google.com/100003628603413742554/posts/QCcWyJFHW7Y">G+</a>) and <a href="https://www.insidehighered.com/blogs/world-view/purdue%E2%80%99s-massive-blunder">an editorial calling it a massive blunder</a>.</p>
</li>
<li>
<p><a href="http://mathjax-shrinker.christianperfect.com/">MathJax shrink-o-matic</a> (<a href="https://plus.google.com/100003628603413742554/posts/AggZKLpJ2mX">G+</a>) helps you choose the good parts version of a much longer work.</p>
</li>
<li>
<p><a href="https://www.youtube.com/watch?v=O3RsDIWB7s0">Steve Mould and Matt Parker describe the different types of crystal defects</a> (<a href="https://plus.google.com/100003628603413742554/posts/ZgB2rXgnVra">G+</a>) with some help from ball bearings and ball-pit balls.</p>
</li>
<li>
<p><a href="https://blogs.scientificamerican.com/roots-of-unity/math-under-my-feet/">Octagonal paving tiles</a> (<a href="https://plus.google.com/100003628603413742554/posts/2ci1TdSc9Wh">G+</a>) can only work if they’re non-convex and have some vertices where only two tiles meet, but are actually used in some places. I’ve also seen decagonal tiles (in the shape of a convex octagon with a square glued to one side).</p>
</li>
<li>
<p><a href="http://www.joshmillard.com/2016/06/08/painting-math-on-the-wall/">Josh Millard decorates his home office with artfully painted fractals</a> (<a href="https://plus.google.com/100003628603413742554/posts/GsXT6369j7W">G+</a>).</p>
</li>
<li>
<p><a href="https://gavialib.com/2017/02/who-knows-whose-journals/">Why telling good journals from bad ones in fields other than your own is not always easy</a> (<a href="https://plus.google.com/100003628603413742554/posts/RvkNB9nfk1b">G+</a>) and why bad-journal lists such as Beall’s may be necessary.</p>
</li>
</ul>David EppsteinRoundup of academic-publisher misconduct2017-05-06T15:28:00+00:002017-05-06T15:28:00+00:00https://11011110.github.io/blog/2017/05/06/roundup-academic-publisher<p>I had a bunch of bookmarks saved on academic-publisher misconduct and the brewing revolt of academic libraries against these publishers, intending to post them as links on Google+, but they just kept coming too fast, so instead I’m posting a roundup of them here.</p>
<ul>
<li>
<p><a href="http://onsnetwork.org/chartgerink/2017/01/13/false-claims-of-copyright-and-stm/">False claims of copyright by Elsevier</a>. Elsevier continues to fraudulently claim copyright on public-domain journal articles from the 19th century. The response from their industry association, the International Association of STM Publishers, makes clear that they have no intention of changing their misbehavior voluntarily. Via <a href="https://twitter.com/Senficon/status/857126517385695232">Julia Reda</a> and <a href="https://plus.google.com/+DavidRoberts/posts/6k4tMUDyB1T">David Roberts</a>.</p>
</li>
<li>
<p><a href="https://scholarlykitchen.sspnet.org/2017/05/01/wolf-finally-arrives-big-deal-cancelations-north-american-libraries/">A rapidly-growing set of university libraries are cancelling package deals with major publishers</a> and, more importantly, sticking to their cancellation afterwards. “Relatively few libraries that actually do cancel their Big Deals end up regretting it.” A lot of the cases involve Wiley rather than our favorite target, Elsevier, and several of them involve academic societies rather than for-profit corporations. Via <a href="https://plus.google.com/+TimothyGowers0/posts/GmLVspML39E">Timothy Gowers</a>.</p>
</li>
<li>
<p><a href="https://www.universiteitleiden.nl/en/news/2017/04/geen-akkoord-vnsu-eng-extern">The Dutch appear poised to cancel their national package deal with Oxford University Press</a> after OUP refused to work towards making all Dutch-authored papers open-access and to avoid double-dipping for open-access publications (making authors pay to make their papers open access and then making libraries pay again to access the papers). This appears to affect primarily medical journals. Via <a href="https://plus.google.com/+DavidRoberts/posts/XgMXip61ZCw">David Roberts</a>.</p>
</li>
<li>
<p><a href="https://www.insidehighered.com/news/2017/05/03/louisiana-state-takes-disagreement-elsevier-court">Louisiana State University sues Elsevier</a> after LSU’s School of Veterinary Medicine Library cancelled their redundant deal for Elsevier journals (also covered by a blanket deal for the whole campus) and Elsevier stopped allowing access to the school despite the blanket deal. An odd twist to the story is that <a href="http://policynotes.arl.org/?p=1537">Elsevier refuses to accept service for the lawsuit</a>: Despite doing plenty of business in the US, and taking plenty of legal action there on their own behalf, they have been taking the position of not being a US business when it comes to receiving lawsuits, and have been forcing LSU to go after them through international law. Via <a href="https://plus.google.com/+DavidRoberts/posts/AaRs3N4vcQX">David Roberts</a> and <a href="https://plus.google.com/+TimothyGowers0/posts/5jQrwpHEyQf">Timothy Gowers</a>.</p>
</li>
<li>
<p><a href="http://www.sciencedirect.com/science/article/pii/S0306452217302634">Many predatory journals have been successful in getting indexed by PubMed</a> according to a new study (posted behind a paywall in an Elsevier journal). Beall’s list of predatory journals was used to identify these journals, and is now sorely missed. Via <a href="http://retractionwatch.com/2017/04/22/weekend-reads-culture-fear-blogs-vs-academic-papers-neurosurgery-retractions-rise/">Retractionwatch</a>.</p>
</li>
</ul>
<p><a href="https://plus.google.com/100003628603413742554/posts/4AgEJpiauaX">(Google+ discussion thread for this post)</a></p>David EppsteinLinkage2017-04-30T22:08:00+00:002017-04-30T22:08:00+00:00https://11011110.github.io/blog/2017/04/30/linkage<ul>
<li>
<p><a href="https://cses.fi/book.html">Competitive Programmer’s Handbook</a> (<a href="https://plus.google.com/100003628603413742554/posts/G57MLr5C8zE">G+</a>). Although it is aimed at participants in programming competitions, this new free-online book has most of the same content as an undergraduate algorithms text.</p>
</li>
<li>
<p><a href="http://www.owenschuh.com/">Owen Schuh’s mathematical art</a> (<a href="https://plus.google.com/100003628603413742554/posts/5dZTzwUSm5r">G+</a>). The G+ post links to <a href="http://www.owenschuh.com/albums/work/content/detail-disturbance/">a detail from “Disturbance”</a>, an interlocked set of white arcs on a black field that look much like a Mark Lombardi drawing.</p>
</li>
<li>
<p><a href="https://en.wikipedia.org/wiki/Harry_R._Lewis">Harry R. Lewis</a> (<a href="https://plus.google.com/100003628603413742554/posts/1SuBJpo32Z8">G+</a>) gets a Wikipedia Good Article about him for his 70th birthday.</p>
</li>
<li>
<p><a href="http://www.scottaaronson.com/blog/?p=3221">Scott Aaronson rants about calendars and time zones</a> (<a href="https://plus.google.com/100003628603413742554/posts/Fy9FK4koM9g">G+</a>). The default time zone needs to be “wherever I will be when this event happens”, not “where I was when I put this on my calendar”.</p>
</li>
<li>
<p><a href="https://laughingsquid.com/incredible-underwater-sherwin-williams-paint-commercial-made-without-the-use-of-cgi/">Paint commercial</a> (<a href="https://plus.google.com/100003628603413742554/posts/6ZCfvJFSyVL">G+</a>) with beautiful slow-mo footage of paint streams interacting in water, and no CGI.</p>
</li>
<li>
<p><a href="https://mathoverflow.net/a/267401/440">The strip pentiamond joins the ranks of the reptiles</a> (<a href="https://plus.google.com/100003628603413742554/posts/Vk6jtVTHaR5">G+</a>), in answer to a question on MathOverflow.</p>
</li>
<li>
<p><a href="https://www.quantamagazine.org/20170411-equiangular-lines-proof/">A New Path to Equal-Angle Lines</a> (<a href="https://plus.google.com/100003628603413742554/posts/JUnkYw5Zsrs">G+</a>) <em>Quanta</em> reports on <a href="https://arxiv.org/abs/1606.06620">a preprint of Balla et al</a> showing that for any fixed angle <script type="math/tex">\theta</script>, in high enough dimensions, at most a linear number of lines through the origin can all form angles of <script type="math/tex">\theta</script> with each other. See also <a href="http://www.combinatorics.org/ojs/index.php/eljc/article/view/v7i1r55">an earlier paper</a> showing that if <script type="math/tex">\theta</script> can vary with dimension then quadratically many equiangular lines are possible.</p>
</li>
<li>
<p><a href="https://arxiv.org/abs/1206.2060">A reversible cellular automaton that supports vibrating strings</a> (<a href="https://plus.google.com/100003628603413742554/posts/ChuefgyAZeF">G+</a>). Via Tim Hutton, who posted a nice gif of the automaton in action.</p>
</li>
<li>
<p><a href="http://timewheel.net/amazing-chronophotographs-capture-patterns-birds-flight/">Chronophotographs of birds in flight</a> (<a href="https://plus.google.com/100003628603413742554/posts/Lzw5GxBqscb">G+</a>) by Xavi Bou, showing the time-lapse shapes made by their motion.</p>
</li>
<li>
<p><a href="http://retractionwatch.com/2017/04/26/troubling-new-way-evade-plagiarism-detection-software-tell-used/">Plagiarists are now using automatic paraphrasing software</a> to evade detection (<a href="https://plus.google.com/100003628603413742554/posts/hggyQtyc4nM">G+</a>). Do we need to fight fire with fire by using parameterized string matching algorithms?</p>
</li>
<li>
<p><a href="https://phys.org/news/2017-04-lizard-biology-mathematics.html">A lizard whose scale patterns are generated by a long-term cellular automaton</a> (<a href="https://plus.google.com/100003628603413742554/posts/16orNWRZKUe">G+</a>). The <a href="http://www.nature.com/nature/journal/v544/n7649/full/nature22031.html">original research paper</a> was just published in <em>Nature</em>.</p>
</li>
<li>
<p><a href="http://boingboing.net/2017/04/29/turkey-blocks-wikipedia.html">Turkey blocks Wikipedia</a> (<a href="https://plus.google.com/100003628603413742554/posts/gC9c8HN2ZD7">G+</a>), apparently because Wikipedia refuses to prevent non-Erdogan-supporters from editing.</p>
</li>
</ul>David EppsteinRecognizing serial dictatorships2017-04-29T15:10:00+00:002017-04-29T15:10:00+00:00https://11011110.github.io/blog/2017/04/29/recognizing-serial-dictatorships<p><a href="/blog/2017/04/26/santa-cruz-sorting.html">My previous post</a> concerned fairly assigning students to campus housing based only on the students’ preferences. From a comment by Mark C. Wilson, I learn that this problem has a large literature, and is called either house assignment or one-sided matching. The lottery system I described for solving it (in which the students are randomly ordered and, in that order, choose their preferred house among the ones still available) is called random serial dictatorship. Here, “dictatorship” means that one person chooses the outcome, “serial” means they take turns being that one person, and “random” means that the order in which they take turns is randomized. And the property that I called “stability” (that no subset of students can, after the fact, trade assignments and all improve) is more often called “Pareto optimality”.</p>
<p>The definition of stability (or Pareto optimality) involves arbitrary reassignments among arbitrary subsets of students, so testing it directly would take exponential time. But it turns out to be possible to recognize stable assignments much more easily, based on the fact that the only assignments that can be stable are the ones that could have been chosen by a serial dictatorship.
My previous post explains why serial dictatorships are stable: in every subset of students, the first student to choose gets an unimprovable assignment.</p>
<p>To see that every stable assignment can be constructed in this way, consider an arbitrary assignment <script type="math/tex">A</script> (not necessarily stable) and construct a directed graph <script type="math/tex">G</script> from <script type="math/tex">A</script> as follows. The vertices of <script type="math/tex">G</script> will be the students and the houses, and each edge will connect a student to a house or vice versa, so <script type="math/tex">G</script> will be bipartite. It includes a single outgoing edge from each student, to the house that student is assigned to. However, there may be multiple incoming edges, one from each house that the student would prefer over their actual assignment.</p>
<p>If <script type="math/tex">A</script> happens to have been created by a serial dictatorship, then consider the steps of the serial dictatorship process (in which either a student chooses a house, or a house becomes full) and use the chronological ordering of these steps to place the vertices of <script type="math/tex">G</script> into a sequence. This sequence is necessarily a topological ordering, meaning that each edge of <script type="math/tex">G</script> is directed from earlier in the sequence to later. For, a student cannot choose a house after that house becomes full, so the outgoing edge for each student is properly oriented. And a student cannot pass up a preferred house unless that house is already full, so the incoming edge for each student is properly oriented as well. Thus, when <script type="math/tex">A</script> comes from a serial dictatorship, <script type="math/tex">G</script> is a <a href="https://en.wikipedia.org/wiki/Directed_acyclic_graph">directed acyclic graph</a>.</p>
<p>The reverse is true as well: when <script type="math/tex">G</script> is a directed acyclic graph, its assignment <script type="math/tex">A</script> could have been formed by a serial dictatorship. For, every directed acyclic graph has a topological order. If we construct the serial dictatorship for the student ordering given by a topological order, the students will necessarily choose assignment <script type="math/tex">A</script>. In particular, when <script type="math/tex">G</script> is acyclic, <script type="math/tex">A</script> is stable.</p>
<p>And finally, when <script type="math/tex">G</script> is not acyclic, <script type="math/tex">A</script> is not stable. For, suppose <script type="math/tex">G</script> has a cycle. Then, if we reverse that cycle and consider the new outgoing edges from each student vertex of the cycle, we will find a subset of students (the ones in the cycle) and a reassignment of those students (the new outgoing edges) that improves the assignment of everyone in the subset.</p>
<p>Based on this characterization, we can test whether a given assignment is stable, find a serial dictatorship ordering for it when it is stable, or find an unstable subset of students when it is not, in time linear in the input size (the preference listings of all the students). The problem becomes one of testing whether a given graph is a directed acyclic graph, finding a topological ordering for it, or finding a cycle in it, all of which have standard algorithms taking linear time.</p>
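<p>As a rough illustration of this test, here is a Python sketch that builds the graph described above and runs a standard topological sort (Kahn’s algorithm) on it. The function name and the input format (a dictionary of ranked house lists and a dictionary of assignments) are assumptions made for the example, and it assumes that every house is exactly filled by the students assigned to it.</p>

```python
from collections import deque


def serial_dictatorship_order(prefs, assignment):
    """Return a student order that reproduces `assignment`, or None.

    prefs[s] is student s's ranked list of houses, best first.
    assignment[s] is the house that s was given.
    None means the graph has a cycle, so the assignment is not stable.
    """
    successors = {}
    indegree = {}

    def add_edge(u, v):
        successors.setdefault(u, []).append(v)
        indegree[v] = indegree.get(v, 0) + 1
        indegree.setdefault(u, 0)

    for s, ranking in prefs.items():
        home = assignment[s]
        # Outgoing edge: student -> assigned house.
        add_edge(("student", s), ("house", home))
        # Incoming edges: every strictly preferred house -> student.
        for h in ranking:
            if h == home:
                break
            add_edge(("house", h), ("student", s))

    # Kahn's algorithm: repeatedly remove vertices with no incoming edges.
    queue = deque(v for v, d in indegree.items() if d == 0)
    order = []
    while queue:
        v = queue.popleft()
        order.append(v)
        for u in successors.get(v, []):
            indegree[u] -= 1
            if indegree[u] == 0:
                queue.append(u)

    if len(order) < len(indegree):
        return None  # some vertices lie on a cycle
    return [name for kind, name in order if kind == "student"]
```

<p>For example, with two students who both prefer house X, giving X to the first of them is stable and the returned order puts that student first. If instead each student holds the house the other one prefers, the graph contains a four-vertex cycle and the function returns <code>None</code>. This sketch only reports that a cycle exists; recovering the cycle itself would take a depth-first search instead.</p>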
<p><a href="https://plus.google.com/100003628603413742554/posts/7F8J5Cc5CWj">(Google+ discussion thread for this post)</a></p>David EppsteinThe Santa Cruz Sorting Hat2017-04-26T22:35:00+00:002017-04-26T22:35:00+00:00https://11011110.github.io/blog/2017/04/26/santa-cruz-sorting<p>This is the week when high school students across the US finalize their decisions on which colleges and universities to go to in the following year. My son ended up choosing the University of California, Santa Cruz, where he wants to study computer science.
Unlike faculty at many private schools, we get no special discount for sending our kids to the same university, but this is at least significantly less expensive than the other options he was choosing among, mainly similar-caliber state universities in other states. It’s also his mother’s alma mater.</p>
<p>Anyway, Santa Cruz is in certain respects similar to Hogwarts. The students at Hogwarts are divided up into four houses, with different house characteristics, different accommodations, and different course schedules, and it’s the same at Santa Cruz, only there are ten of them and they’re called colleges. They’re located on different parts of campus (some nestled in the redwoods, some with ocean views); the older ones have names and themes, while the two newest ones are merely numbered (“college nine” and “college ten”; the former “college eight” <a href="http://news.ucsc.edu/2016/09/rachel-carson-college.html">became Rachel Carson College</a> last year). The students live there, at least for their first year, they take a college-specific set of freshman general-education courses, and they even graduate together. One of the things my son had to do before he sent his acceptance in was to rank the colleges: which ones did he want to be in, and which not? (I don’t actually know his ranking, but I imagine I’ll find out which one he ends up placed into.)</p>
<p>So it occurred to me to wonder: Since (I assume) Santa Cruz has no magic Sorting Hat that can simultaneously assign each student to a house that would best fit them and balance the sizes of all the houses, how do they do this placement? I don’t know but I can speculate.</p>
<p>Let’s formalize and simplify the problem by assuming that all incoming students provide a ranking of all colleges, that all colleges have an equal number of slots for new students, and that the colleges themselves do not have any preferences for some students over others except in that they would prefer to get students who want to be in those colleges.
(In practice, the rankings are only partial, and I assume that the college sizes are not all exactly equal.) For most inputs of this sort, not all students can get their first choice (it’s not magic), so we need an algorithm to translate this input into a good assignment. But first, what do we mean by “good”? There are several natural criteria we can consider:</p>
<ul>
<li>
<p>In general, we should try to respect the students’ preferences to the extent possible. A minimal requirement of this type is that the assignment should be stable, in that no group of students could all get better assignments by trading amongst each other. So if some students don’t get what they want, it’s not gratuitous: if they did get a better assignment, it would have to make some other student’s assignment worse.</p>
</li>
<li>
<p>The students should be incentivized to be honest, rather than gaming the system: reporting their rankings accurately should always get them a placement (or distribution of placements) at least as good as any other ranking. In particular students who make an unusual choice of ranking should not get an unusually high chance of a good assignment (incentivizing them to falsely report such a ranking) nor an unusually low chance (incentivizing the ones who would choose such a ranking to report something else).</p>
</li>
<li>
<p>The ranking algorithm’s use of its input should be <a href="/blog/2014/03/01/meaningfulness.html">meaningful</a>. The input rankings are not utilities, and it would be a mistake to turn them into utilities by assigning numerical values to each rank. Harry Potter and Ron Weasley may have had the same rankings and ended up in the same house, but Harry’s “not Slytherin” reflects a very different utility function than Ron’s preference for Gryffindor.</p>
</li>
<li>
<p>Each student should have a fair chance of getting one of their highly-ranked choices. For instance, even if we ignored the preferences and assigned the students completely at random, each student would have a 1/#colleges chance of getting their first choice. Surely we shouldn’t do any worse than that.</p>
</li>
</ul>
<p>The lack of a ranking from the colleges rules out stable marriage algorithms. The meaningfulness constraint implies that formulating and solving this as an instance of the <a href="https://en.wikipedia.org/wiki/Assignment_problem">assignment problem</a> (the problem of maximizing some weighted combination of rankings) would also be problematic. An assignment problem formulation would be problematic in another way, too. Suppose that, out of 100 students, 99 of them rank Ravenclaw first, Gryffindor second, Hufflepuff third, and Slytherin fourth, while the 100th student swaps the ordering of Hufflepuff and Slytherin. Then a weighted matching algorithm will always map that 100th student to Slytherin, violating the honesty and fair chance constraints.</p>
<p>But there is a system that satisfies all of these constraints. It’s the lottery system that I remember from my own college campus-housing-assignment days: we choose a random permutation of the students, and then assign each student (in this random order) the highest-ranked available position. It’s stable, because in any group of students the earliest one in the permutation has an assignment that can’t be improved. It incentivizes honesty, because your ranking doesn’t affect the permutation of the students and then once it’s your turn to be assigned you always want to be assigned according to your actual preferences. It’s meaningful, because it only uses the rankings to compare alternative assignments, and doesn’t try to do arithmetic on them. And it’s fair: if there are <script type="math/tex">H</script> colleges to be assigned to, then for every integer <script type="math/tex">i</script> in the range from <script type="math/tex">1</script> to <script type="math/tex">H</script>, you will have probability at least <script type="math/tex">i/H</script> of getting one of your top <script type="math/tex">i</script> choices, because you have that probability of being one of the first <script type="math/tex">in/H</script> students in the permutation (out of <script type="math/tex">n</script> students altogether) and if you are, it’s not possible for your top <script type="math/tex">i</script> choices to all fill up before you get your assignment.</p>
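<p>As a rough illustration of the lottery described above, here is a minimal Python sketch. The function name <code>lottery_assignment</code>, the dictionary-based input format, and the capacity of 25 places per house are assumptions made for this example, not a description of any real housing system. The test instance is the 100-student one from the previous paragraph.</p>

```python
import random

def lottery_assignment(rankings, capacity, rng=None):
    """Random-permutation lottery (hypothetical helper for illustration).

    rankings: dict mapping each student to a list of colleges, best first;
              every student is assumed to rank every college.
    capacity: dict mapping each college to its number of places.
    Returns a dict mapping each student to the college they receive.
    """
    rng = rng or random.Random()
    order = list(rankings)
    rng.shuffle(order)              # the random permutation of the students
    remaining = dict(capacity)      # places still open at each college
    assignment = {}
    for student in order:
        # Give the student their highest-ranked college that still has room.
        for college in rankings[student]:
            if remaining[college] > 0:
                remaining[college] -= 1
                assignment[student] = college
                break
    return assignment

# 99 students share one ranking; student 99 swaps the last two choices.
houses = ["Ravenclaw", "Gryffindor", "Hufflepuff", "Slytherin"]
rankings = {s: list(houses) for s in range(99)}
rankings[99] = ["Ravenclaw", "Gryffindor", "Slytherin", "Hufflepuff"]
capacity = {h: 25 for h in houses}
result = lottery_assignment(rankings, capacity, random.Random(0))
```

<p>In this instance the student with the unusual ranking is no longer forced into Slytherin: whenever their turn comes among the first 50 students they get Ravenclaw or Gryffindor, just like everyone else. Note that the code only ever compares positions within a single student's ranking and never does arithmetic on ranks.</p>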
<p>I suspect this is not the system Santa Cruz actually uses, though, because they only ask for your top five choices rather than a full ranking and it’s reasonably likely that the unlucky last student in the random permutation would be stuck with a lower-down choice than that. Maybe there are other systems with similar good properties that can also avoid assigning anyone to a really low-rank choice, when such an assignment is possible? (It’s not always possible: consider what happens when everyone chooses the same ranking.) Or maybe they usually get enough students who just don’t care and don’t even return a ranking that they can put those at the bottom of the permutation and give everyone else a better choice? I’d be interested in hearing from anyone who has some inside information on this process. But regardless of whether Santa Cruz actually uses it, the lottery looks like a pretty attractive option for this problem.</p>
<p><a href="https://plus.google.com/100003628603413742554/posts/hsHuyTcBauZ">(Google+ discussion thread for this post)</a></p>David EppsteinThis is the week when high school students across the US finalize their decisions on which colleges and universities to go to in the following year. My son ended up choosing the University of California, Santa Cruz, where he wants to study computer science. Unlike faculty at many private schools, we get no special discount for sending our kids to the same university, but this is at least significantly less expensive than the other options he was choosing among, mainly similar-caliber state universities in other states. It’s also his mother’s alma mater.Russian Gulch photos2017-04-16T16:34:00+00:002017-04-16T16:34:00+00:00https://11011110.github.io/blog/2017/04/16/russian-gulch-photos<p>A couple of weekends ago I returned to Mendocino for a surprise 80th birthday party for my father. There’s now nonstop service from Orange County to Santa Rosa that makes this sort of short trip much more convenient: it’s only two more hours driving from there rather than four from any other airport we could reach.</p>
<p>Anyway, while there, we took a hike along the Fern Creek trail of Russian Gulch State Park, to the waterfall at the end of the trail. <a href="http://www.ics.uci.edu/~eppstein/pix/ferncreek/">My photos from the hike</a> are mostly studies of the green spring textures of the area. There is one of the actual waterfall, but because I packed light for the trip I didn’t have a wide enough lens to take it all in at once. Here’s a photo of a patch of forget-me-nots and horsetails by the side of the trail:</p>
<p style="text-align:center"><img src="http://www.ics.uci.edu/~eppstein/pix/ferncreek/5-m.jpg" alt="Forget-me-nots and horsetails on the Fern Creek Trail, Russian Gulch State Park, California" style="border-style:solid;border-color:black;" /></p>
<p><a href="https://plus.google.com/100003628603413742554/posts/6GaCjpULpBE">(Google+ discussion thread for this post)</a></p>David EppsteinFirst linkage for my new site2017-04-15T16:17:00+00:002017-04-15T16:17:00+00:00https://11011110.github.io/blog/2017/04/15/first-linkage-for<p>…and the first one I’m doing in markdown instead of html. Moving gives me a chance to rethink any blogging habits I might have gotten into, and change the ones that aren’t working, but I think I’ll keep doing these — regardless of whether others like them, I find them useful for myself for finding my old G+ posts. On the other hand, I’m changing up the format a little, to put longer description after the links instead of trying to limit each to a single line.</p>
<ul>
<li>
<p><a href="https://www.insidehighered.com/blogs/world-view/attack-independent-universities">Attacks on independent Universities in Europe</a> (<a href="https://plus.google.com/100003628603413742554/posts/QD9htP9fbgA">G+</a>). The Central European University in Hungary is under attack from the Hungarian government, but it’s not the only one.</p>
</li>
<li>
<p><a href="https://plus.google.com/+LuisGuzmanJr/posts/cNdYGCm8Ric">The Collatz conjecture in color</a> (<a href="https://plus.google.com/100003628603413742554/posts/ad5A1bKV6Sp">G+</a>). This visualization of the branching process inverse to the Collatz-conjecture process draws an infinite binary tree by turning a small amount left or right at each branch. It makes pretty organic-looking curved tangles of lines, but I don’t think it is very helpful in distinguishing this process from any other branching process with similar parameters.</p>
</li>
<li>
<p><a href="https://www.quantamagazine.org/20170404-gerrymandering-math-standard/">Quanta on gerrymandering</a> (<a href="https://plus.google.com/100003628603413742554/posts/EoY63gMG3cn">G+</a>). 30 years after “a ruling that rejected nearly every available test for partisan gerrymandering”, will the Supreme Court accept the “efficiency gap” standard used in a Wisconsin ruling from a lower court?</p>
</li>
<li>
<p><a href="http://news.livejournal.com/151767.html">Livejournal announces new terms of service</a> (<a href="https://plus.google.com/100003628603413742554/posts/6o4yfaepsTe">G+</a>). I moved my journal here and have now deleted my account from LJ (forgoing five months of already-paid service) because I cannot accept their newly-ubiquitous ads, restrictions on speech, rejection of pseudonymity, and promises to spam my email. See also <a href="http://www.metafilter.com/166250/LiveJournal-now-bans-political-talk-LGBT-talk">a related Metafilter discussion</a> which clarifies the mysterious “Federal act 149-ФЗ” parts of the ToS.</p>
</li>
<li>
<p><a href="http://www.latimes.com/local/abcarian/la-me-abcarian-pence-marriage-20170405-story.html">If professional women and men cannot be alone together, women are the ones who will pay a price</a> (<a href="https://plus.google.com/100003628603413742554/posts/U5k4HhFkx1o">G+</a>). Although ostensibly about Vice President Mike Pence and the US political right, this is also relevant for academia and the ongoing push from the left to shut down inappropriate relations between faculty and students.</p>
</li>
<li>
<p><a href="https://www.mathjax.org/cdn-shutting-down/">You need to update the MathJax library address in your web pages</a> (<a href="https://plus.google.com/100003628603413742554/posts/NFbUtWmp4ba">G+</a>). <a href="http://unix.stackexchange.com/questions/112023/how-can-i-replace-a-string-in-a-files">Here’s how</a>.</p>
</li>
<li>
<p><a href="https://www.smashingmagazine.com/2014/08/build-blog-jekyll-github-pages/">Building a blog with Jekyll and GitHub Pages</a> (<a href="https://plus.google.com/100003628603413742554/posts/dEUHR5WwLsz">G+</a>). What I did to move my blog. The comments discuss some related alternatives.</p>
</li>
<li>
<p><a href="http://retractionwatch.com/2017/04/05/supreme-court-nominee-gorsuch-lifted-earlier-works-scholarly-papers-report/">New Supreme Court Justice Gorsuch is a plagiarist</a> (<a href="https://plus.google.com/100003628603413742554/posts/Kp5zLiCjRWb">G+</a>). Not that that’s anywhere close to the worst thing about him or about the Trump administration that stole his seat for him.</p>
</li>
<li>
<p><a href="https://plus.google.com/+DavidRoberts/posts/R5XDjVpb6qc">Elsevier changed the terms of their “Open Access” user license</a> (<a href="https://plus.google.com/100003628603413742554/posts/FnK4qpAG5M3">G+</a>).
In particular it no longer seems to be permissible to display, adapt, or redistribute their papers. (If we can’t display them, how are we supposed to read them?) <a href="https://sbseminar.wordpress.com/2017/04/09/and-elsevier-taketh-away/">The secret blogging seminar has more analysis</a>.</p>
</li>
<li>
<p><a href="https://backchannel.com/how-google-book-search-got-lost-c2d2cf77121d">How Google Book Search got lost</a> (<a href="https://plus.google.com/100003628603413742554/posts/g4AUUHw7Xzc">G+</a>). The project doesn’t seem to be dead, exactly, but it’s stagnating. Via bit_player.</p>
</li>
<li>
<p><a href="https://thinkprogress.org/new-york-times-hires-extreme-climate-denier-after-hyping-itself-as-antidote-to-fake-news-441826c4071d">After hyping itself as antidote to fake news, New York Times hires extreme climate denier</a> (<a href="https://plus.google.com/100003628603413742554/posts/QioGH2Hrtgs">G+</a>). Bret Stephens may be an anti-Trump Republican but that doesn’t prevent him from being a shill on other matters.</p>
</li>
<li>
<p><a href="http://dangerousminds.net/comments/soap_bubbles_become_psychedelic_works_of_art">Soap bubble photography</a> by <a href="http://williamhortonphotography.com/">William Horton</a> (<a href="https://plus.google.com/100003628603413742554/posts/jCgmt3147jC">G+</a>). Via <a href="http://www.metafilter.com/166297/Thin-line-between-heaven-and-here">Metafilter</a>.</p>
</li>
<li>
<p><a href="https://mathoverflow.net/q/263667/440">Drawing trees on small number of lines in 2D and 3D</a> (<a href="https://plus.google.com/100003628603413742554/posts/SXYxF2rwi2k">G+</a>). By re-using the same line for many edges, it is possible to draw some trees on many fewer lines than the number of edges in the tree. Does using lines in 3d instead of in the plane reduce the number of lines needed?</p>
</li>
</ul>David EppsteinStable grid matching2017-04-11T17:17:00+00:002017-04-11T17:17:00+00:00https://11011110.github.io/blog/2017/04/11/stable-grid-matching<p>A group of us at UCI have been trying to understand algorithmic problems in political redistricting (with the goal of finding methods that are fair, difficult to gerrymander, and efficient to calculate). Although the real goal is political fairness (a difficult concept to define let alone optimize for), there are other important criteria in redistricting: districts should be close to equal in population, and should be geographically compact.</p>
<p><a href="/blog/2016/06/15/linkage.html">In a post last year</a> I linked to some interesting work from Yuval Peres and others at Microsoft Research on <a href="http://yuvalperes.com/stable/stable.html">Voronoi diagrams defined from stable marriages</a> rather than closest distances. This method takes a region of the plane (the unit square, say) and a collection of center points within the region, and divides the region into equal-area subregions that are (mostly) close to their centers; an example is shown below, with colors distinguishing the different subregions from each other. We thought this might make a good abstraction of the redistricting problem: if the centers represent voting places, and units of area represent numbers of voters (so the voters are uniformly spread around the square) then it will give us subregions for each voting place that are of equal population and near their voting place. More specifically, by definition, there should be no pair <script type="math/tex">(v,p)</script> where <script type="math/tex">v</script> is a voter whose assigned polling place is farther from <script type="math/tex">v</script> than <script type="math/tex">p</script>, and <script type="math/tex">p</script> is a polling place whose assigned voters include people farther from <script type="math/tex">p</script> than <script type="math/tex">v</script>.</p>
<p style="text-align:center"><img src="/blog/assets/2017/stable-grid-matching.png" alt="900 x 900 stable grid matching" /></p>
<p>Our new preprint, “<a href="https://arxiv.org/abs/1704.02303">Algorithms for Stable Matching and Clustering in a Grid</a>” (arXiv:1704.02303, to appear at IWCIA) starts by looking at efficient algorithms for this problem. It turns out that the symmetry of the preferences (both voters and voting places prioritize each other by the same distances) really helps. One could use a dynamic nearest-neighbor data structure to repeatedly find and match voters to places until each voter has a place and each place has enough voters, but one can do even better than that by applying the <a href="https://en.wikipedia.org/wiki/Nearest-neighbor_chain_algorithm">nearest-neighbor chain algorithm</a> to repeatedly find mutual nearest neighbors of unplaced voters and not-yet-full voting places, avoiding the overhead of nearest neighbors. Based on this idea, we show that constructing images such as the one above by matching an <script type="math/tex">n\times n</script> grid of pixels to some smaller number of centers can be performed in time <script type="math/tex">O(n^2\log^5 n)</script>. This is the first application of nearest-neighbor chains that I’m aware of outside of hierarchical clustering. But the underlying nearest neighbor data structures used in these algorithms are still too complicated to be practical, so instead we implemented and experimented with simpler heuristics that nevertheless obtain significant speedups over naive stable marriage algorithms.</p>
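<p>To make the role of symmetric preferences concrete, here is a small Python sketch of the naive approach: sort all (pixel, center) pairs by distance and greedily accept each pair whose pixel is still unmatched and whose center still has room. This is not the algorithm from the preprint and not one of the heuristics we implemented; the function names <code>stable_grid_matching</code> and <code>blocking_pairs</code>, the tie-breaking rule, and the way quotas are rounded are all assumptions made for the example. It sorts all <script type="math/tex">n^2k</script> pairs for <script type="math/tex">k</script> centers, so it is only suitable for tiny grids.</p>

```python
from itertools import product

def stable_grid_matching(n, centers):
    """Match each pixel of an n-by-n grid to one of the given centers.

    Naive method (an illustrative assumption, not the paper's algorithm):
    because both sides rank each other by the same distances, accepting
    pairs in increasing order of distance yields a stable matching.
    """
    k = len(centers)
    # Quotas as equal as possible: the first (n*n % k) centers get one extra.
    quota = [n * n // k + (1 if c < n * n % k else 0) for c in range(k)]
    pairs = sorted(
        ((x - cx) ** 2 + (y - cy) ** 2, (x, y), c)
        for (x, y) in product(range(n), repeat=2)
        for c, (cx, cy) in enumerate(centers)
    )
    match = {}
    for _, pixel, c in pairs:
        if pixel not in match and quota[c] > 0:
            match[pixel] = c
            quota[c] -= 1
    return match

def blocking_pairs(match, centers):
    """Return every (pixel, center) pair that would violate stability."""
    def d2(pixel, c):
        return (pixel[0] - centers[c][0]) ** 2 + (pixel[1] - centers[c][1]) ** 2
    # Squared distance from each center to its farthest assigned pixel.
    radius = [0] * len(centers)
    for pixel, c in match.items():
        radius[c] = max(radius[c], d2(pixel, c))
    return [
        (pixel, c)
        for pixel, assigned in match.items()
        for c in range(len(centers))
        # pixel is closer to c than to its own center, and c has a
        # farther pixel assigned to it: both would rather switch.
        if d2(pixel, c) < d2(pixel, assigned) and d2(pixel, c) < radius[c]
    ]

# Hypothetical example: a 12 x 12 grid with five centers.
centers = [(1, 1), (2, 9), (8, 3), (10, 10), (5, 5)]
match = stable_grid_matching(12, centers)
```

<p>The second function checks the stability condition directly, so it can also be used to test the output of any faster method on small inputs.</p>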
<p>Choosing the voting places (the centers of each region) uniformly at random tends not to work too well. Random fluctuation causes some parts of the grid to have too many centers and others too few, and when that happens the centers in the dense regions have to reach out a long distance to find voters in the sparse regions who can be assigned to them. We end up getting disconnected regions with very high radius. (In the above image, the boundaries are straight lines and circular arcs centered on the voting places, so you can find these bad regions by looking for high-radius curved boundaries.) We thought we would be able to fix these problems by using a variant of <a href="https://en.wikipedia.org/wiki/Lloyd%27s_algorithm">Lloyd’s algorithm</a> adapted for this kind of Voronoi diagram: alternate between steps that compute the stable matching and that move the centers to a more central point within their region. But although it did lead to better-behaved subdivisions, it didn’t entirely eliminate the problems.</p>
<p>Of course, actual voters must deal with road distance, not straight-line distance. For instance, Lucia, California and King City, California had very different road distances and straight-line distances, even before the recent <a href="http://www.mercurynews.com/2017/03/20/caltrans-highway-1-replacement-bridge-in-big-sur-ready-in-six-months/">bridge outage on Highway 1</a> made it essentially impossible to get from one to the other. And populations are not evenly distributed by area. So, beyond the question of finding an equal-area stable subdivision, turning this into a usable redistricting algorithm will require additional research.</p>
<p><a href="https://plus.google.com/100003628603413742554/posts/BketmBvibi4">(Google+ discussion thread for this post)</a></p>David EppsteinBack up and running2017-04-10T17:59:00+00:002017-04-10T17:59:00+00:00https://11011110.github.io/blog/2017/04/10/back-up<p>My transition from LiveJournal (taken over by Russians with new and unacceptably restrictive terms of service) to another journaling system is more or less complete. I think all of my old posts and most or all of the comments are now here at this new address (11011110.github.io/blog/ at least for now — in principle I could replace that with a custom domain name but I haven’t seen the need to do so yet). Because I’m hosting this through github, the actual <a href="https://github.com/11011110/blog">source code for the blog</a> is also public. The old LiveJournal site still exists but I’m likely to take it down sometime before the next automatic renewal of my paid account in September, so now would be a good time to update links. If I ever do get a custom domain or otherwise change hosts, I’m very likely to keep the same naming scheme, so future updates should be a lot easier.</p>
<p>Some miscellaneous observations on the transition:</p>
<ul>
<li>
<p>Keep backups of content you host on third-party sites! They’re what allowed me to make this transition as easily as I did. And the inability to continue making backups was what first alerted me to the fact that something had gone seriously wrong at LiveJournal.</p>
</li>
<li>
<p>I’m using Jekyll, a static site generator, to turn blog entry text or markup (but without all the boilerplate html connecting the entries to the rest of the blog) into web pages. GitHub automatically runs Jekyll so I don’t have to, but for the past few days while I convert the old entries to the new format I’ve been running a local copy of the Jekyll server code, so I can preview what the entries look like without having to put them up for the world to see. Rebuilding the whole site (around 1300 posts) takes between 30 and 90 seconds depending on which machine I run it on, not really real-time but a lot better than some old complaints about Jekyll that I found on the net. This workflow should even allow me to work with the blog on airplanes or other locations without network connectivity, and the fact that it’s all in git (a version management system I’m already familiar with from using it for my papers) gives me a local backup automatically and makes synchronization between machines easy.</p>
</li>
<li>
<p>Because I’m using a static site generator, I don’t have a place for people to leave comments on the posts (the old comments are baked in and I don’t plan to allow new ones). Probably Jekyll can integrate with third-party commenting services like Disqus but I’m not looking into doing anything like that. Instead my current plan is simply to link to new posts on Google+ (as I have been doing anyway) and then add a link to each new post pointing to the Google+ link as a place where comments can be added. Disallowing third parties from adding any content to this site saves me from a lot of headaches involving security and spam.</p>
</li>
<li>
<p>As well as comments, the other feature I don’t have any more is tags. It is possible to generate tags in Jekyll but my impression is that it creates significant slowdowns in build time. I’m not sure how easy it is to get them to work on GitHub. And I don’t think I was making very effective use of them on LiveJournal, so if I did start using tags again I’d want to rethink how I categorize things.</p>
</li>
<li>
<p>All the old posts are formatted in html but for newer ones I am likely to use markdown (or actually kramdown, the default for new Jekyll sites) to save on having to get the details right (like making sure I close all of my tags). As part of the process of converting the old blog entries to the new format, I found and fixed quite a few errors of this type. And it looks like if I want to do something trickier in html, it should usually be possible to do it within a larger markdown post. Both formats allow MathJax-formatted mathematical expressions: Jekyll and kramdown support MathJax almost out of the box, with the only change needed being to add the appropriate JavaScript to my headers. They also provide syntax highlighting for code snippets. I have used these features to clean up the formatting on some old posts but many of them remain in their old pre-MathJax state.</p>
</li>
<li>
<p>I’m using the vanilla theme that comes with Jekyll, “minima”, with only a few customizations e.g. to what appears in the page headers and footers. Github only supports a small number of themes (if you host elsewhere there are many more to choose from) and I liked the simplicity of this one better than the others. It doesn’t get in the way of reading the actual posts with too much decoration. One of the mistakes I made, though, was thinking that I needed to use github’s “set theme” pulldown menu. Minima is not one of the themes listed in this pulldown and when I tried it I broke the site.</p>
</li>
<li>
<p>There are still a couple hundred links from my posts to other posts, pointing to the old location of the blog. It will take a concerted effort to clean all of these up, if I ever bother to do so. Unfortunately there’s no good way to automatically translate the addresses of those old posts into the new addressing scheme; it’s a matter of looking them up one-by-one by hand. Probably there are other ongoing issues with bad formatting. If you discover any (especially on posts that have some ongoing significance, which is far from all of them) please let me know so I can fix them.</p>
</li>
</ul>
<p><a href="https://plus.google.com/100003628603413742554/posts/KSuMcfE7AxQ">(Leave comments on Google+)</a></p>David Eppstein