8F day two

There were fewer talks and more breakout session time on day two of the Workshop on Algorithms in the Field, but there were still some very good talks.

Gary Miller talked on solving linear systems defined from weighted graphs (essentially the diagonal degree matrix minus the adjacency matrix). These systems have surprisingly many applications involving electrical circuit simulation, heat flow in circuits, computation of Google pageranks, image segmentation, image denoising, maximum flows, approximate graph separators, graph drawing by simulation of spring systems, and spectral graph drawing and partitioning. Spielman and Teng in 2004 showed that these linear systems can be solved approximately by iterative methods in an amount of time that differs from linear by a polylogarithmic factor, but a big one: log¹⁵ or so. Miller shows how to cut the logs down to one or two. The main idea is to use preconditioners formed by a combination of random sampling based sparsifiers (with sampling probability proportional to electrical resistence, estimated using low-stretch spanning trees) to reduce the number of edges and judicious use of Gaussian elimination to reduce the number of vertices. But as Gary said somewhere in the middle about one of his steps, "Unfortunately it would be nice to have practical algorithms for this...the low stretch spanning tree is a problem." The fact that all of these algorithms are numerical and approximate may also be a limitation, at least in some of the applications: in Tutte's spring embedding method for finding convex drawings of planar graphs, for instance, one needs polynomially many bits of precision to find a good solution, and these methods have a term in their running time proportional to the number of bits.

Anirban Dashupta spoke about his work at Yahoo dealing with large sparse heavy tailed distributions on user data. One example was spam filtering: after some preliminary filtering steps, they turn emails into word frequency vectors, use random projection to reduce the dimension of the vectors, and represent user filtering preferences as linear separations of the reduced-dimension space. Another topic in his talk was finding near-duplicate query results using locality-sensitive hashing (by now a fairly standard idea); by reducing the independence of the different hash functions used in this step he was able to perform it more quickly.

Eugene Fiume gave a beautiful talk about the challenges inherent in computer graphics that largely ended up being a review of art history. He also showed some optical illusions, an uncanny valley example, and an amusing applet of a walking stick figure with a slider for male/female stereotyped behavior. "The human vision system doesn't just sense things, it tries to make sense of things", which is why techniques such as dithering can improve the appearance of images despite reducing mathematical measures of how accurate they are.

David Blei gave us an introduction to topic modeling: in a now-standard method in this area developed by Blei, latent Dirichlet allocation, one models a topic as a probability distribution over words and the words in an individual document as being drawn from a mixture of a small number of topics. Topics can drift (their distributions of words changes over time), and he discussed an idea for using this to measure the impact of a paper: influential papers are the ones with the biggest effect on topic drift. Despite having nothing to do with citation counts or other social network data (it just uses the words in the papers) this does have a strong correlation to citation counts, but it also misses some influential papers (an example being Marcus et al. 1993 on the Penn Treebank, which is influential less in terms of its language and more in the dataset it describes). He also talked about using these models in recommendation systems: on their own (using only document word content to recommend research papers to other researchers) they work pretty badly, but they can be used together with collaborative filtering to give better recommendations than collaborative filtering alone.

David Kempe took a skeptical point of view on the artificiality of mathematical models of social networking problems. For instance, finding influential sets of nodes in a simulated epidemic is hard to optimize and hard to approximate, but a greedy algorithm works very well when certain thresholds in the model are uniformly random. We claim we're making this assumption because we don't know what the thresholds should really be, but it also makes the math much nicer. Is it at all realistic, though? Another example he looked at was the Watts–Strogatz–Kleinberg small world model in which a grid graph with inverse-square-probability long range lengths allows greedy routing to find short paths. The model is very delicate — other probability distributions on the long range edges don't work – but it turns out that it is not at all a good fit to real data from LiveJournal, in part becase the underlying geographic distribution of LiveJournal users is highly nonuniform. He asks "are models driving algorithms, or are algorithms driving models?" As computer scientists, we're not the only ones guilty of doing this: he skipped over it more quickly because his audience consisted of computer scientists, but he also had slides about similar problems in the physics community's work on social networks. His final recommendation: actually talk to the sociologists.

According to Balaji Prabhakar, societal networks (such as transportation or the energy grid, as distinguished from social networks) combine resources, technical mechanisms, and human actions. We can improve the technology, but we can also change the incentives for people so that they become more likely to do the right thing. As an example, he talked about a project that involved convincing people to return recyclable bottles. Instead of offering a nickel for each bottle, he offered people a small chance at a much larger amount of money, with the same expected value, so that monetarily the effect is the same, but (because the utility to the individual people is different) the incentive is much greater. It wasn't clear how much of what he talked about involved computational challenges rather than human-engineering challenges, though.

Martín Farach-Colton started working on external memory data structures after the failure of the PRAM model in the early 1990s (which he still sees as a failure of nerve rather than a failure of ideas). In part he chose data structures because of his contrarian nature: data structures are not as popular a subject for research as perhaps they should be, because of a prevailing view that they can be ugly and messy and overly detailed (I don't hold this view, and sometimes work on data structures myself because I think they can be pretty). As Martín misquoted Feynman: "Algorithms research is like sex: sure it may give some practical results but that's not why we do it". He also taught us something about the history of data structures: what does the B in B-trees stand for? It's Boeing, because they were invented at Boeing Labs. Another quote, regarding the motivation for cache-oblivious data structures (you "only" have to pay a constant factor slowdown in exchange for not having to tune them, and they work hard to optimize every level of the memory hierarchy when only the outer level actually matters): "Every theory paper has to have a bogus embarrassing motivation in the introduction and every systems paper has to have a bogus embarrassing theorem." But the meat of his talk came with the observation that this motivation was inaccurate: the cache-oblivious replacement for B-trees doesn't have the slowdown one might expect it to, and is in some cases significantly faster than B-trees despite not being tuned to the hardware the way B-trees can be tuned. The reason the cache-oblivious structures work so well is that prefetching is an important part of the memory hierarchy, but doesn't work using fixed block sizes, instead providing long sequential reads of your disk with unpredictable sizes, and cache-oblivious data structures work well for this kind of access pattern. Martín took this insight and ran with it for several important systems problems: he showed how to design file systems that allowing both insertions of many tiny files and traversal of the whole disk to be fast, and in databases he showed how to allow for random insertions into sorted indices while only paying a tiny fraction of a disk access per insertion rather than a whole disk access. In the database application, testing whether the inserted keys are unique is a bit more problematic; Bloom filters can be used to solve this subproblem, but Martín has an improvement with better memory behavior in an upcoming paper at Hot Storage. It took 11 years for him to go from from "cache oblivious algorithms seem like a cool idea" to "my customer's system runs 80 times faster", though, so it's important for people like NSF to support basic research in the hope that some it will in future become equally successful.

Comments:

dkorduban:
2011-05-18T13:04:49Z
Nice reading! I always thought that B stands for block. Anyway, there are a number of versions. http://en.wikipedia.org/wiki/B-tree#Etymology_unknown

11011110:
2011-05-18T13:45:52Z
Thanks for the pointer. I should have made more clear that "Boeing" was Martín's assertion, not something I am taking as necessarily true myself. Within the talk, I think the "block" and "Bayer" hypotheses were suggested by the audience, but Martín didn't think very highly of the Bayer possibility because it ignore's McCreight's contribution.

11011110:
2011-05-18T15:32:36Z
Apparently Bayer himself denies the Boeing story, also, but won't say what it really stands for.

11011110:
2011-05-19T14:24:11Z
David Kempe just sent me the following clarification by email: When I talked about the LiveJournal experiment to verify the Small World model, the point was not so much that the Small World model was not a fit with real-world data. It was more that the initial methodology - use the model as though the nodes were evenly spaced - was in retrospect the wrong way to do things, and the natural fix is to correct for densitites with the "rank-based" model. The simplest and most natural fix, without any further tuning, gave an excellent fit. So my point was that the model may actually be quite accurate, and moreover, that the slight variation proposed in the LiveJournal paper was a very well-motivated modification (as opposed to a large number of other model modifications, which I personally see as highly unmotivated). I guess I didn't articulate that as clearly as intended in my talk.