This week I'm in New Jersey for the Workshop on Algorithms in the Field. Despite the organizers attempting to leave some big blocks of time for breakout sessions, the schedule is packed with good talks.
The first speaker at the workshop was Deepak Agarwal from Yahoo. The intended point of the talk was to describe algorithms for recommending web content — if you go to the front page of Yahoo, what you see is customized to relate to what Yahoo believes your interests to be. However much of the talk became into a discussion of "What Yahoo Stands For". What Yahoo stands for appears to be a corporatized, sanitized view of the web in which readers are purely consumers of infotainment, not producers of anything, and in which they should only be presented with articles on subjects that they have already shown interest in. Everything is filtered through human editorial enforcement to keep out any off-color content or anything else that might be seen as damaging to Yahoo's corporate image, and then filtered through computer algorithms to prevent you seeing anything that might challenge your expectations. I think a good illustration of the point of view comes from the level of detail of the topics he used as examples: "Tiger Woods" and "Britney Spears" were mentioned as topics on par with "Finance".
The second talk was by Stephen Fienberg, a statistician who works on social networks, started from a generative model of network formation and led from there to a simple exponential random graph model in which the log-likelihood of a graph is a weighted sum of features that count the total number of edges in the graph and the degrees of each vertex. He also talked about latent space models and a statistical block model. All of this was quite familiar to me because I've been hearing the same things for a couple of years from my collaborators on a large social network grant that I've been part of. His emphasis was less on algorithms than on why this is a principled statistical model for the problem, though, so he didn't have a chance to talk about the interesting problems in dynamic graph algorithms that come up in trying to make the MCMC estimation methods for these models run quickly.
Sergei Vassilvitskii gave a nice overview of MapReduce and how it relates to other models of high performance computing for massive data such as massive parallelism and data streaming; unfortunately I was too jetlagged to pay a lot of attention.
Jim Demmel talked about big practical speedups in numerical linear algebra, both for direct methods and iterative methods. The iterative methods especially had an impressive improvement: the number of iterations no longer appears in the amount of communication needed (the real bottleneck, rather than the number of flops). At least as he told it, a lot of the motivation for finding these algorithms was in trying to make them match theoretical lower bounds on the same problems. One takeaway message from the iterative part of the talk was that his algorithms involved both dense and sparse linear algebra, and it didn't work well at all to tune these two parts separately because then they both competed for the same cache space: it was necessary to wait until the entire system was put together to start tuning it. Another was to parametrize the implementation well enough that autotuning can be used to fit it to many different specific architectures.
Nancy Amato gave a shorter version of a talk I'd already seen in a one-hour version, on configuration space and probabilistic roadmap techniques for motion planning problems with complex geometry and multiple agents.
Alon Halevy's talk had a forbidding title involving ontology. But what it was actually about was a lot of stories about extracting useful database-like information from the tables and other information that people have put haphazardly on the web, combining it with information that various agencies have shared with Google, and putting it all into useful visualizations. One of the more amusing was a map of Tasmania (a type of symbol map) showing that it has a much higher number of public toilets than the rest of Australia; apparently this confirms all of the rest of Australia's prejudices about what Tasmanians are like.
John Doyle: "We have theories, but they're fragmented, incoherent, and incomplete." We have problems that have arisen from the solution to other problems (autoimmune diseases from an immune system that responds well to infectious diseases) and the need for those solutions is now less strong than it was when they evolved. From which he concludes that we should go back to having more infectious diseases in order to better control our autoimmune diseases. Or at least that's what I got from his talk. Also: you have ten times as many signals going from your brain to your eyes than vice versa. The way our eyes work is that the brain sends a fat pipe of predictions out to the eyes and gets back a much narrower pipe of differences from the predictions. That's why we dream so vividly. Also also: DNS is a mistake, because it breaks the layering structure of the internet by allowing the top application layer to gain access to a much lower addressing layer.
Kristen Bennett said something about mathematical optimization in machine learning algorithms involving support vector machines and how to choose cross validation set sizes that I didn't get much out of. But then somehow at the end it morphed into a graph drawing problem, of drawing planar graphs (or maybe they're just trees) without crossings and with the distances approximately matching some given distance data.
Graham Cormode had a nice survey of data sketching algorithms, which he represented as a haiku that I unfortunately failed to copy down. But basically he's interested in methods that have some properties among: sublinear storage, randomized, linear, can be added together; they include methods such as Bloom filters, minhashing (for which he described a nice variation that estimates the number of distinct items in a set with good multiplicative error), and the count-min sketch (which gives an additive approximation to the frequency of each item in a data stream). Towards the end he listed situations in which these methods are actually used "in the field": he included web log analysis and compressed sensing, and explicitly excluded sensor networks because the theoretical applications of sketching in sensor networks as huge randomly placed networks with too much data to store do not match the scientific applications of at most a dozen or two carefully placed sensors from which one wants to collect all the raw data. He also listed quite a few potential applications, but I think he could have included a lot more real applications if he had included Bloom filters in his application section as he did in his historical survey.
Aloysius Mok spoke about practical applications of temporal logic — it's very important problem in real-time control systems but unfortunately he lost me quickly.
David Karger spoke about several problems in which preexisting theoretical solutions could be plugged into some "in the field" problem. The first involved some statistical machine learning approaches to natural language processing: for instance automatically writing titles for newspaper articles. The machine learning people had already reduced the problem to one of finding maximum-weight paths of length k in a large graph; Karger observed that color coding led to a nice fixed-parameter-tractable algorithm. The second problem he spoke about involved network coding, a form of path sharing in network broadcasting used in the Microsoft "Avalanche" file distribution system; the third involved finding a good load-balanced assignment of users and documents to web caches such as the ones in the Akamai system.
Thank you for this post, I will read about MinHash. > The first speaker at the workshop was Deepak Agarwal from Yahoo. See also: http://www.youtube.com/watch?v=B8ofWFx525s
> , I will read about MinHash. I did know it.
Thanks for the link — very relevant.