# 8F day three

Today at the Workshop on Algorithms in the Field there was only a half-day of talks.

As Krishnaram Kenthapadi described in his talk, textbooks in developing countries (by which read: India) are not uniformly well written. In particular they may have issues involving stilted text, bad or missing coverage of certain important topics, and political biases. Kenthapadi's project concerns automatically identifying these gaps by analysing the syntactic complexity and topic coverage in a text, and then finding web content (especially text and images) that can be used as supplementary material. One of their working hypotheses is that, in well written texts, the key concepts in each section are well connected (their co-occurrences form a relatively dense graph), while less well written texts present multiple isolated concepts in their sections making them hard to understand. They find concepts as noun phrases in the text, clustered by mapping them to Wikipedia articles. Also, to determine the weights to assign to their concept density and syntactic complexity measures, they use a machine learning algorithm with a training set based on finding the best fit of a document's text to different versions of Wikipedia articles and letting the maturity of the best-fitting version stand in as a proxy for document quality. This seemed interesting to me not so much because of the heavy use of Wikipedia within their algorithms (which seemed rather badly motivated to me) but more because, if this sort of analysis turns out to work well for textbooks, it should also work for other types of technical writing such as research papers or Wikipedia (as his "future work" slide suggested).

Kamal Jain spoke about competitive analysis of online algorithms, an area of research that was very hot in the early 1990's, then went through a bit of a lull, but seems (judged from the number of papers published by year in Google scholar) to be coming back in. His application was in matching advertisers to ad slots, which can be modeled as an online matching problem in which one side of a bipartite graph (the advertisers) is known ahead of time, and the vertices on the other side (the ad slots) arrive online. His starting point was a paper by Karp, Vazirani, and Vazirani from 1990 that gets a deterministic competitive ratio of \( 2 \) and a randomized competitive ratio of \( 1-1/e \) to the maximum matching. The ad problem is a weighted rather than unweighted matching, but it's still possible to get a competitive ratio of \( 2 \) or \( 1-1/e \) (with the better ratio depending on an assumption that each advertiser has the budget for many individual ad placements). His algorithm has the interesting property that it has a tunable target competitive ratio that could be even better than \( 1-1/e \). If the algorithm were to be run on worst-case instances, it would be a bad idea to tune this parameter higher than \( 1-1/e \) (then it would simply fail to find a solution with that quality) but with some stochastic assumptions about the instance (for instance an assumption that there will be enough slots to consume 5% of each advertiser's budget) one can tune the parameter much higher. His system is also capable of replacing hard budget constraints with softer concave utility functions, which again leads to an improvement in the competitive ratio. There was some magic involving primal and dual nonconvex programs such that the ratio of their solutions is the competitive ratio and a duality between primal worst-case instances and dual differential equations describing the instantaneous behavior of the algorithm, but I didn't really understand that part. He told us that this research saved Microsoft a large amount of money by eliminating the need to buy a company that was solving a similar ad targeting problem in a less well-principled way that wasn't yielding as good results.

D. Sivakumar's talk concerned path properties in social networks. "Similarity networks" connect pairs of similar elements in a large system of elements, with a broad notion of what "similar" means that may be based on co-visitation or co-citation: essentially, given a bipartite graph (of authors and their research papers, for instance, or web pages with incoming links on one side and outgoing links on the other side, or people and societies they belong to), one can get a similarity network by keeping only one side of the bipartition and using paths of length two in the original bipartite graph as edges in the similarity graph. The first part of the talk centered on a preferential-attachment model for generating synthetic similarity networks and some theoretical results showing that these synthetic networks have the small world property. This part was based on a notion of "core" and "hub" nodes: the core nodes are the ones with high degree and the hub nodes are the ones within distance one of the core. In his model, the core is highly likely to have bounded diameter and the neighbors of the hub are highly likely to cover a large fraction of the graph. Sivakumar then shifted gears to discuss a repetition of Milgram's routing experiment on real co-authorship graphs: can randomly chosen pairs of actors in these graphs route messages to each other by only sending the messages along links to people they've co-authored with? They did this experiment using several different choices of routing algorithm, starting with a greedy routing algorithm in which the distance between two people is measured by their most similar pair of interests and messages must be sent along edges that decrease the distance to the target. They also included variations of the greedy routing algorithm that allow steps farther from the target, prefer high degree nodes, perform some amount of memory and lookahead, or expand the targets' interests to include the interests of the target's neighbors. The base greedy algorithm did quite poorly at finding paths; the variations on the base algorithm helped a lot, but even with all of their enhancements to it, it only worked 80% of the time.

Phil Gibbons argues that heterogeneity in cloud computing is a good thing. This may seen counterintuitive, but as he says, specialization (division of labor) is "fundamental to efficiency of societies" and the same factors apply here. If data centers aim for a mix of server architectures and algorithms are designed to exploit that, we can gain much greater efficiencies. The reason is that different problems lead to different patterns of usage for resources such as disk, memory, and cpu cycles. Three additional factors that push towards heterogeneity are the increasing scale of data centers (giving economies of scale), the move towards optimization of energy costs, and the need for data centers to provide coverage of multiple applications that may have been written for incompatible platforms. As an aside (while pushing for more algorithms research in this area) he called theory "asleep at the wheel" with respect to multi-core architectures: "it's the only discipline in computer science that's not thinking about parallelism".