Happy new year! As in previous years the growth of the cs.DS section of arXiv.org continues: there were 935 data structures and algorithms preprints this year, compared with 798 in 2011. We're not quite to the point of having 100 papers per month but July came close with 98.

Here is a personal selection of ten preprints from this list that I found particularly interesting, excluding (as usual) my own preprints as well as some others that I wrote about earlier. (Also excluded: papers like Orlin's faster max flow algorithm that haven't been submitted to the arXiv.) In chronological order, they are:

  • Sublinear time algorithm for PageRank computations, Christian Borgs, Michael Brautbar, Jennifer Chayes, and Shang-Hua Teng, arXiv:1202.2771 and WAW 2012. It's only slightly sublinear but it's interesting to me that we can prove such bounds, in a natural model of computation for web graph access.

  • Finding a most biased coin with fewest flips, Karthekeyan Chandrasekaran and Richard Karp, arXiv:1202.3639. The problem this solves seems very fundamental, and the result improves previous constant factor approximations, instead solving the problem optimally.

  • Strongly universal string hashing is fast, Owen Kaser and Daniel Lemire, arXiv:1202.4961. From the practical side of algorithmics, a method for hashing strings that uses less than one instruction per byte and nevertheless has guaranteed theoretical properties.

  • Known algorithms for EDGE CLIQUE COVER are probably optimal, Marek Cygan, Marcin Pilipczuk, and Michał Pilipczuk, arXiv:1203.1754 and SODA 2013. There is a simple FPT algorithm for covering the edges of a graph with a minimum number of cliques, which is however doubly exponential in the number of cliques. Apparently (see e.g. this stackexchange question) it's an old open question to do better than this. Last year it was shown that no better kernelization is possible, and now this preprint shows that double exponential is optimal regardless of whether it's achieved using a kernelization, assuming the exponential time hypothesis is true. Apparently it's the first known (if conditional) double exponential lower bound for FPT algorithms.

  • Approximability of the vertex cover problem in power law graphs, Mikael Gast and Mathias Hauptmann, arXiv:1204.0982. I think there should be more work on algorithms that take advantage of the properties of web graphs and social networks to compute properties of these graphs more quickly than they can be computed in the worst case, and this is an example: it gets a better approximation ratio for vertex cover than can be achieved in the worst case, for any graph whose degree distribution obeys a power law.

  • Clustering is difficult only when it does not matter, Amit Daniely, Nati Linial, and Michael Saks, arXiv:1205.4891. Like the previous paper, this represents a move from worst-case complexity towards something more instance-based. The main idea here is that the only hard instances for clustering problems (under traditional worst-case algorithms) are ones in which the input is not actually clustered very well. Their definition of a "good clustering" seems very sensitive to outliers or noisy data, but perhaps that can be a subject for future work.

  • A randomized parallel algorithm with run time \( O(n^2) \) for solving an \( n \times n \) system of linear equations, Joerg Fliege, arXiv:1209.3995. The real breakthrough here is Raghavendra's layered generate-and-test algorithm for linear equations over finite fields, described earlier this year on Richard Lipton's blog. This note shows how to do the same thing over the real numbers.

  • Faster deterministic fully-dynamic graph connectivity, Christian Wulff-Nilsen, arXiv:1209.5608 and SODA 2013. Ok, it's log shaving (actually loglog shaving) but on an important problem.

  • Eight-fifth approximation for TSP paths, András Sebö, arXiv:1209.3523 and IPCO 2013. I'd taught Christofides algorithm enough times to know to be careful about paths vs cycles in it, but somehow it had flown under the radar for me that the approximation ratio for paths was significantly larger.

  • Approximating \( k \)-median via pseudo-approximation, Shi Li and Ola Svensson, arXiv:1211.0243. Here the \( k \)-median problem is that of finding exactly \( k \) cluster centers minimizing the sum of distances of each point to its center, an approximation again finds exactly \( k \) centers but only approximately minimizes the sum, and a pseudo-approximation finds a number of centers that is only approximately \( k \). As the authors say, using even one extra center may make a big difference in the objective function, so it is a bit of a surprise that pseudo-approximation can be useful in approximation for this problem.



I don't understand the excitement over the paper on solving linear equations: it runs in \( O(n^2) \) parallel time, but requires \( O(n^3) \) work. This is worse than Gaussian elimination, which can naively be implemented to run in \( O(n\log n) \) parallel time with \( O(n^3) \) work. Moreover, the constants in the big-O's in GE are better.

  --Dan Spielman

With GPGPU spreading around algorithms it's important how GPGPU friendly parallel algo is. GPGPU friendliness depend on how tightly coupled components of solution are, the less the better. Generate and test is very GPGPU friendly, CUDA implementaion would be a lot more efficient than parallel G-S.

Regarding the linear equations paper: I think the algorithm can fail in ways not considered by the author, if some unfortunate choices of pairs are made in step (a) over several rounds.

For example, suppose that in round 1 one chooses the pairs (1, 2), (1, 3), and (2, 3). Then the resulting three vectors -- let's call them x1, x2, x3 -- will lie on a line.

Then x1, x2, x3 become v1, v2, v3 of the next round.

Then suppose that at the next round we choose the pairs (1, 2) and (1, 3). Then the resulting two vectors will be *equal*. This leads to a failure in the next round if this pair is chosen.

In fact, I don't think one needs any randomness in the choice of the pairs. I think one can just take the pairs (1,2), (1,3), (1,4), ... . This way at each step one has one vector less than before, as the author himself suggests at the end of the paper. I think the vectors will always remain affinely independent (with probability 1), so they will always affinely span the space of solutions to the equations considered so far.

I wrote the author about this.

-- Gabriel Nivasch

Here’s a motivational poster for log shaving I put together a while ago: https://blog.itu.dk/efficientcomputation/2009/12/26/motivation-for-2010/