TwitterScope gets best paper at GD
This year's Graph Drawing Symposium is coming up in less than two weeks, and has already announced the winners of the best paper awards in each of the two submission tracks, "combinatorial and algorithmic aspects" and "visualization systems and interfaces". The winner in the theory track was my paper on Lombardi drawing, but I already posted here about that, so instead I wanted to say some more about the other winner, "Visualizing Streaming Text Data with Dynamic Graphs and Maps" by Emden Gansner, Yifan Hu, and Stephen North. A preprint version of their paper is online at arXiv:1206.3980.
The paper describes the TwitterScope project, which provides visualization tools for high-volume streams of text data (e.g. from Twitter). Currently it exists as a publicly viewable prototype, set up to choose among nine preselected topics. Recent tweets are shown as small icons, grouped into colored regions (representing subtopics) within what looks like a map. Hovering the mouse over an icon shows the corresponding tweet. The display updates dynamically as new tweets come in, and has a timeline for viewing past tweets. My feeling from the description is that the work involved in putting this system together was less about coming up with new technical methods for visualization (although there is some of that, particularly in how they handle disconnected graphs) and more about selecting among many different ideas previously seen in this area and combining them into a single coherent and well-engineered system, which I guess is what that track should be all about.
To turn tweets into maps, the software uses a standard tool from text retrieval, cosine similarity (the angle between high-dimensional normalized word-frequency vectors), to estimate the similarity between tweets. Thresholding these similarity scores, to drop pairs of tweets that are too far apart, turns this information into a graph. The paper says that they also tried more sophisticated machine learning methods based on latent Dirichlet allocation (a Bayesian model in which topics are represented as overlapping probability distributions over words) but that the added complexity didn't produce better results.
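To make this concrete, here's a minimal sketch of that step in Python; the tokenizer, the 0.3 cutoff, and all of the function names here are my own illustrative choices rather than anything taken from the paper:

```python
import math
import re
from collections import Counter
from itertools import combinations

def word_vector(text):
    """Bag-of-words frequency vector for one tweet (hypothetical tokenizer)."""
    return Counter(re.findall(r"\w+", text.lower()))

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word-frequency vectors."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(x * x for x in a.values())) *
            math.sqrt(sum(x * x for x in b.values())))
    return dot / norm if norm else 0.0

def similarity_graph(tweets, threshold=0.3):
    """Keep an edge between each pair of tweets whose similarity clears
    the threshold (the 0.3 value is arbitrary, for illustration only)."""
    vectors = [word_vector(t) for t in tweets]
    return [(i, j, s)
            for (i, vi), (j, vj) in combinations(enumerate(vectors), 2)
            if (s := cosine_similarity(vi, vj)) >= threshold]
```

A real system at Twitter volumes would presumably use weighted term vectors and sparse-matrix machinery rather than a quadratic loop over pairs, but the geometry of the similarity computation is the same.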
The resulting graph (still weighted by its similarity scores) is then laid out using multidimensional scaling, augmented by a technique of the authors from Graph Drawing 2008 to reduce the node overlaps that would otherwise happen. They use modularity-based methods to partition the graph into clusters, and then apply the "GMap" approach from one of the co-authors in PacificVis 2010 to roughen the shapes of the cluster boundaries and choose colors, so that the drawing looks less like a graph and more like a cartographic map. The actual graph edges are either not displayed or displayed only very lightly; I'm not sure which.
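I haven't seen their implementation, but as a rough sketch of what this stage amounts to, here's a version built from off-the-shelf stand-ins (scikit-learn's MDS and NetworkX's greedy modularity clustering, not the paper's own algorithms); the function name and the 1 - similarity distance conversion are my assumptions:

```python
import numpy as np
from sklearn.manifold import MDS
from networkx.algorithms.community import greedy_modularity_communities

def layout_and_cluster(G):
    """Lay out a similarity-weighted NetworkX graph with metric MDS and
    partition it into clusters by modularity (stand-ins for the paper's
    own layout and clustering steps)."""
    nodes = list(G)
    index = {v: i for i, v in enumerate(nodes)}
    # Convert similarities in [0, 1] to dissimilarities; pairs with no
    # edge (dropped by the similarity threshold) get the maximum distance.
    dist = np.ones((len(nodes), len(nodes)))
    np.fill_diagonal(dist, 0.0)
    for u, v, w in G.edges(data="weight", default=1.0):
        dist[index[u], index[v]] = dist[index[v], index[u]] = 1.0 - w
    coords = MDS(n_components=2, dissimilarity="precomputed",
                 random_state=0).fit_transform(dist)
    positions = dict(zip(nodes, map(tuple, coords)))
    clusters = greedy_modularity_communities(G, weight="weight")
    return positions, clusters
```

The GMap step, turning the clusters into rough-edged colored countries, would then be applied on top of these positions and clusters.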
Of course, if you're reading an active Twitter stream, new tweets will be coming in all the time, and they don't want to drastically change the layout every time this happens. When their algorithm needs to incorporate new vertices, they seed the vertex placement algorithm with the old positions of the vertices, and then implement the multidimensional scaling part of the algorithm by local hill climbing. They found that this tends to keep the relative positions of the vertices fixed, but that their absolute positions may shift, so they use a Procrustes transformation to match the new layout back up with the earlier one, allowing only rotations and translations but not scaling, to prevent changes in vertex spacing. There's also a lot of additional complexity in this part of the system, and I think some theoretical innovation as well, in packing the clusters of a disconnected graph into a single drawing, and in keeping that packing both valid and close to its earlier positions after each change.
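The rotation-and-translation-only Procrustes alignment is a classical computation (essentially the Kabsch algorithm, solved by one singular value decomposition); here's a short sketch of how it might go, with the function name my own:

```python
import numpy as np

def rigid_align(new_pos, old_pos):
    """Align new_pos (an n x 2 array of layout coordinates) to old_pos by
    rotation and translation only, with no scaling, minimizing the total
    squared displacement between corresponding points."""
    # Center both point sets on their centroids.
    new_c = new_pos - new_pos.mean(axis=0)
    old_c = old_pos - old_pos.mean(axis=0)
    # Kabsch: the optimal rotation comes from the SVD of the cross-covariance.
    U, _, Vt = np.linalg.svd(new_c.T @ old_c)
    # Flip one singular direction if needed so that det(R) = +1,
    # giving a proper rotation rather than a reflection.
    d = np.sign(np.linalg.det(U @ Vt))
    R = U @ np.diag([1.0, d]) @ Vt
    # Rotate the centered new layout and move it onto the old centroid.
    return new_c @ R + old_pos.mean(axis=0)
```

Because no scaling factor is fitted, the spacing between vertices in the new layout is preserved exactly, which is presumably the point of the restriction the authors describe.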
Congratulations, Emden, Yifan, and Stephen!