I recently covered strong connectivity analysis in my graph algorithms class, so I've been playing today with applying it to the link structure of (small subsets of) Wikipedia.

For instance, here's one of the strong components among the articles linked from Hans Freudenthal (a mathematician of widely varied interests): Algebraic topology, Freudenthal suspension theorem, George W. Whitehead, Heinz Hopf, Homotopy group, Homotopy groups of spheres, Humboldt University of Berlin, Luitzen Egbertus Jan Brouwer, Stable homotopy theory, Suspension (topology), University of Amsterdam, Utrecht University. Mostly this makes sense, but I'm not quite sure how the three universities got in there. Maybe from their famous faculty members?

It's limited to relatively small graphs by the time it takes to grab the data from the Wikipedia servers, but the code is surprisingly easy. Given a set of articles, here's all it takes to build the graph:

def blueLinksFrom(article):
    """Which Wikipedia articles are linked from the given article?"""
    api = "http://en.wikipedia.org/w/api.php?action=parse&page=" + \
          urllib.quote(article.encode('utf-8')) +\
    linkdata = json.loads(urllib2.urlopen(api).read())
    linkarray = linkdata['parse']['links']
    return {L['*'] for L in linkarray if 'exists' in L and ':' not in L['*']}

graph = {v:blueLinksFrom(v)&articles for v in articles}

Vaguely relatedly (well, it's about graphs and Wikipedia, if nothing else) I discovered two things about Halin graphs today:

  1. Like many things in mathematics, they're named after the wrong person. Halin graphs (or at least, 3-regular Halin graphs) actually go back to the work of Thomas Kirkman, over 100 years before Halin.

  2. For about the past 3 1/2 years, Wikipedia may have been using the wrong Halin reference, copied from MathWorld. I can't tell for sure because I don't read German, but the 1964 paper it was citing (and that MathWorld still cites) seems to be about other topics. More reliable publications in this area instead cite a different paper by Halin, published in 1971.



I'm likely not about to say something you haven't already thought of, but you can work around the network limitation using the full XML dump of "latest pages" from wikipedia (http://en.wikipedia.org/wiki/Wikipedia:Database_download). About 9 GB to download initially in bzip2 format. Not a great deal more code in Python, although you have to use streaming XML processing and it's more practical to pre-process and write out the graph (page titles and links) to disk and then analyse the structure.


That might be interesting to try too. I was thinking, though, that getting small subsets that are very fresh would be a better fit for what I was interested in using this for: working on an article or a narrow category and finding subsets of its links that should be connected to each other but aren't.


Yeah, that makes sense. I haven't found a practical way to incrementally update a download like that, so I only do the full download once or twice a year. Probably not good enough for contemporary editing, but a fun dataset for trying out algorithms.


It doesn't surprise me that Kirkman was ignored here. He was basically a crazy priest off in an isolated church in the north of England with nothing to keep him company but his combinatorics, his idiosyncratic but internally consistent nomenclature ("Every polyedron P, not a pyramid, has either a convanescible or an evanescible edge."), and his Aspberger's syndrome. Reading Kirkman's papers is like reading the phonebook, or the execution trace of an automatic theorem prover. Cayley tried to edit his papers into human-readability, but eventually just gave up (leading to the "falling out" referenced in Wikipedia). After that, Kirkman published his stuff in the equivalent of the problems section of the Monthly. He'd be forgotten entirely if he hadn't actually come up with some groundbreaking results.

As authorities and analogy are alike divided about the spelling of the word polyedron, I have pleased myself herein. Why polyhedron of necessity, and yet not perihodic?
— Thomas P. Kirkman. On the Representation of Polyedra. Phil. Trans. Royal Soc. London 146:413–418, 1856.