Although originally developed for managing large software systems, version control systems can be very helpful in writing scientific papers: they provide mechanisms for managing and tracking revisions to papers, allow multiple authors to work on the same files at once without having to take turns playing "who has the token" email games, provide easy mobility of content between different machines at home and in the office, and by replicating your content on multiple machines provide a form of backup making data losses less likely. And anyone who's recently written an NSF proposal knows that the NSF now requires a data management plan, and version control is an important part of any such plan.
Until now, I've been using CVS for version control for my single-author papers and most of my within-UCI collaborations, and SVN for some collaborations with other co-authors who have set up their own SVN repositories. But both CVS and SVN are getting old and creaky, so this weekend I started playing with Git instead.
In the context of academic writing, I think switching to Git will have some important advantages:
The distributed version control model means that it will be possible to work offline (e.g. in an airplane) and still have access to the whole version history, not just the latest version. And for the same reason the whole history is backed up by multiple replicated copies, not just the most recent version. The distributed model also solves the problem of "do we host this at my institution or yours": we can do both!
Software such as Gitolite should make it possible to manage co-authors from other institutions and give them access to shared master copies of Git repositories without having to create departmental logins for them and without having to deal with managing an Apache installation. And tools such as cvs2git should make it possible to transfer all our old history to the new system seamlessly.
The default setup for Git repositories doesn't have the cumbersome branches/tags/trunk subdirectory structure that SVN has, which I never found to be particularly useful. Instead tags are handled as separate first-class objects in Git. I hadn't been using tags much in CVS/SVN but I think in Git they should be a good way of tracking major events in the lifetime of a paper such as submission to a journal or uploading to a preprint server.
Git is being actively maintained by the open source community and is growing in popularity (e.g. see Wikimedia's move from SVN to Git) making it likely that it will continue to work well on whatever platforms I'm likely to use in the near to mid future.
I also looked at Mercurial and Bazaar, which are in many ways similar, but the greater popularity of Git was a winning factor for me.
Well, how do you do the diffs and the merges?
You mean in my old workflow with cvs? Checkin usually does a reasonable job of merging changes from more than one editor as long as they're working on different lines of the same files, and on the rare occasion when it gives up and forces you to do the merge by hand it's usually not too difficult. And there's a cvs diff command.
As for how to view diffs etc in git, there are similar commands, but I've also been looking at SourceTree for a graphical user interface which among other things can show the diffs for all checkins.
I meant source control in general: how do you do merges on something like a Word document rather than a plain text file? I've never done it, but it seems like it would be unpleasant.
Well, I don't write my papers in Word, I write them in LaTeX. So just as in computer programming the source code is just a text file.
Figures are more of a problem but they don't tend to change much. And the compiled output of the LaTeX source code (a PDF file) is binary and does change frequently, but I usually leave it out of the version control because of that.
I wanna learn Mercurial. It doesn't have a central repository. It's magical!
Yeah, it's like git, but simpler.
I’ve been using Git for quite a while myself. Conceptually, it’s an amazing system. It baffles me how they could mess up the documentation and command line interface so much. Using Git pleases me as a computer science person and annoys me as an educator.
It took me a while to understand how to keep revision identifiers visible in my LaTeX documents: http://thorehusfeldt.net/2011/05/13/including-git-revision-identifiers-in-latex/ This is useful for identifying which revision of your source code produced the PDF you’re looking at. (Or, more realistically, the PDF your co-author or your students are looking at.) “Oh, you mean exercise 5b in revision 5ac25…? No, that doesn’t work, it’s actually NP-complete. Sorry. You need to redownload the exercises from the course web page.” or “I think the reviewer is looking at 34af35…. That contains the version of lemma 5 where Bob used Foo’s reduction in the wrong direction.”
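For readers who want the general idea without following the link, here is a minimal sketch of one way to do it, and not necessarily the approach the linked post describes: ask Git for the short hash of the current commit and write it into a small LaTeX file as a macro. The file name revision.tex, the macro name \gitrevision, and the function names are all hypothetical choices made for this illustration.

```python
import subprocess


def current_revision(repo_dir: str = ".") -> str:
    """Return the abbreviated hash of the commit checked out in repo_dir."""
    result = subprocess.run(
        ["git", "rev-parse", "--short", "HEAD"],
        cwd=repo_dir,
        capture_output=True,
        text=True,
        check=True,  # raise if git fails, e.g. outside a repository
    )
    return result.stdout.strip()


def revision_macro(revision: str, macro: str = "gitrevision") -> str:
    """Build a LaTeX line defining \\<macro> as the given revision string.

    The macro name is a hypothetical choice, not a standard one.
    """
    return f"\\newcommand{{\\{macro}}}{{{revision}}}\n"


def write_revision_file(path: str = "revision.tex", repo_dir: str = ".") -> None:
    """Write the macro definition to a file that the paper can \\input."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(revision_macro(current_revision(repo_dir)))
```

Under these assumptions you would call write_revision_file() before each LaTeX run, put \input{revision} in the preamble, and print \gitrevision in a footer or on the title page. Since revision.tex is regenerated on every build, it would normally be left out of version control.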
An invaluable emacs hack is at thingsthatpassforknowledge.wordpress.com/2011/10/08/emacs-prettifies-plain-text-files-for-version-control/: Get auto-fill-mode to line break at sentence end. This way, minor edits affect only the current line, which increases the usefulness of Git’s diff (and other version control systems).
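To illustrate why this helps outside of emacs, here is a small Python sketch (not the emacs hack itself) that reflows a paragraph to one sentence per line and counts how many lines a line-based diff would report as changed. The sentence splitting is deliberately naive and the function names are made up for this example.

```python
import difflib
import re


def one_sentence_per_line(paragraph: str) -> str:
    """Reflow a paragraph so that each sentence sits on its own line.

    Naive rule: a sentence ends at '.', '!' or '?' followed by whitespace,
    so abbreviations such as "e.g." will be split in the wrong place.
    """
    text = " ".join(paragraph.split())  # collapse existing line breaks
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return "\n".join(sentences)


def changed_lines(old: str, new: str) -> int:
    """Count the lines a line-based diff marks as removed or added."""
    delta = difflib.ndiff(old.splitlines(), new.splitlines())
    return sum(1 for line in delta if line.startswith(("- ", "+ ")))


before = one_sentence_per_line(
    "Git is distributed. It works offline. Tags mark submissions."
)
after = one_sentence_per_line(
    "Git is distributed. It works well offline. Tags mark submissions."
)
# Only the edited sentence appears in the diff: one removal, one addition.
print(changed_lines(before, after))
```

With paragraphs filled to a fixed width, the same one-word edit can push words onto the following lines and make the diff cover text that did not really change.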
And, whimsically, for beginning Git users: Git achievements is a tiny wrapper that provides encouraging feedback when you start learning Git, levelling you up like in a video game.
Thanks for all the information and pointers. I mostly use TeXShop for my paper editing these days, rather than emacs — as well as having a user interface that's more native to the OS X look and feel, it has automatic synchronization between source and preview allowing me to click on one and get sent to the same position in the other. But it also ends up meaning that most paragraphs are formatted as a single line of text. Still, that means that re-filling doesn't cause spurious changes.
I don't know whether you noticed, but I blogged on my own platform choices for collaborations just the other day: http://blog.mikael.johanssons.org/archive/2012/02/collaborative-tools/
No, I missed that one — thanks for the link.