While playing with some other code today I created a new little OS X mini-application: LaTeX Unicodifier. It converts between unicode strings with accented characters (e.g., “Mihai Pătraşcu”) and LaTeX source code for producing those strings (e.g., {``}Mihai P{\u{a}}tra{\c{s}}cu{''}). So if you've ever seen a name with strange accents and wondered how to type it in LaTeX, or you're familiar with LaTeX markup and want to get a unicode version of the name to paste into your blog entries, this may be for you.
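As a rough sketch of what this looks like from Python (assuming the codec module registers itself under the codec name 'latex' when it is imported — the exact registration details are in the source):

# -*- coding: utf-8 -*-
import latex   # the codec module from inside the app; that importing it
               # registers the codec name 'latex' is my assumption here

print u"Mihai P\u0103tra\u015fcu".encode('latex')
# -> Mihai P{\u{a}}tra{\c{s}}cu
print "Mihai P{\\u{a}}tra{\\c{s}}cu".decode('latex')
# -> Mihai Pătraşcu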

Caveats: I've only tested this on one machine, and that not very thoroughly, so there are probably bugs. It won't run without a recent version of Python and PyObjC, which come preinstalled on version 10.5 of the OS but not on earlier versions.

If you dig into the application (applications on the Mac are really folders) you will find the source code, the bulk of which is this codec. I'd be especially interested to learn of gaps or errors in the codec. It would probably be easy for someone with a Python-based web server to use the same codec to produce a web page that does the same thing as the app for people without OS X, but since I wrote this primarily for myself I haven't bothered.
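For the curious, such a page could be about as simple as this CGI sketch (hypothetical and untested; it assumes the codec registers itself as 'latex' on import and that the form sends a UTF-8 'text' field):

#!/usr/bin/env python
# Hypothetical CGI front end for the codec.
import cgi
import latex

form = cgi.FieldStorage()
text = form.getfirst('text', '').decode('utf-8')
print "Content-Type: text/plain; charset=utf-8"
print
print text.encode('latex')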

Comments:

mcfnord:
2008-08-09T09:59:57Z
work it!
11011110:
2008-08-09T15:18:48Z
Yeah, I'm sure there are a lot more effective ways of advertising a new app than mentioning it only in my own LJ, but I don't want to put a lot of effort into that. It's not like it's going to get me any money or academic brownie points. But it seemed like the sort of thing that some of my other readers might find useful, so...
gareth_rees:
2008-08-09T11:52:15Z

A comment on coding style. The module would be easier to read, check and maintain if you used character names instead of code points in the latex_equivalents dictionary. For example, this section:

0x01c4: "{D\\v{Z}}",
0x01c5: "{D\\v{z}}",
0x01c6: "{d\\v{z}}",
0x01c7: "{LJ}",
0x01c8: "{Lj}",
0x01c9: "{lj}",
0x01ca: "{NJ}",
0x01cb: "{Nj}",
0x01cc: "{nj}",

could be rewritten like this:

ord(u'\N{LATIN CAPITAL LETTER DZ WITH CARON}'):                    '{D\\v{Z}}',
ord(u'\N{LATIN CAPITAL LETTER D WITH SMALL LETTER Z WITH CARON}'): '{D\\v{z}}',
ord(u'\N{LATIN SMALL LETTER DZ WITH CARON}'):                      '{d\\v{z}}',
ord(u'\N{LATIN CAPITAL LETTER LJ}'):                               '{LJ}',
ord(u'\N{LATIN CAPITAL LETTER L WITH SMALL LETTER J}'):            '{Lj}',
ord(u'\N{LATIN SMALL LETTER LJ}'):                                 '{lj}',
ord(u'\N{LATIN CAPITAL LETTER NJ}'):                               '{NJ}',
ord(u'\N{LATIN CAPITAL LETTER N WITH SMALL LETTER J}'):            '{Nj}',
ord(u'\N{LATIN SMALL LETTER NJ}'):                                 '{nj}',

But maybe you're worried about the computational cost of loading the latex.py module?

Also, is '\n' really the LaTeX equivalent of U+000a? I would have expected '{\\newline}'. Maybe this depends on the use cases you have in mind.

11011110:
2008-08-09T14:42:48Z

Thanks!

The main intended purpose of this codec was to be able to edit BibTeX files in Unicode and end up with a usable BibTeX file that TeX can still read afterwards (I used to have a program to do this, but it only handled MacRoman and stopped working after OS 9 died; I'm working on a replacement). But in that case it's reasonably common to have newlines in the longer BibTeX fields (such as a note or an abstract) and rare to want that to turn into an actual line break in the compiled results.
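In codec terms (a toy check of the behavior you asked about, again assuming the registered codec name 'latex'):

>>> u"one line of an abstract\nanother line".encode('latex')
'one line of an abstract\nanother line'

so the newline passes through unchanged rather than becoming a forced break.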

gareth_rees:
2008-08-09T15:04:54Z

You can automate the replacement of codepoints with names using the unicodedata module:

import latex
import unicodedata
for k, v in latex.latex_equivalents.items():
    # unicodedata.name() has no entry for control characters such as
    # U+000A, so supply a default and skip the unnamed code points.
    name = unicodedata.name(unichr(k), None)
    if name:
        print "   ord(u'\\N{%s}'): %s," % (name, repr(v))
11011110:
2008-08-09T15:16:56Z

Thanks, that makes it a lot more likely that I would do this change.

11011110:
2008-08-19T04:27:49Z

I suppose there's some deeply important historical/political/linguistic reason that the unicode people spell λ as lamda rather than as lambda. I edited this module again to add better Greek handling and decided to implement your change along with it, and that weirdness was what caused me the most trouble debugging it.
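Concretely, the name that works and the one I expected (interactive session; the exact error text may vary between Python versions):

>>> import unicodedata
>>> unicodedata.name(u'\u03bb')
'GREEK SMALL LETTER LAMDA'
>>> unicodedata.lookup('GREEK SMALL LETTER LAMBDA')
Traceback (most recent call last):
  ...
KeyError: "undefined character name 'GREEK SMALL LETTER LAMBDA'"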

Turns out that even with the change the module load time is not a problem.
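(For anyone who wants to check on their own machine, a crude measurement like this is enough; the numbers will of course vary:)

import time
t = time.time()
import latex
print "latex.py imported in %.3f seconds" % (time.time() - t)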

gareth_rees:
2008-08-19T17:42:50Z

The (monotonic) Greek portion of Unicode derives from the 8-bit ISO 8859-7 character set. ECMA-118 (1986) is the European version of this standard and you can see in that document the name LAMDA for λ (also KSI for ξ and KHI for χ). I guess these are Latinizations of the modern Greek names for the letters.

ALEF and BET are another pair of names that may be surprising to mathematicians.

None: Similar problem in KBibTeX
2008-08-11T07:33:19Z

Hello,
as the author of KBibTeX I had the same problem. Currently I'm using a lookup table, including regular expressions, and support the conversion in both directions. You may want to have a look at the source code (C++, not Python): http://svn.gna.org/viewcvs/kbibtex/trunk/src/libkbibtexio/encoderlatex.cpp?rev=19&view=markup
as the author of KBibTeX I had the same problem. Currently I'm using a lookup table including regular expressions and support the conversion in both directions. You may want to have a look on the source code (C++, not Python): http://svn.gna.org/viewcvs/kbibtex/trunk/src/libkbibtexio/encoderlatex.cpp?rev=19&view=markup