[Groff] Re: Unicode, EBCDIC, Latin-2, JIS for groff


From: Eric Fischer
Subject: [Groff] Re: Unicode, EBCDIC, Latin-2, JIS for groff
Date: Fri, 10 Mar 2000 14:40:58 -0600 (CST)

> Question: How far is the project of Unicode input?

Here's what I've done so far:

  *  The file iterator recognizes valid UTF-8 patterns in the input,
     and when they are encountered they get transmuted into \U'number'.
     Latin-1 characters (which is to say, eight-bit characters that are
     not part of a legal UTF-8 sequence) are also temporarily translated
     into \U sequences; ASCII characters are passed through unchanged.

     Accepting Latin-2 or whatever based on a command line option would
     be easy to add; accepting EBCDIC would also be easy if everyone could
     agree on which EBCDIC characters should map to which Unicode characters.

  *  The tokenization routine recognizes \U and converts anything outside
     the range 0x00 to 0xFF into \[char0xNNNN] or \[char0xNNNNNNNN] as
     appropriate.

     This makes non-Latin1 characters second-class citizens (they can't be
     used in the names of macros, etc.), but I was intimidated by the task
     of finding every place in the program that depends on characters being
     at most eight bits wide.

  *  An extension to the ligature mechanism joins Unicode combining accents
     to their base characters as a single character whenever possible.

  *  I've been working on more general support for accents (for the cases
     where there isn't a single Unicode character that represents the
     accented letter, or where a character has multiple accents), but this
     doesn't work very well yet.

  *  I haven't done anything with right-to-left or reordered characters.
     As I understand it, Plan 9 troff doesn't support these (or combining
     accents) either.
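
To make the first three items concrete, here is a rough sketch in Python
(not groff's own C++, and not the actual patch).  The function names are
made up for illustration, the decimal number inside \U'...' is an
assumption (the format of the number isn't specified above), and turning
values up to 0xFF back into plain eight-bit characters is a guess at
what "temporarily" means.  The accent step uses Unicode NFC
normalization as a stand-in for the ligature-based mechanism.

```python
import re
import unicodedata


def to_u_escapes(raw: bytes) -> str:
    """ASCII passes through; valid UTF-8 sequences and stray eight-bit
    bytes (read as Latin-1) become \\U'number' (decimal, by assumption)."""
    out = []
    i = 0
    while i < len(raw):
        b = raw[i]
        if b < 0x80:
            out.append(chr(b))
            i += 1
            continue
        # Sequence length implied by the lead byte; 0 means "not a lead byte".
        if 0xC2 <= b <= 0xDF:
            n = 2
        elif 0xE0 <= b <= 0xEF:
            n = 3
        elif 0xF0 <= b <= 0xF4:
            n = 4
        else:
            n = 0
        cp = None
        if n and i + n <= len(raw):
            try:
                cp = ord(raw[i:i + n].decode("utf-8"))
            except UnicodeDecodeError:
                cp = None
        if cp is None:
            # Not part of a legal UTF-8 sequence: treat the byte as Latin-1.
            cp, n = b, 1
        out.append("\\U'%d'" % cp)
        i += n
    return "".join(out)


_U_ESCAPE = re.compile(r"\\U'(\d+)'")


def tokenize(text: str) -> str:
    """Rewrite \\U escapes: values above 0xFF become \\[char0xNNNN] or
    \\[char0xNNNNNNNN]; smaller values go back to a single character
    (an assumption about the eight-bit case)."""
    def repl(m):
        cp = int(m.group(1))
        if cp <= 0xFF:
            return chr(cp)
        if cp <= 0xFFFF:
            return "\\[char0x%04X]" % cp
        return "\\[char0x%08X]" % cp
    return _U_ESCAPE.sub(repl, text)


def compose(base_cp: int, accent_cp: int):
    """Return the single precomposed code point for base + combining
    accent, or None if Unicode does not define one."""
    s = unicodedata.normalize("NFC", chr(base_cp) + chr(accent_cp))
    return ord(s) if len(s) == 1 else None
```

With these definitions, tokenize(to_u_escapes(data)) takes raw input
bytes to the form described above, and compose(0x65, 0x301) gives 0xE9
(e with acute) while compose(0x71, 0x301) gives None, which is the case
the more general accent support would have to handle.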

> Additionally, I suggest to use UTF8 exclusively as the external
> encoding representation if, say, the command line option `-u' is used.

If you want the *output* to be UTF-8 as well as the input, this is also
going to require changes to all the postprocessors.  It is what Plan 9
troff does, though.

eric

