[Groff] What's missing for Unicode support of groff?
From: Bruno Haible
Subject: [Groff] What's missing for Unicode support of groff?
Date: Thu, 7 Jul 2005 13:36:36 +0200
User-agent: KMail/1.5.4
You probably shuddered and laughed when you saw the hacks contained in
groff-utf8.tar.gz, but it shows which areas need work before groff can
handle man pages in Japanese and Vietnamese by default.
1) Recognition of the input file encoding.
2) The font system and the utf8/html devices.
3) Rendering and the other devices.
1) Currently on a Linux system you find man pages in the following encodings:
- ISO-8859-1 (German, Spanish, French, Italian, Brazilian, ...),
- ISO-8859-2 (Hungarian, Polish, ...),
- KOI8-R (Russian),
- EUC-JP (Japanese),
- UTF-8 (Vietnamese),
- ISO-8859-7, ISO-8859-9, ISO-8859-15, ISO-8859-16 (man7/*),
and none of them contains an encoding marker.
The agreement was to recognize the encoding according to a note in the
first line
'\" -*- coding: EUC-JP -*-
groff will then emit errors when it is fed non-ASCII input that lacks a
coding: marker, so that man page maintainers are notified that they need
to add one.
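As a rough illustration of 1), the check could look like the C++ sketch
below. The names coding_from_first_line and has_non_ascii are hypothetical
and not existing groff code, and the parsing is deliberately simplified: it
only accepts a marker inside a roff comment on the first line.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding named in a first line such as
//   '\" -*- coding: EUC-JP -*-
// or an empty string if the line carries no such marker.
std::string coding_from_first_line(const std::string& line) {
    // Assumption: the marker must sit in a roff comment ('\" or .\").
    if (line.rfind("'\\\"", 0) != 0 && line.rfind(".\\\"", 0) != 0)
        return "";
    const std::string delim = "-*-";
    std::size_t start = line.find(delim);
    if (start == std::string::npos) return "";
    start += delim.size();
    std::size_t end = line.find(delim, start);
    if (end == std::string::npos) return "";
    const std::string body = line.substr(start, end - start);

    const std::string key = "coding:";
    std::size_t k = body.find(key);
    if (k == std::string::npos) return "";
    std::size_t b = body.find_first_not_of(" \t", k + key.size());
    if (b == std::string::npos) return "";
    std::size_t e = body.find_first_of(" \t;", b);
    return body.substr(b, e == std::string::npos ? std::string::npos : e - b);
}

// True if the text contains any byte outside 7-bit ASCII.
bool has_non_ascii(const std::string& text) {
    for (unsigned char c : text)
        if (c >= 0x80) return true;
    return false;
}
```

With these two pieces, the error case described above is simply: the first
line yields no encoding, yet has_non_ascii() is true for the file contents.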
2) The font system of groff was designed for devices where groff has to
map each character to a font. However, for the utf8 and html devices,
this is not the case: here groff has to skip this step. The current
font system has not been updated and is therefore in the way:
- Characters that are not mentioned in the "charset" section of
the font files for these devices are dropped from the output. This
is wrong.
- If the "charset" section of each font file contained on the order of
a million Unicode characters, the initialization time of 'troff' and
of the postprocessors would be prohibitively high.
IMO, the solution is to
- remove the "charset" section of the font files for utf8 and html,
- split the "font" C++ class into a class hierarchy:
    // abstract
    class font { /* ... */ };
    // useful for other devices, with "charset" section
    class concrete_font : public font { /* ... */ };
    // useful for utf8, html devices, without "charset" section;
    // determines the width of each character algorithmically
    class algorithmic_font : public font { /* ... */ };
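To make the proposed split more concrete, here is a minimal C++ sketch of
how such a hierarchy might look. It does not mirror groff's actual font
interface: the member functions contains() and width() are assumptions made
for illustration, widths are counted in character cells, and the ranges used
for zero-width and double-width characters are a small, incomplete sample
rather than the full Unicode data.

```cpp
#include <map>
#include <utility>

// Abstract base: what a device needs to know about a character.
class font {
public:
    virtual ~font() = default;
    virtual bool contains(char32_t cp) const = 0;
    virtual int width(char32_t cp) const = 0;  // in character cells
};

// Table-driven font: knows only the characters listed in its "charset".
class concrete_font : public font {
public:
    explicit concrete_font(std::map<char32_t, int> charset)
        : charset_(std::move(charset)) {}
    bool contains(char32_t cp) const override {
        return charset_.count(cp) != 0;
    }
    int width(char32_t cp) const override {
        auto it = charset_.find(cp);
        return it == charset_.end() ? 0 : it->second;
    }
private:
    std::map<char32_t, int> charset_;
};

// Font without a table: accepts every character and computes its width.
class algorithmic_font : public font {
public:
    bool contains(char32_t) const override { return true; }
    int width(char32_t cp) const override {
        // Combining Diacritical Marks occupy no cell of their own.
        if (cp >= 0x0300 && cp <= 0x036F) return 0;
        // A few double-width ranges: Hiragana/Katakana, CJK Unified
        // Ideographs, Hangul Syllables, Fullwidth Forms.
        if ((cp >= 0x3040 && cp <= 0x30FF) ||
            (cp >= 0x4E00 && cp <= 0x9FFF) ||
            (cp >= 0xAC00 && cp <= 0xD7A3) ||
            (cp >= 0xFF01 && cp <= 0xFF60))
            return 2;
        return 1;
    }
};
```

The point of the sketch is that algorithmic_font needs no per-character
table at all, so nothing is dropped from the output and there is nothing
large to load at startup.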
3) For devices such as DVI, PS, X100, implement rendering of composed
characters, of bidi languages (Hebrew, Arabic, Farsi), and of Indic
languages (with vowel reordering).
The obvious way to do this is to use GNOME's Pango.
I know that work has begun on 3). Since languages such as Chinese and
Russian ("Unicode level 1") need only 1) and 2), my priority would be
on 1) and 2); i.e., I volunteer to work on those.
Werner says:
> Something like this [Tomohiro Kubota's iconv preprocessor, i.e. 1)]
> should become part of groff as soon as it supports Unicode on the input
> side.
What else is needed to support Unicode on the input side?
Bruno