[Groff] Re: groff: radical re-implementation


From: Tomohiro KUBOTA
Subject: [Groff] Re: groff: radical re-implementation
Date: Tue, 17 Oct 2000 10:27:12 +0900
User-agent: Wanderlust/1.0.3 (Notorious) SEMI/1.12.1 ([JR] Nonoichi) FLIM/1.12.7 (Yūzaki) Emacs/20.7 (i386-debian-linux-gnu) MULE/4.1 (AOI)

Hi,

At Mon, 16 Oct 2000 16:41:35 +0200 (CEST),
Werner LEMBERG <address@hidden> wrote:

> [I'm CC'ing this mail to the groff@ mailing list.  May I ask to move
> the discussion about improvements/changes of groff to this list?]

OK, I have joined the address@hidden mailing list.  I am also sending
this message to the debian-i18n list to let people there know that I
agreed to move the discussion.


>> The ideal implementation will be using 'wchar_t' for reading.
> But this will fail for some compilers...

wchar_t is now supported by many systems.  It is mandatory for
internationalization.

The merit of wchar_t is this:  write once and it works for every
encoding, including UTF-8.  Otherwise, you have to write similar
source code many times over, for Latin-1, EBCDIC, UTF-8, and so on.
In particular, I insist that Groff should support the EUC-* multibyte
encodings for CJK languages.  The current Groff cannot handle these
at all.  (CJK people also use ISO-2022-* encodings.)

The other merit of wchar_t is user-friendliness.  Once a user sets
the LANG variable, every program works in the specified encoding.
Without this, you have to specify the encoding separately for every
program.  We don't want to have ~/.groffrc, ~/.greprc, ~/.bashrc,
~/.xtermrc, and so on, each specifying 'encoding=ISO8859-1' or
'encoding=UTF-8'.
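
A minimal sketch of the "write once" idea, assuming a C99 system with
mbrtowc().  The helper name decode_locale_string is hypothetical and is
not part of groff.  The loop itself never names an encoding; the bytes
are interpreted according to whatever LC_CTYPE locale is in effect.

```c
#include <assert.h>
#include <string.h>
#include <wchar.h>

/* Decode a multibyte string, in the encoding of the current LC_CTYPE
   locale, into wide characters.  Returns the number of wide characters
   written (at most cap), or (size_t)-1 on an invalid or truncated
   sequence.  Hypothetical helper for illustration only. */
size_t decode_locale_string(const char *src, wchar_t *dst, size_t cap)
{
    assert(src != NULL && dst != NULL);

    mbstate_t st;
    memset(&st, 0, sizeof st);      /* start in the initial shift state */

    size_t left = strlen(src);
    size_t n = 0;
    while (left > 0 && n < cap) {
        wchar_t wc;
        size_t used = mbrtowc(&wc, src, left, &st);
        if (used == (size_t)-1 || used == (size_t)-2)
            return (size_t)-1;      /* invalid or incomplete sequence */
        if (used == 0)
            break;                  /* decoded a NUL: end of text */
        dst[n++] = wc;
        src += used;
        left -= used;
    }
    return n;
}
```

A program would call setlocale(LC_CTYPE, "") once at startup; after
that, the same function reads Latin-1, EUC-JP, or UTF-8 input depending
on LANG, provided the system has a locale for that encoding.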



>> Abolish device types of 'ascii', 'ascii8', 'latin1', 'nippon', and
>> 'utf8' and introduce a new device type such as 'tty'.

I suppose you don't know about the 'ascii8' device.  It is a local
patch in Debian's Groff that is 8-bit clean (like latin1) but does
not assume that the 8-bit part is in the latin1 encoding.  For
example, '-' is used for hyphenation and '\(co' is converted into
'(C)'.  It is meant for 8-bit encodings other than latin1, i.e.,
ISO8859-2,3,.., and KOI8-R.  (It is not for CJK multibyte languages.)



> Please bear in mind that groff shall work on non-GNU systems also!  My
> idea is to only accept UTF8, ascii, latin1, and ebcdic as input
> encodings (the latter three for historical reasons only).

I wrote about glibc because that message was addressed to a Debian
mailing list.  Of course I am thinking of portability too.  wchar_t
is portable.  I recommend implementing wchar_t support as the new
architecture and keeping ascii, latin-1, and ebcdic as historical
encodings.  (We may add 'UTF8' as a historical one.)

I think what is 'historical' is rather the systems that don't support
wchar_t.



> Maybe on systems with a recent glibc, iconv() and friends can be used
> to do more, but generally I prefer an iconv-preprocessor so that groff
> itself has not to deal with encoding conversions.

I think this would work well.  However, who invokes the iconv
preprocessor?  The user, or a wrapper program?  And what determines
the command-line options passed to iconv?



>> - Groff assumes the input as the encoding of current locale.
> This is probably not correctly set everywhere.

How can a user configure a system, other than through the locale?
A user who wants to specify his/her language and encoding will set
the LANG variable.  Or should we have a separate ~/.foobarrc for
every program, or specify --encoding=foobar every time (s)he invokes
a program?  I think setting LANG is a reasonable way.
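
As an illustration of taking the encoding from the locale, here is a
small sketch assuming a POSIX system that provides nl_langinfo().  The
function name input_encoding and the idea of passing in a command-line
override are assumptions made for this example, not existing groff
code.

```c
#include <assert.h>
#include <langinfo.h>
#include <locale.h>
#include <stddef.h>

/* Return the name of the encoding to assume for input.  An explicit
   override (for example, from a command-line option) wins; otherwise
   the LC_CTYPE locale is asked.  Hypothetical helper. */
const char *input_encoding(const char *override)
{
    if (override != NULL && override[0] != '\0')
        return override;

    /* Adopt LC_CTYPE from the environment (LC_ALL, LC_CTYPE, LANG).
       If this fails, the program simply stays in the "C" locale. */
    setlocale(LC_CTYPE, "");

    const char *codeset = nl_langinfo(CODESET);
    assert(codeset != NULL);
    return codeset;
}
```

The returned name depends on the system and on how LANG is set, so the
caller should treat it as an opaque string to hand to a converter
rather than compare it against a fixed list of spellings.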


One compromise would be:
 - to use UCS-4 for internal processing, not wchar_t.
 - only a small part of the input and output code to be
   encoding-sensitive.
 - command options for the encodings of input and output to be added.
 - a compile-time option I18N to be introduced.
 - when I18N is off, default input is latin-1 and default output
   is also latin-1.
 - when I18N is on, default input and default output follow the
   LC_CTYPE locale.
 - Of course these default encodings can be overridden by command
   options.
 - Groff can be compiled with I18N off for systems without
   internationalization functions such as setlocale().
 - iconv(3) to be used for converting between input/output encodings
   and the internal UCS-4 encoding, if available (I18N=true).
 - if I18N is false, the conversion process to be hard-coded for
   Latin-1, EBCDIC, and UTF-8.
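
The hard-coded path for the I18N-off case could look roughly like the
sketch below, which converts Latin-1 and UTF-8 input into UCS-4 without
using any locale or iconv functions.  The function names are
hypothetical, EBCDIC is left out for brevity, and the UTF-8 decoder
assumes the modern definition limited to four bytes and U+10FFFF.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Latin-1 maps byte-for-byte onto the first 256 UCS code points.
   dst must have room for len entries.  Returns the number written. */
size_t latin1_to_ucs4(const unsigned char *src, size_t len, uint32_t *dst)
{
    assert(src != NULL && dst != NULL);
    for (size_t i = 0; i < len; i++)
        dst[i] = src[i];
    return len;
}

/* Decode UTF-8 into UCS-4.  dst must have room for len entries.
   Returns the number of code points written, or (size_t)-1 on
   malformed input (bad lead or continuation byte, truncation,
   overlong form, surrogate, or value above U+10FFFF). */
size_t utf8_to_ucs4(const unsigned char *src, size_t len, uint32_t *dst)
{
    /* Smallest code point allowed for each count of continuation bytes. */
    static const uint32_t min_cp[4] = { 0, 0x80, 0x800, 0x10000 };

    assert(src != NULL && dst != NULL);
    size_t n = 0, i = 0;
    while (i < len) {
        unsigned char b = src[i];
        uint32_t cp;
        size_t extra;               /* continuation bytes expected */

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;     /* stray continuation or bad lead */

        if (extra > len - i - 1)
            return (size_t)-1;      /* sequence runs past end of input */

        for (size_t k = 1; k <= extra; k++) {
            unsigned char c = src[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;  /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (cp < min_cp[extra] || cp > 0x10FFFF ||
            (cp >= 0xD800 && cp <= 0xDFFF))
            return (size_t)-1;      /* overlong, out of range, surrogate */

        dst[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

With I18N on, the same UCS-4 output would instead come from iconv(3),
so the rest of the program sees one internal representation either way.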

Do you think this can be achieved?

---
Tomohiro KUBOTA <address@hidden>
http://surfchem0.riken.go.jp/~kubota/
