
From: Gael Queri
Subject: The mess of charsets (was Re: [Groff] [PATCH] compiling error on grolbp with latest CVS update)
Date: Tue, 13 Jun 2000 17:33:33 +0200
User-agent: Mutt/1.1.11i

On Sun, Jun 11, 2000 at 06:21:13AM +0000, Werner LEMBERG wrote:
> > PS: what do you think about my latin9 patch? is 8-bit char encoding
> > considered obsolete?
> Your patch is OK, but I don't know how to proceed.  Creating more and
> more directories for the various encodings + devices isn't a valid
> solution, so I will postpone your changes.

Well, sure; we should really separate that...

> A guy is trying to make Unicode input possible (BTW, any results
> meanwhile?), and I will try to remove input encoding stuff from the
> font description files.

And did anyone look at the way Plan 9's troff works? Maybe we
could find some ideas there, as that system uses 16-bit characters
everywhere. It can be downloaded with source,
if someone can easily afford a 50 MB download...

> I'll probably introduce a kind of `fonts.dir' file to register the
> (font) encoding of the fonts for a particular device to have the
> following chain:
>   groff glyph -> lookup the device's fonts.dir
>               -> search a font which covers the input encoding
>               -> check font shape, slant, etc.
>               -> if no results, check the special fonts
> For example, having `devX100' and `latin2', let's assume that the font
> for shape `Roman bold italic' is called TBI2.  In fonts.dir, this
> would be mapped to `TBI', and everything works as expected.

Yes, that seems appropriate for X output, because there the fonts
actually have their charset embedded in their XLFD name.
Unicode fonts are also starting to appear, so we could
easily fall back to them. (But these are not the typical fonts
used by groff, so the output may look ugly :( )
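The lookup chain quoted above could be sketched roughly like this. (The `fonts.dir` entry format and all names here are my invention for illustration, not actual groff code.)

```python
# Hypothetical sketch of the proposed fonts.dir lookup chain:
# map (base font name, encoding) -> actual device font, falling back
# to the special fonts when no encoding-specific face exists.

# fonts.dir: the latin2 face for `Roman bold italic' is TBI2,
# registered under the generic name TBI.
FONTS_DIR = {
    ("TBI", "latin2"): "TBI2",
    ("TBI", "latin1"): "TBI",
}

SPECIAL_FONTS = ["S", "SS"]  # fallback special fonts

def resolve_font(base_name, encoding, coverage):
    """Return the device font covering `encoding' for `base_name',
    or a special font that covers it, or None."""
    font = FONTS_DIR.get((base_name, encoding))
    if font is not None:
        return font
    # no encoding-specific face registered: try the special fonts
    for special in SPECIAL_FONTS:
        if encoding in coverage.get(special, ()):
            return special
    return None

# Requesting the bold-italic face for latin2 input yields TBI2.
print(resolve_font("TBI", "latin2", {}))  # -> TBI2
```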

For PostScript I suppose the situation is similar.

For HTML, UTF-8 is recognized by more and more browsers, but we
should not rely on that added complexity. And since it's almost
impossible to write a multi-encoding HTML file, we have to choose
a single encoding.
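One way to stay within a single declared encoding is what -Thtml already hints at: emit anything the charset cannot represent as a numeric character reference. A minimal sketch (my own, not groff's actual converter):

```python
def to_html(text, charset="us-ascii"):
    """Emit HTML text restricted to one declared charset; any character
    the charset cannot represent becomes a numeric character reference,
    so the file never has to mix encodings."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)       # representable as-is?
            out.append(ch)
        except UnicodeEncodeError:
            out.append("&#%d;" % ord(ch))  # fall back to an entity
    return "".join(out)

print(to_html("Gaël"))  # -> Ga&#235;l
```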

The situation for ttys is similar: we simply can't change the
font, so we have to say which charset we're using.
Sure, we could just emit UTF-8 output, but I don't know of any easy
way to recode it to an arbitrary charset: libc's iconv, recode, etc.
don't know what to do when a character is not representable in the
target charset.

What we need is a way to convert UTF-8 to an arbitrary charset while
preserving as much as possible. lynx actually has something similar.
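A crude version of that "preserve as much as possible" conversion can be done by decomposing unrepresentable characters and dropping their combining marks, so an accented letter degrades to its base letter rather than aborting the whole conversion. This sketch is mine, not lynx's algorithm:

```python
import unicodedata

def recode_lossy(text, charset):
    """Encode text in the target charset, degrading gracefully:
    unrepresentable characters are NFKD-decomposed and stripped of
    combining marks; anything still unrepresentable becomes '?'."""
    out = []
    for ch in text:
        try:
            ch.encode(charset)
            out.append(ch)
        except UnicodeEncodeError:
            # e.g. 'ë' decomposes to 'e' + combining dieresis -> keep 'e'
            base = "".join(c for c in unicodedata.normalize("NFKD", ch)
                           if not unicodedata.combining(c))
            try:
                base.encode(charset)
                out.append(base)
            except UnicodeEncodeError:
                out.append("?")  # truly unrepresentable
    return "".join(out).encode(charset)

print(recode_lossy("Gaël", "ascii"))  # -> b'Gael'
```

(GNU iconv later grew a `//TRANSLIT` suffix for exactly this kind of approximate conversion.)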

For DVI output the situation is a bit different, because there we
can compose our own characters: if I want a Y with a
dieresis I can simply ask for {\"Y} (or something like that...).
But I don't know how to deal with non-Latin encodings.
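That composition could amount to a small table mapping accents to TeX-style macros; a toy sketch (the table and function are hypothetical, not what any DVI driver actually does):

```python
# Sketch of composing accented glyphs for a device that can build
# characters itself: map an accent name to a TeX-style accent macro
# instead of requiring a precomposed glyph in the font.
ACCENT_MACROS = {
    "dieresis": '\\"',
    "acute":    "\\'",
    "grave":    "\\`",
}

def compose(base, accent):
    """Return a TeX-style composed character, e.g. {\"Y}."""
    macro = ACCENT_MACROS.get(accent)
    if macro is None:
        raise ValueError("no macro for accent %r" % accent)
    return "{%s%s}" % (macro, base)

print(compose("Y", "dieresis"))  # -> {\"Y}
```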

> Such an approach of course assumes that a particular font covers a
> given input encoding, but anything else would be much more
> complicated.

Sure. i18n is certainly the point where unix sucks the most :(
UTF-8 is theoretically the best solution, but hardly anything can
process it on most devices.

And there is also the problem that most input files
don't have any encoding specified, so we should really add
a way to specify the input encoding.
The devices are also too heterogeneous: latin1 is the only device
usable for other charsets, because it leaves most of the characters
alone; all the other devices assume latin1 input (especially -Thtml,
which converts it to SGML entities).
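If the input encoding could be declared explicitly, a preprocessor could recode everything to one internal representation before the devices ever see it. The `.encoding` request below does not exist in troff; it is purely a hypothetical illustration:

```python
import re

# Hypothetical ".encoding" request naming the charset of a source file.
# If present on the first line, the rest of the file is decoded with it;
# otherwise we fall back to latin1, today's de-facto assumption.
ENCODING_REQ = re.compile(rb'^\.encoding\s+(\S+)')

def read_source(raw):
    """Decode a raw input file to one internal (Unicode) representation."""
    first_line, _, rest = raw.partition(b"\n")
    m = ENCODING_REQ.match(first_line)
    if m:
        charset = m.group(1).decode("ascii")
        return rest.decode(charset)
    return raw.decode("latin-1")  # no declaration: assume latin1

# 0xB1 is a-ogonek in iso-8859-2 (latin2):
doc = b".encoding iso-8859-2\n\xb1 is a-ogonek in latin2\n"
print(read_source(doc))
```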

Hope I'm not too unclear :(

        Regards, Gaël
