[Groff] Re: The mess of charsets


From: Werner LEMBERG
Subject: [Groff] Re: The mess of charsets
Date: Tue, 13 Jun 2000 19:49:32 +0000 (GMT)

> And did someone look at the way plan9's troff works?

Indeed, checking this would be a good thing.

> It can be downloaded w/source at http://plan9.bell-labs.com/
> If someone can easily afford a 50 MB download...

Unfortunately, I can't :-(

> > I'll probably introduce a kind of `fonts.dir' file to register the
> > (font) encoding of the fonts for a particular device to have the
> > following chain:
> > 
> >   groff glyph -> lookup the device's fonts.dir
> >               -> search a font which covers the input encoding
> >               -> check font shape, slant, etc.
> >               -> if no results, check the special fonts
> > 
> > For example, having `devX100' and `latin2', let's assume that the
> > font for shape `Roman bold italic' is called TBI2.  In fonts.dir,
> > this would be mapped to `TBI', and everything works as expected.
> 
> Yes, it seems appropriate for X output because there the fonts
> actually have their charset embedded in their LFD name.

This is unimportant, I think, except for a script which adds fonts to
groff -- groff will see its own `fonts.dir'-like file and is not
dependent on any external representation.
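
As a rough sketch of the lookup chain quoted above -- the table
contents, the `fonts.dir' format, and the fallback list are all
illustrative assumptions, not actual groff data structures:

```python
# Hypothetical sketch of the proposed fonts.dir lookup chain.
# Each entry registers which physical font covers a given groff
# font name for a given input encoding (toy data only).

FONTS_DIR = {
    # (groff font name, encoding) -> physical font
    ("TBI", "latin2"): "TBI2",
    ("TBI", "latin1"): "TBI",
}

SPECIAL_FONTS = ["S", "ZD"]  # checked last, as in the quoted chain

def resolve_font(groff_font, input_encoding):
    """Find a physical font covering the requested encoding."""
    # 1. look up the device's fonts.dir
    physical = FONTS_DIR.get((groff_font, input_encoding))
    if physical is not None:
        return physical
    # 2. no direct result: fall back to the special fonts
    for special in SPECIAL_FONTS:
        physical = FONTS_DIR.get((special, input_encoding))
        if physical is not None:
            return physical
    return None

print(resolve_font("TBI", "latin2"))   # TBI2
```

So the `Roman bold italic' request for latin-2 transparently resolves
to `TBI2', and everything works as expected.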

> For html utf8 is more and more recognized by browsers but we should
> not rely on this added complexity.  And it's almost impossible to
> write a multi-encoding html file so we have to choose an encoding.

HTML files with multiple encodings are deprecated, AFAIK -- it is very
complicated to construct such a beast anyway.

But HTML is special since everything can be represented as an SGML
entity in HTML 4.0 -- theoretically, we can have a Unicode-encoded
document completely written in ASCII...

> Also for the ttys the situation is similar: we simply can't
> change the font; we have to say which charset we're using.

Thus my idea to first try the input encoding specified during a groff
call -- something like

  groff --input-encoding=latin-2 --device=tty ...

> Sure, we could just use utf8 output but I don't know of any easy way
> to recode it to any charset.  libc's iconv/recode/etc don't know
> what to do when one character is not in the current charset.

I don't plan to support encoding conversions.  There are a lot of
filters which can be applied before calling groff (or in case of ttys,
after calling groff).
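
Such a filter need not be elaborate; a purely illustrative Python
stand-in for something like iconv or recode could be:

```python
# Illustrative stand-in for an external recoding filter: convert
# the input from one charset to another before piping it into
# groff.  Characters missing from the target charset are replaced
# instead of aborting the whole run.

def recode(data: bytes, src: str, dst: str) -> bytes:
    text = data.decode(src)
    return text.encode(dst, errors="replace")

# latin-2 byte 0xA3 is Lstroke (U+0141); recoded to utf-8:
print(recode(b"\xa3", "iso-8859-2", "utf-8"))
```

The `errors="replace"' policy is exactly the point where libc's
iconv gives up, as noted above.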

> What we need is a way to convert utf8 to whatever charset conserving
> as much as possible.  Actually lynx has something similar.

Why should we do this?  With the new version of X, xterm has built-in
utf8 support.  Converting utf8 to anything else is doomed to fail in
most cases.

And be careful to distinguish between `input encoding' and `font
encoding'.  Even with Unicode as the input encoding, groff will still
use *glyph names* for output if necessary.

> Also for dvi output the situation is a bit different because there
> we are able to make our own chars, i.e. if I want a Y with a
> dieresis I can simply ask for {\"Y} (or something like that...)  but
> I don't know how to deal with non-latin encodings.

I think this is beyond the scope of groff.  The only chance to get
support for non-latin encodings IMHO is to support the extended DVI
format of Omega.

Especially for DVI I will probably add another switch to groff which
makes it possible to specify the font encoding:

  groff --input-encoding=latin-2 --font-encoding=T1 --device=dvi ...

Then we have the following mappings:

  input encoding -> glyph names -> font encoding

I've already implemented such things for my ttf2pk converter, so I
know quite well how to handle it...
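
The two-stage mapping could be pictured like this -- the table
contents are toy data for illustration, not groff's or ttf2pk's real
tables:

```python
# Illustrative two-stage mapping: input encoding -> glyph name
# -> slot in the font encoding (toy data only).

# stage 1: input character (latin-2 code point) -> glyph name
INPUT_TO_GLYPH = {
    0xB1: "aogonek",   # latin-2 0xB1 is a with ogonek
}

# stage 2: glyph name -> position in the T1 (Cork) font encoding
GLYPH_TO_T1 = {
    "aogonek": 0xA1,
}

def map_char(code):
    glyph = INPUT_TO_GLYPH[code]
    return glyph, GLYPH_TO_T1[glyph]

print(map_char(0xB1))   # ('aogonek', 0xA1)
```

The point is that the glyph name stays the stable middle layer: only
the two outer tables change when the input or font encoding does.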

> > Such an approach of course assumes that a particular font covers a
> > given input encoding, but anything else would be much more
> > complicated.
> 
> Sure. i18n is certainly the point where unix sucks the most :(

Windows is not much better -- only now, with Windows 2000, can you
e.g. write Thai in the German version without using special Thai
software.

> UTF-8 is theoretically the best solution but almost no one is able
> to process it with most devices.

groff can already use 16-bit fonts!  There is no 8-bit limit on the
font encoding.

UTF-8 as the input encoding will ultimately be mapped to output glyph
names.
A possible algorithm may be

  . Check whether we have a glyph named `U+xxxx' (or something
    similar) -- since, say, more than 90% of the characters in Unicode
    have a direct mapping to a single glyph, I believe it is a valid
    shortcut to directly use Unicode character codes in font
    encodings.

  . If not, consult a (not yet existing) mapping table which maps
    Unicode to glyph names.

It will become a bit more complicated for composed characters -- I
think we shall then follow the Adobe Glyph List (and its algorithm)
for constructing glyph names in case no mapping value is found.  Even
here, glyphs without an Adobe glyph name get their names constructed
from the Unicode values, e.g. `uni01B703020300' (a composite glyph
consisting of the three Unicode characters U+01B7, U+0302, and
U+0300).
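
The name construction for such composites is mechanical -- the
four-digit hex codes are concatenated after a `uni' prefix.  A sketch
of the scheme (not of actual groff code):

```python
def composite_glyph_name(codepoints):
    """Build an AGL-style `uniXXXXYYYY...' name from Unicode values."""
    return "uni" + "".join("%04X" % cp for cp in codepoints)

# U+01B7 + U+0302 + U+0300 -> the name quoted above
print(composite_glyph_name([0x01B7, 0x0302, 0x0300]))   # uni01B703020300
```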

> And the devices are also too heterogeneous: latin1 is the only
> device usable for other charsets because it leaves most of the
> characters alone, but all other devices assume latin1 input
> (especially -Thtml, which converts it to sgml entities)

This will change in the not-too-far future, I hope.


    Werner

