[Groff] UTF-8 out-of-the box experience


From: Markus Kuhn
Subject: [Groff] UTF-8 out-of-the box experience
Date: Thu, 03 May 2001 09:47:53 +0100

Our department upgraded machines yesterday to the brand new Red Hat 7.1
release. Here are a few impressions I collected while demonstrating the
UTF-8 capabilities to my colleagues. UTF-8 locales are available now and

  % LANG=en_GB.UTF-8 xterm &

is all that is needed to enter the Unicode world.

I had to unset LESSCHARSET in some people's environment. It is obsolete
now, and if it is set, it just prevents less from autodetecting that
UTF-8 should be activated. The BUGS section of "man man" contains the
tip "If  you  see blinking \255 or <AD> instead of hyphens, put
`LESSCHARSET=latin1' in your environment." This tip is now obsolete and
harmful, and should definitely be removed.

I ran into a few embarrassing bugs that still haven't been fixed though
I think they have been mentioned here several times before.

The combination of "man" (version 1.5h) and "groff" (GNU troff version
1.16.1) is seriously broken in a UTF-8 locale. Even for ASCII-only man
pages, groff inserts Latin-1 SHY (soft hyphen, 0xAD) bytes, which result
in ugly malformed UTF-8 sequences. It is very disappointing that this
doesn't work correctly out-of-the-box, because the underlying groff
machinery for UTF-8 output is already in place and seems to work
correctly:

  zcat /usr/share/man/man7/groff_char.7.gz | groff -mandoc -Tutf8 - | less

produces the desired results, whereas

  man groff_char

does not.
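
A rough way to confirm the stray bytes is to count everything outside the
ASCII range in the formatted output. This is only a sketch: the function
name count_non_ascii_bytes is made up for illustration, and the check is
only meaningful for a page whose source is pure ASCII, where any byte
above 0x7F (such as a lone 0xAD) should not appear at all.

```shell
# Hypothetical helper: count the bytes >= 0x80 arriving on stdin.
# For an ASCII-only man page, a non-zero count means the formatter
# emitted something (e.g. Latin-1 SHY, 0xAD) that is not plain ASCII.
count_non_ascii_bytes() {
    # Delete all ASCII bytes, then count what is left.
    # LC_ALL=C makes tr operate on bytes rather than characters.
    LC_ALL=C tr -d '\000-\177' | wc -c | tr -d ' '
}

# Example use (assumes groff is installed and page.1 is an ASCII-only page):
#   groff -mandoc -Tlatin1 page.1 | count_non_ascii_bytes
```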

The required fix here is that groff should get a new output device
-Tplaintext which specifies plaintext encoded according to the current
locale (just query nl_langinfo(CODESET) and see whether it says "UTF-8"
or "ISO-8859-*" or something like that). Then in /etc/man.config, we
could simply replace

  NROFF           /usr/bin/groff -Tlatin1 -mandoc

with

  NROFF           /usr/bin/groff -Tplaintext -mandoc

and man would automatically work properly in both ISO-8859 and UTF-8
locales.
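
Until something like -Tplaintext exists, the same selection could be
approximated in a small wrapper script. The following is a sketch, not a
tested replacement for the NROFF line: the function names are invented
here, "locale charmap" is used as the shell-level counterpart of
nl_langinfo(CODESET), and any charset other than UTF-8 or ISO-8859-1 is
assumed to fall back to -Tascii.

```shell
# Hypothetical helper: map a locale charmap name to a groff output device.
pick_groff_device() {
    case "$1" in
        UTF-8|utf-8|UTF8|utf8)  echo utf8 ;;
        ISO-8859-1|ISO8859-1)   echo latin1 ;;
        *)                      echo ascii ;;   # conservative fallback (assumption)
    esac
}

# Hypothetical wrapper: format man page source for the current locale.
# Assumes groff is installed; extra arguments are passed through to groff.
nroff_for_locale() {
    groff "-T$(pick_groff_device "$(locale charmap)")" -mandoc "$@"
}

# Example use:
#   zcat /usr/share/man/man7/groff_char.7.gz | nroff_for_locale - | less
```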

"less" (less 358+iso247) is also still broken: in UTF-8 mode it
completely messes up the handling of the backspace overstriking that
nroff uses for bold and underlined text. This still distorts the output
of any man page. Test case:

  perl -e 'use utf8; print "a\ba_\bb\n"' | less

correctly shows a bold "a" and an underlined "b", but

  perl -e 'use utf8; print "\x{20ac}\b\x{20ac}_\b\x{2203}\n"' | less

fails to show either a bold euro sign or an underlined there-exists sign.
(Perl 5.6 or newer required here)
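
For reference, the overstrike convention being tested can be written out
with printf alone. This is a sketch with made-up function names; it only
shows the byte sequences involved. The point is that in UTF-8 the
backspace follows a multi-byte character (three bytes for the euro
sign), so a pager has to step back over one whole character rather than
one byte.

```shell
# Hypothetical helpers that emit nroff-style overstrike sequences.

# Bold: the character, a backspace, then the same character again.
bold_overstrike() {
    printf '%s\b%s' "$1" "$1"
}

# Underline: an underscore, a backspace, then the character.
underline_overstrike() {
    printf '_\b%s' "$1"
}

# Example use (the octal escapes are the UTF-8 bytes of the euro sign):
#   euro=$(printf '\342\202\254')
#   { bold_overstrike "$euro"; underline_overstrike "$euro"; echo; } | less
```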

UTF-8 locale support under X11 (XFree86 4.0.3) also still seems *very*
broken. For example, I would have hoped that

  perl -e 'use utf8; print "\x{20ac}"' | xmessage -file -

(all under LANG=en_GB.UTF-8) would show me a window with the euro sign,
but what I get instead is a display of "â\202¬". :-(

I also quickly tried vi (VIM 6.0z ALPHA) with LANG=en_GB.UTF-8, but when
I ran "vi UTF-8-demo.txt", I just got garbled text on the screen. "man
vi" did not contain the search string "uni" or "utf". I couldn't figure
out whether the vim 6.0z that comes with RH 7.1 has any UTF-8 support.
It certainly didn't work out-of-the-box.

Summary: Red Hat 7.1 is not even suitable for a 5-minute demonstration
of its UTF-8 locale support without serious embarrassment. xterm is
pretty much the only UTF-8 application that works at the moment.

Required action:

- fix less backspace bug
- fix groff to support locale-dependent selection of output encoding
  (-Tplaintext or so)
- fix man.config to use groff -Tplaintext instead of -Tlatin1
- fix xman to use an ISO10646-1 fontset when in a UTF-8 locale, so that
  the groff_char man page is shown with all characters
- make sure that LESSCHARSET is not set anywhere
- fix vi to activate UTF-8 mode in UTF-8 locale
- test the SUSE 7.2 beta to avoid the same problems there

Markus

-- 
Markus G. Kuhn, Computer Laboratory, University of Cambridge, UK
Email: mkuhn at acm.org,  WWW: <http://www.cl.cam.ac.uk/~mgk25/>

