tr is handling bytes not characters

From: Nick Demou
Subject: tr is handling bytes not characters
Date: Thu, 5 Feb 2009 13:20:28 +0200

First many THANKS for all this work you've done for all of us.

And now the bug report, which is about "tr". I realized that tr
mostly fails when working on UTF-8 input. Although my personal point
of view is that this is a critical missing feature[1], what I consider
a solid bug is that the documentation gives no clue about the
situation. The basic problem is the use of the ambiguous word
"character" instead of the more appropriate "ASCII character" in the
man and info pages. E.g., here's some text from the man page:

#     tr - translate or delete characters
#     Translate, squeeze, and/or delete characters from standard
#     input, writing to standard output.

My suggestion is to do either one of the following (or maybe both):
 a) replace at least the first, if not all, appearances of the word
"character" with "ASCII character"
 b) place a clear warning like the following in every page:
    WARNING: the use of the word "character" in this document refers to
ASCII characters (i.e. bytes). In other words, this program does NOT
support multi-byte characters. If, for example, you're passing UTF-8
encoded input and/or parameters, then the program may behave in a
non-intuitive way because it will process each byte of a multi-byte
UTF-8 character as an independent character.
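
To make the behavior described in that warning concrete, here is a
minimal Python sketch (not tr's actual code) that emulates a byte-wise
translation and compares it with a character-wise one. The function
names bytewise_tr and charwise_tr are hypothetical, and padding SET2 by
repeating its last element is an assumption modeled on how GNU tr
treats a SET2 that is shorter than SET1.

```python
def bytewise_tr(text, set1, set2):
    """Emulate a translation that treats every byte as a 'character'.

    Assumption: set2 is non-empty and, when shorter than set1, is
    padded by repeating its last byte.
    """
    src = set1.encode("utf-8")
    dst = set2.encode("utf-8")
    dst = (dst + dst[-1:] * (len(src) - len(dst)))[:len(src)]
    table = bytes.maketrans(src, dst)
    return text.encode("utf-8").translate(table)


def charwise_tr(text, set1, set2):
    """Translate whole Unicode characters, for comparison."""
    set2 = (set2 + set2[-1:] * (len(set1) - len(set2)))[:len(set1)]
    return text.translate(str.maketrans(set1, set2))


if __name__ == "__main__":
    # "é" is the two bytes C3 A9 in UTF-8, so both bytes get mapped to "e".
    print(bytewise_tr("café", "é", "e"))
    # "ï" is C3 AF: it shares the byte C3 with "é", so it is damaged too.
    print(bytewise_tr("naïve", "é", "e"))
    # Character-wise translation only touches the requested character.
    print(charwise_tr("café", "é", "e"))
    print(charwise_tr("naïve", "é", "e"))
```

In the byte-wise case the single character "é" turns into two output
bytes, and an unrelated character such as "ï" is corrupted into a
sequence that is no longer valid UTF-8, which is the non-intuitive
behavior the proposed warning is about.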

A similar note in the coreutils FAQ would IMHO also be really suitable.

If there is anything I can do to help, don't hesitate to ask (note,
however, that Linux / C programming is unfortunately not part of my

Nick Demou

[1] I consider it critical because most if not all GNU-Linux
distributions use UTF-8 as the default (and it seems that there is a
widespread assumption that GNU-Linux UTF-8 support is quite mature).

"The software is licensed, not sold" -- MICROSOFT LICENSE TERMS
