bug-coreutils

Re: tr is handling bytes not characters


From: Jim Meyering
Subject: Re: tr is handling bytes not characters
Date: Wed, 11 Feb 2009 07:20:16 +0100

Nick Demou <address@hidden> wrote:
> On Tue, Feb 10, 2009 at 12:59 PM, Jim Meyering <address@hidden> wrote:
>> Nick Demou <address@hidden> wrote:
>>> [...]
>>> Thanks for the info Eric. I was almost sure this would be the case. In
>>> fact I don't consider this as the main topic of my bug report. The
>>> main topic for me is the documentation. The man and info page don't
>>> make it clear that utf-8 is not supported. I believe that others after
>>> me will spend a lot of time just to realize that "it's just a missing
>>> feature".  Do you have any thoughts regarding my suggestions on the
>>> documentation?
>>
>> The "real" documentation is in coreutils.texi (generated to
>> coreutils.info and available via "info coreutils").  There,
>> under "tr invocation", it already has this caveat:
>
> oops, mea culpa
> I did read the man page carefully, and then I searched the coreutils
> info pages before submitting this bug report. However, I only searched
> for "utf" and "unicode", so I missed the warning, which contains
> neither of the two strings
>
>> and since "man tr" does point to the authoritative source [the info pages]:
>> [...]
>> that may be enough.
>
> I think it is for English-speaking users, but not for non-English-
> speaking ones who often have to deal with actual[1] UTF-8 text. I would
> suggest the following small corrections:
>
> A. for the info page
> ====================
>
> add a direct reference to UTF-8 and Unicode like this:
>
> from:
> #   Currently `tr' fully supports only single-byte characters.
> # Eventually it will support multibyte characters;
>
> to:
> #   Currently `tr' fully supports only single-byte characters.
> # Eventually it will support multibyte characters (e.g. UTF-8
> # or UTF-16 encoded Unicode characters);
>
> B. for the man page
> ===================
>
> add a reference like this:
>
> #  Currently `tr' fully supports only single-byte characters.
> # (Notable examples of multibyte characters that are not
> # supported are UTF-8 and UTF-16 encoded Unicode characters.)

The trouble is that this would not be true on certain distributions
(as opposed to upstream sources), since they carry patches that add
multibyte support.  The affected distros would then have to be
sure to remove your just-added caveat.  Still, it's feasible.

Ok.  For A and B, please send patches.
The procedure for doing that is described here:
  http://git.sv.gnu.org/cgit/coreutils.git/plain/HACKING
For A, you'd modify doc/coreutils.texi.
For B, you'd modify src/tr.c's usage function,
since its --help output is converted automatically to the man page.
For tr.c, try to keep the text as succinct as possible,
and add a new, separate fputs statement containing just your new
line or two.  I.e., don't append to an existing string.
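
To illustrate the "separate fputs statement" point, here is a minimal
sketch.  It is not the real usage function from src/tr.c: the function
name emit_help_sketch, the identity definition of the _() macro, and
the placeholder help text are all assumptions made so that the fragment
compiles on its own.  The caveat wording is taken from proposal B above.

```c
#include <stdio.h>

/* Assumption: stand-in for the gettext-style _() macro, defined here
   as the identity so the sketch builds without any other headers.  */
#define _(msgid) (msgid)

/* Hypothetical helper, not the actual usage() in src/tr.c.
   Returns a non-negative value on success, EOF on a write error.  */
int
emit_help_sketch (FILE *out)
{
  /* Placeholder for an already existing help string; left untouched.  */
  if (fputs (_("(existing help text stays as it is)\n"), out) == EOF)
    return EOF;

  /* The new caveat gets its own fputs call and its own string,
     rather than being appended to the string above.  */
  return fputs (_("\nCurrently tr fully supports only single-byte characters.\n"),
                out);
}
```

Calling emit_help_sketch (stdout) prints the placeholder line followed
by the new caveat as a separate paragraph.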

> C. for the coreutils FAQ
> ========================
>
> add a Question like this one:
>
> # Q: What's the status of Unicode support?
>
> (for which I cannot suggest a thorough answer, although I could try to
> dig something out of the current documentation if no one else is able
> to help at the moment)
>
> or
>
> # Q: I get funny/no/wrong results when dealing with
> #    UTF-8/Unicode input
>
> # A: UTF-8 and UTF-16 encodings of Unicode text are made up
> #    of multibyte characters, which are not well supported
> #    by some coreutils programs.

in upstream sources.
They *are* supported in certain patched (distribution-specific) versions
of coreutils.
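
For readers unsure what "multibyte" means in practice, the following
sketch shows why a tool that works on bytes sees more units than a
reader sees characters.  The helper name utf8_char_count is
hypothetical, and the code assumes its input is valid UTF-8; it is an
illustration only and has nothing to do with how tr is implemented.

```c
#include <stddef.h>

/* Hypothetical helper: count the characters in a NUL-terminated string,
   assuming it is valid UTF-8.  Every byte except a continuation byte
   (bit pattern 10xxxxxx) starts a new character.  */
size_t
utf8_char_count (const char *s)
{
  size_t n = 0;
  for (; *s; s++)
    if (((unsigned char) *s & 0xC0) != 0x80)
      n++;
  return n;
}
```

For the two-byte sequence "\xc3\xa9" (e with acute accent in UTF-8),
strlen reports 2 bytes while utf8_char_count reports 1 character, so a
byte-oriented program treats that one character as two separate units.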



