bug-coreutils

Re: horrible utf-8 performace in wc


From: Bruno Haible
Subject: Re: horrible utf-8 performace in wc
Date: Thu, 8 May 2008 15:07:00 +0200
User-agent: KMail/1.5.4

Pádraig Brady wrote:
> mbstowcs doesn't canonicalize equivalent multibyte sequences,
> and so therefore functions the same in this regard as our
> processing of each wide character separately.
> This could be considered a bug actually- i.e. should -m give
> the number of wide chars, or the number of multibyte chars?
> With the attached patch, `wc -m` gives 23 chars for both these lines.

POSIX [1] specifies that "wc -m" outputs the "number of characters". It
also describes how the locale determines what a character is:
  LC_CTYPE
    Determine the locale for the interpretation of sequences of bytes of text
    data as characters (for example, single-byte as opposed to multi-byte
    characters in arguments and input files) and which characters are defined
    as white space characters.

The definition of "Character" in [2] means a multibyte character. IMO it
cannot be interpreted to mean a glyph, a grapheme cluster, or a screen
column. Rather, it is the unit that is processed by one call to mbtowc [3]
or mbrtowc [4].

As a consequence:
  - The number of characters is the same as the number of wide characters.
  - "wc -m" must output the number of characters.
  - In a Unicode locale, <U00E9> is one character, and <U0065><U0301> is
    two characters,
    * even if they are canonically equivalent (because POSIX does not make
      reference to this concept), and
    * even if they render the same on the screen (because except for Curses,
      POSIX does not refer to the rendering of characters).
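
To make the distinction concrete, here is a minimal sketch in Python. It is
only an illustration, not what wc does internally: it assumes that the
length of a Python str (a count of Unicode code points) matches the number
of wide characters mbrtowc would produce in a UTF-8 locale. The function
names are made up for this example.

```python
import unicodedata


def count_characters(text: str) -> int:
    """Count code points, one per character in the POSIX sense (assumed)."""
    return len(text)


def count_after_nfc(text: str) -> int:
    """Count code points after canonical composition (NFC)."""
    return len(unicodedata.normalize("NFC", text))


precomposed = "\u00e9"   # <U00E9>
decomposed = "e\u0301"   # <U0065><U0301>

if __name__ == "__main__":
    for label, s in (("precomposed", precomposed), ("decomposed", decomposed)):
        print(label,
              "bytes:", len(s.encode("utf-8")),
              "characters:", count_characters(s),
              "after NFC:", count_after_nfc(s))
```

The two strings differ in byte count and in character count, and agree only
once they have been canonicalized.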

If you want wc to count characters after canonicalization, then you can
invent a new wc command-line option for it. But I would find it more useful
to have a filter program that reads from standard input and writes the
canonicalized output to standard output; that would be applicable in many
more situations.
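
Such a filter could look roughly like the following sketch. It is
hypothetical, not an existing coreutils program: the name normalize_lines
and the choice of NFC as the default form are assumptions made for this
example, and the decoding of standard input follows whatever encoding
Python derives from the environment.

```python
import sys
import unicodedata


def normalize_lines(lines, form="NFC"):
    """Yield each input line converted to the given normalization form.

    Working line by line is safe here because a newline never combines
    with the characters around it.
    """
    for line in lines:
        yield unicodedata.normalize(form, line)


if __name__ == "__main__":
    # Read from standard input, write the canonicalized text to
    # standard output.
    sys.stdout.writelines(normalize_lines(sys.stdin))
```

Assuming the sketch is saved as normalize.py (a made-up file name), one
would get the canonicalized count with something like
"python3 normalize.py < file | wc -m", and wc itself stays unchanged.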

Bruno

[1] http://www.opengroup.org/susv3/utilities/wc.html
[2] http://www.opengroup.org/susv3/basedefs/xbd_chap03.html
[3] http://www.opengroup.org/susv3/functions/mbtowc.html
[4] http://www.opengroup.org/susv3/functions/mbrtowc.html




