Re: horrible utf-8 performace in wc

bug-coreutils

From:	Bruno Haible
Subject:	Re: horrible utf-8 performace in wc
Date:	Thu, 8 May 2008 15:20:54 +0200
User-agent:	KMail/1.5.4

> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>                             linepos += width;
>                           if (iswspace (wide_char))
>                             goto mb_word_separator;
> +                         else if (uc_combining_class (wide_char) != 0)
> +                           chars--; /* don't count combining chars */
>                           in_word = true;
>                         }
>                       break;

If you want a tool to ignore combining characters (not 'wc -m', since 'wc -m'
is not specified to behave like this, see the other mail), then
uc_combining_class from gnulib is a usable API.

However, in this patch you are assuming a UTF-8 locale. Recall that on some
systems (Solaris, FreeBSD, ...), in an EUC-JP locale for example, the
wide-character representation of a double-byte character is unrelated to
Unicode: the mbrtowc routine just combines the two bytes into a single wchar_t
with a bit of shifting and masking; no conversion to Unicode takes place.
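
To make the portability concern concrete, here is a minimal sketch of a guard
that treats a wchar_t as a Unicode code point only when the implementation
promises that this is valid. It relies on the standard C99 macro
__STDC_ISO_10646__; an implementation whose wchar_t values depend on the locale
is expected to leave that macro undefined. The helper name wchar_to_ucs4 is
hypothetical and is not part of wc or gnulib.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <wchar.h>

/* Hypothetical helper: store WC as a Unicode code point in *UC, but only
   when the implementation defines __STDC_ISO_10646__, i.e. promises that
   wchar_t values are ISO 10646 code points.  Otherwise return false; the
   caller then has to convert from the locale's encoding by other means.  */
static bool
wchar_to_ucs4 (wchar_t wc, uint32_t *uc)
{
  assert (uc != NULL);
#ifdef __STDC_ISO_10646__
  *uc = (uint32_t) wc;
  return true;
#else
  (void) wc;
  return false;
#endif
}
```

A caller would consult a Unicode property such as the combining class only
when this returns true, and fall back to an explicit conversion otherwise.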

If you want to convert a byte sequence from the locale's encoding to a
sequence of Unicode characters, in order to use uc_combining_class and similar
API, you can do so through the gnulib function u32_conv_from_encoding
(using locale_charset() as encoding). It's defined in gnulib's "uniconv.h" file.
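
As a rough illustration of that conversion step, the sketch below turns a byte
sequence in a named encoding into an array of Unicode code points. It is a
stand-in that uses POSIX iconv rather than u32_conv_from_encoding itself, so
that it builds without gnulib; the helper name to_ucs4 and its buffer
convention are assumptions made for this example only.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: convert SRCLEN bytes at SRC, encoded in FROMCODE,
   into Unicode code points stored in DST (room for DSTCAP entries).
   Returns the number of code points written, or (size_t) -1 if the
   encoding is unknown, the input is invalid or incomplete, or DST is
   too small.  */
static size_t
to_ucs4 (const char *fromcode, const char *src, size_t srclen,
         uint32_t *dst, size_t dstcap)
{
  assert (dst != NULL);

  /* Ask for big-endian UTF-32 so the byte order is known and no BOM
     is emitted; the bytes are reassembled into integers below.  */
  iconv_t cd = iconv_open ("UTF-32BE", fromcode);
  if (cd == (iconv_t) -1)
    return (size_t) -1;

  size_t outcap = dstcap * 4;
  unsigned char *buf = malloc (outcap ? outcap : 1);
  if (buf == NULL)
    {
      iconv_close (cd);
      return (size_t) -1;
    }

  char *in = (char *) src;      /* iconv takes a non-const pointer */
  size_t inleft = srclen;
  char *out = (char *) buf;
  size_t outleft = outcap;
  size_t r = iconv (cd, &in, &inleft, &out, &outleft);
  iconv_close (cd);
  if (r == (size_t) -1)
    {
      free (buf);
      return (size_t) -1;
    }

  /* Each code point occupies four bytes, most significant first.  */
  size_t n = (outcap - outleft) / 4;
  for (size_t i = 0; i < n; i++)
    dst[i] = ((uint32_t) buf[4 * i] << 24) | ((uint32_t) buf[4 * i + 1] << 16)
             | ((uint32_t) buf[4 * i + 2] << 8) | (uint32_t) buf[4 * i + 3];
  free (buf);
  return n;
}
```

In a real tool the fromcode argument would come from the locale, for example
locale_charset() as suggested above, and each resulting code point could then
be passed to uc_combining_class.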

Bruno
