bug-coreutils

Re: horrible utf-8 performace in wc


From: Bruno Haible
Subject: Re: horrible utf-8 performace in wc
Date: Thu, 8 May 2008 15:20:54 +0200
User-agent: KMail/1.5.4

> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>                             linepos += width;
>                           if (iswspace (wide_char))
>                             goto mb_word_separator;
> +                         else if (uc_combining_class (wide_char) != 0)
> +                           chars--; /* don't count combining chars */
>                           in_word = true;
>                         }
>                       break;

If you want a tool that ignores combining characters (not 'wc -m', which is
not specified to behave this way; see the other mail), then
uc_combining_class from gnulib is a usable API.

However, in this patch you are assuming a UTF-8 locale. Recall that on some
systems (Solaris, FreeBSD, ...), in an EUC-JP locale for example, the
wide-character representation of a double-byte character is unrelated to
Unicode: the mbrtowc routine just combines the two bytes into a single wchar_t
with a bit of shifting and masking; no conversion to Unicode takes place.
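
To make the portability concern concrete, here is a minimal sketch in C. It
relies on the standard macro __STDC_ISO_10646__, which an implementation
defines only when it promises that wchar_t values are ISO 10646 (Unicode) code
points in every locale. The helper names wchar_is_unicode and decode_first are
hypothetical, chosen for illustration; they are not part of coreutils or
gnulib.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* Hypothetical helper: true if the implementation promises that wchar_t
   values are ISO 10646 (Unicode) code points in all locales.  When the
   macro is absent, nothing may be assumed about the wchar_t encoding.  */
bool
wchar_is_unicode (void)
{
#ifdef __STDC_ISO_10646__
  return true;
#else
  return false;
#endif
}

/* Hypothetical helper: decode the first multibyte character of S (at most
   N bytes) in the current locale.  Returns what mbrtowc returns: the number
   of bytes consumed, 0 for a null character, or (size_t) -1 / (size_t) -2
   for an invalid or incomplete sequence.  */
size_t
decode_first (const char *s, size_t n, wchar_t *wc)
{
  assert (s != NULL && wc != NULL);

  mbstate_t state;
  memset (&state, 0, sizeof state);   /* start in the initial shift state */
  return mbrtowc (wc, s, n, &state);
}
```

Only when wchar_is_unicode () returns true is it safe to hand the wchar_t
produced by decode_first directly to a function that expects a Unicode code
point, such as uc_combining_class.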

If you want to convert a byte sequence from the locale's encoding to a
sequence of Unicode characters, in order to use uc_combining_class and similar
API, you can do so through the gnulib function u32_conv_from_encoding
(passing locale_charset() as the source encoding). It is declared in gnulib's
"uniconv.h" file.
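
The convert-then-classify approach can be sketched as follows. To keep the
example buildable without gnulib, it swaps in POSIX iconv for
u32_conv_from_encoding, and a deliberately crude range check for
uc_combining_class; both substitutions are assumptions made for illustration
only. The names count_base_chars and is_combining_sketch are hypothetical.

```c
#include <assert.h>
#include <iconv.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for gnulib's uc_combining_class (): it recognizes
   only the Combining Diacritical Marks block, U+0300..U+036F.  */
static int
is_combining_sketch (uint32_t uc)
{
  return uc >= 0x0300 && uc <= 0x036F;
}

/* Convert SRC (SRCLEN bytes in encoding FROMCODE) to UTF-32 and count the
   code points that are not combining marks.  Returns -1 on any failure.  */
long
count_base_chars (const char *fromcode, const char *src, size_t srclen)
{
  assert (fromcode != NULL && src != NULL);

  iconv_t cd = iconv_open ("UTF-32BE", fromcode);
  if (cd == (iconv_t) -1)
    return -1;

  /* Assumption: each input byte yields at most one code point (4 output
     bytes).  If that does not hold, iconv fails and we return -1.  */
  size_t outsize = 4 * srclen + 4;
  unsigned char *out = malloc (outsize);
  if (out == NULL)
    {
      iconv_close (cd);
      return -1;
    }

  char *inptr = (char *) src;         /* iconv does not modify the input */
  char *outptr = (char *) out;
  size_t inleft = srclen;
  size_t outleft = outsize;
  size_t rc = iconv (cd, &inptr, &inleft, &outptr, &outleft);
  iconv_close (cd);
  if (rc == (size_t) -1 || inleft != 0)
    {
      free (out);
      return -1;
    }

  /* Decode the big-endian UTF-32 output one code point at a time.  */
  long count = 0;
  size_t nbytes = outsize - outleft;
  for (size_t i = 0; i + 4 <= nbytes; i += 4)
    {
      uint32_t uc = ((uint32_t) out[i] << 24) | ((uint32_t) out[i + 1] << 16)
                    | ((uint32_t) out[i + 2] << 8) | (uint32_t) out[i + 3];
      if (!is_combining_sketch (uc))
        count++;
    }
  free (out);
  return count;
}
```

In a real program the FROMCODE argument would be the locale's encoding, for
example the result of locale_charset () after a call to setlocale (LC_ALL, "").
Because the classification happens on the converted UTF-32 values, it does not
depend on how the platform represents wchar_t.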

Bruno




