Re: horrible utf-8 performace in wc
From: Bruno Haible
Subject: Re: horrible utf-8 performace in wc
Date: Thu, 8 May 2008 15:20:54 +0200
User-agent: KMail/1.5.4
> @@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
>            linepos += width;
>            if (iswspace (wide_char))
>              goto mb_word_separator;
> +          else if (uc_combining_class (wide_char) != 0)
> +            chars--; /* don't count combining chars */
>            in_word = true;
>          }
>        break;
If you want a tool to ignore combining characters (not 'wc -m', since 'wc -m'
is not specified to behave like this, see the other mail), then
uc_combining_class from gnulib is a usable API.
However, in this patch you are assuming a UTF-8 locale. Recall that on some
systems (Solaris, FreeBSD, ...) in EUC-JP locale for example, the wide-character
representation of a double-byte character is unrelated to Unicode: the mbrtowc
routine just combines the two bytes in a single wchar_t with a bit of shifting
and masking; no conversion to Unicode takes place here.
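To make the pitfall concrete, here is a minimal sketch in standard C. It relies on the C99 macro __STDC_ISO_10646__, which an implementation defines only when wchar_t values are ISO 10646 (Unicode) code points in every locale; an implementation whose wchar_t encoding depends on the locale is not supposed to define it. The helper names wchar_is_unicode and first_wide_char are made up for this illustration and are not part of wc or gnulib.

```c
#include <stdbool.h>
#include <string.h>
#include <wchar.h>

/* True if the implementation promises that wchar_t values are
   Unicode code points in every locale (C99 __STDC_ISO_10646__).  */
static bool
wchar_is_unicode (void)
{
#ifdef __STDC_ISO_10646__
  return true;
#else
  return false;
#endif
}

/* Hypothetical helper: decode the first multibyte character of S
   (at most N bytes) in the current locale.  Returns the number of
   bytes consumed, or 0 for empty, incomplete or invalid input.  */
static size_t
first_wide_char (const char *s, size_t n, wchar_t *wc)
{
  mbstate_t state;
  memset (&state, 0, sizeof state);

  size_t len = mbrtowc (wc, s, n, &state);
  if (len == (size_t) -1 || len == (size_t) -2 || len == 0)
    return 0;
  return len;
}
```

A caller would pass the result of first_wide_char to a Unicode property function only when wchar_is_unicode () returns true; otherwise the value is just an opaque, locale-specific number.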
If you want to convert a byte sequence from the locale's encoding to a
sequence of Unicode characters, in order to use uc_combining_class and similar
API, you can do so through the gnulib function u32_conv_from_encoding
(passing locale_charset() as the source encoding). It is declared in gnulib's
"uniconv.h" header.
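The idea of converting first and querying Unicode properties afterwards can be sketched as follows. Note that this is a stand-in built on POSIX iconv, not the gnulib function itself, so that it compiles without gnulib; with gnulib available, u32_conv_from_encoding would take the place of the iconv calls. The function name to_code_points is hypothetical, and the sketch assumes that the iconv implementation knows the converter name "UCS-4BE" and that the source encoding yields at most one code point per input byte.

```c
#include <iconv.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: convert SRC (SRCLEN bytes in encoding FROMCODE)
   to an array of Unicode code points.  Returns a malloc'ed array and
   stores its length in *LENGTHP, or returns NULL on any failure.  */
static uint32_t *
to_code_points (const char *fromcode, const char *src, size_t srclen,
                size_t *lengthp)
{
  iconv_t cd = iconv_open ("UCS-4BE", fromcode);
  if (cd == (iconv_t) -1)
    return NULL;

  /* Assumption: at most one code point (4 bytes) per input byte.  */
  size_t outsize = 4 * srclen + 4;
  unsigned char *out = malloc (outsize);
  uint32_t *result = NULL;

  if (out != NULL)
    {
      char *inptr = (char *) src;   /* iconv does not modify the input */
      size_t inleft = srclen;
      char *outptr = (char *) out;
      size_t outleft = outsize;

      if (iconv (cd, &inptr, &inleft, &outptr, &outleft) != (size_t) -1
          && inleft == 0)
        {
          size_t n = (outsize - outleft) / 4;
          result = malloc ((n + 1) * sizeof *result);
          if (result != NULL)
            {
              /* Assemble big-endian bytes independently of host order.  */
              for (size_t i = 0; i < n; i++)
                result[i] = ((uint32_t) out[4 * i] << 24)
                            | ((uint32_t) out[4 * i + 1] << 16)
                            | ((uint32_t) out[4 * i + 2] << 8)
                            | (uint32_t) out[4 * i + 3];
              *lengthp = n;
            }
        }
      free (out);
    }

  iconv_close (cd);
  return result;
}
```

In a program that has called setlocale (LC_ALL, ""), the first argument could be nl_langinfo (CODESET) from <langinfo.h>, which plays roughly the role that locale_charset() plays in gnulib. Each element of the returned array is then a genuine Unicode code point, whatever the locale's wchar_t representation looks like.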
Bruno