Re: horrible utf-8 performace in wc
From: Pádraig Brady
Subject: Re: horrible utf-8 performace in wc
Date: Thu, 8 May 2008 02:06:08 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)
Bo Borgerson wrote:
> Pádraig Brady wrote:
>> In the first 65535 code points there are also 404 chars which are
>> not classed as combining in the unicode database, but are classed
>> as zero width in the glibc locale data at least (zero-width space
>> being one of them like you mentioned). I determined this with the
>> attached progs:
>>
>> ./zw | python unidata.py | grep " 0 " | wc -l
>
>
> Hi Pádraig,
>
> Wow, I knew there were some stand-alone zero-width characters, but I had
> no idea there were so many!
I'm not sure whether many of those should be counted anyway,
but the combining class is all we have to go on.
>
> I poked around a little in gnulib and found a function for determining
> the combining class of a Unicode character.
>
> I think the attached patch does what you were intending to do, and it
> also counts all of the stand-alone zero-width characters you found:
cool, thanks.
Could you optimize it, though, and do the following,
since you've already calculated wcwidth()?
  if (!width && uc_combining_class(wide_char))
    chars--;
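To show the intent of that check in context, here is a minimal, self-contained sketch. It is not the actual wc code: width_stub() and combining_class_stub() are hypothetical stand-ins for wcwidth() and gnulib's uc_combining_class(), covering only a few code points, and count_chars() is an invented name for the counting loop. The point is that the combining-class lookup only runs for zero-width characters, and that a zero-width character with combining class 0 (such as zero-width space) is still counted.

```c
#include <stddef.h>

/* Hypothetical stand-in for wcwidth(): reports width 0 for the combining
   diacritical marks block (U+0300..U+036F) and for U+200B ZERO WIDTH SPACE,
   and width 1 for everything else.  Illustration only. */
static int width_stub(unsigned int cp)
{
    if ((cp >= 0x0300 && cp <= 0x036F) || cp == 0x200B)
        return 0;
    return 1;
}

/* Hypothetical stand-in for gnulib's uc_combining_class(): returns a
   nonzero placeholder for U+0300..U+036F and 0 otherwise.  The real
   classes vary; only zero versus nonzero matters for this check. */
static int combining_class_stub(unsigned int cp)
{
    return (cp >= 0x0300 && cp <= 0x036F) ? 1 : 0;
}

/* Count characters, skipping those that are both zero width and
   classed as combining. */
size_t count_chars(const unsigned int *cps, size_t n)
{
    size_t chars = 0;
    for (size_t i = 0; i < n; i++) {
        int width = width_stub(cps[i]);
        chars++;
        /* Short-circuit: the class lookup happens only when width is 0. */
        if (!width && combining_class_stub(cps[i]))
            chars--;
    }
    return chars;
}
```

With the input 'e', U+0301, U+200B, 'a', this sketch counts 3: the combining acute accent is dropped, while the stand-alone zero-width space is kept.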
I did notice that wcwidth(0x1B44) returns 1, but I think that is because
this combining char is new in Unicode version 5.0, and my locale tables
are probably not up to date. Search for "adeg adeg" here:
http://unicode.org/versions/Unicode5.0.0/ch11.pdf
I also noticed the gnulib/uniwidth/ functions, which might be more up to date
and calculate the width of 0x1B44 correctly as 0?
thanks again,
Pádraig