bug-coreutils

Re: horrible utf-8 performance in wc


From: Pádraig Brady
Subject: Re: horrible utf-8 performance in wc
Date: Thu, 8 May 2008 02:06:08 +0100
User-agent: Thunderbird 2.0.0.6 (X11/20071008)

Bo Borgerson wrote:
> Pádraig Brady wrote:
>> In the first 65535 code points there are also 404 chars which are
>> not classed as combining in the unicode database, but are classed
>> as zero width in the glibc locale data at least (zero-width space
>> being one of them like you mentioned). I determined this with the
>> attached progs:
>>
>> ./zw | python unidata.py | grep " 0 " | wc -l
> 
> 
> Hi Pádraig,
> 
> Wow, I knew there were some stand-alone zero-width characters, but I had
> no idea there were so many!

I'm not sure whether many of those should be counted anyway,
but the combining class is all we have to go on.

> 
> I poked around a little in gnulib and found a function for determining
> the combining class of a Unicode character.
> 
> I think the attached patch does what you were intending to do, and it
> also counts all of the stand-alone zero-width characters you found:

Cool, thanks.
Could you optimize it though, and do the following,
since you've already calculated wcwidth()?

  if (!width && uc_combining_class(wide_char))
    chars--;
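
To make the intent of that test concrete, here is a minimal, self-contained sketch of the counting loop. It is not the actual wc patch: sample_width() and sample_combining_class() are hypothetical stand-ins for wcwidth() and gnulib's uc_combining_class(), hard-coded for two sample code points so the result does not depend on the locale tables. count_chars() is likewise an illustrative name.

```c
#include <stddef.h>
#include <wchar.h>

/* Hypothetical stand-in for wcwidth(): column width of a few sample
   code points.  A real implementation would consult the locale.  */
int
sample_width (wchar_t wc)
{
  switch (wc)
    {
    case 0x0301:                /* COMBINING ACUTE ACCENT */
    case 0x200B:                /* ZERO WIDTH SPACE */
      return 0;
    default:
      return 1;
    }
}

/* Hypothetical stand-in for gnulib's uc_combining_class(): nonzero
   only for the one combining mark used in this sketch.  */
int
sample_combining_class (wchar_t wc)
{
  return wc == 0x0301 ? 230 : 0;
}

/* Count characters, skipping only those that are both zero width
   and have a nonzero combining class.  The cheap width test runs
   first, so the combining-class lookup is rarely reached.  */
size_t
count_chars (const wchar_t *s, size_t n)
{
  size_t chars = 0;
  for (size_t i = 0; i < n; i++)
    {
      int width = sample_width (s[i]);
      chars++;
      if (!width && sample_combining_class (s[i]))
        chars--;
    }
  return chars;
}
```

With the input 'e', U+0301, U+200B, 'x' this yields 3: the combining accent is not counted, while the stand-alone zero-width space still is.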

I did notice that wcwidth(0x1B44) returns 1, but I think that is because
this combining char is new in Unicode version 5.0 and my locale tables
are probably not up to date. Search for "adeg adeg" here:
http://unicode.org/versions/Unicode5.0.0/ch11.pdf
I also noticed the gnulib/uniwidth/ functions, which might be more up to date.
Perhaps they calculate the width of 0x1B44 correctly as 0?

thanks again,
Pádraig



