Re: horrible utf-8 performace in wc

bug-coreutils

From:	Pádraig Brady
Subject:	Re: horrible utf-8 performace in wc
Date:	Thu, 8 May 2008 02:06:08 +0100
User-agent:	Thunderbird 2.0.0.6 (X11/20071008)

Bo Borgerson wrote:
> Pádraig Brady wrote:
>> In the first 65535 code points there are also 404 chars which are
>> not classed as combining in the unicode database, but are classed
>> as zero width in the glibc locale data at least (zero-width space
>> being one of them like you mentioned). I determined this with the
>> attached progs:
>>
>> ./zw | python unidata.py | grep " 0 " | wc -l
> 
> 
> Hi Pádraig,
> 
> Wow, I knew there were some stand-alone zero-width characters, but I had
> no idea there were so many!

I'm not sure whether many of those should be counted anyway,
but the combining class is all we have to go on.

> 
> I poked around a little in gnulib and found a function for determining
> the combining class of a Unicode character.
> 
> I think the attached patch does what you were intending to do, and it
> also counts all of the stand-alone zero-width characters you found:

Cool, thanks.
Could you optimize it though, and do the following,
since you've already calculated wcwidth()?

  if (!width && uc_combining_class(wide_char))
    chars--;
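
To make the intent of that check concrete, here is a rough, self-contained sketch of the counting loop. The names demo_width(), demo_combining_class() and count_chars() are made up for illustration: the first two are tiny stand-ins for wcwidth() and gnulib's uc_combining_class() that only know about a handful of code points, so treat this as a sketch of the idea and not as the actual wc patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for gnulib's uc_combining_class(): reports a
   nonzero combining class only for U+0300..U+0314, and 0 for everything
   else.  A real build would call the gnulib function instead.  */
int
demo_combining_class (uint32_t cp)
{
  return (0x0300 <= cp && cp <= 0x0314) ? 230 : 0;
}

/* Hypothetical stand-in for wcwidth(): width 0 for the same range and
   for U+200B ZERO WIDTH SPACE, width 1 for everything else.  */
int
demo_width (uint32_t cp)
{
  if ((0x0300 <= cp && cp <= 0x0314) || cp == 0x200B)
    return 0;
  return 1;
}

/* Count characters, skipping zero-width combining characters.
   Stand-alone zero-width characters (combining class 0) still count.  */
size_t
count_chars (const uint32_t *cps, size_t n)
{
  assert (cps != NULL || n == 0);
  size_t chars = 0;
  for (size_t i = 0; i < n; i++)
    {
      int width = demo_width (cps[i]);
      chars++;
      /* The width is already known, so the combining class lookup is
         only needed when the width is 0.  */
      if (!width && demo_combining_class (cps[i]))
        chars--;
    }
  return chars;
}
```

With these stand-ins, 'e' followed by U+0301 counts as one character, while U+200B between two letters still counts as a character of its own. The short-circuit on !width means the combining class is never looked up for the common case of characters with nonzero width.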

I did notice that wcwidth(0x1B44) returns 1, but I think that is because
this combining char is new in Unicode version 5.0 and my locale tables
are probably not up to date. Search for "adeg adeg" here:
http://unicode.org/versions/Unicode5.0.0/ch11.pdf
I also noticed the gnulib/uniwidth/ functions, which might be more up to date
and might correctly give a width of 0 for 0x1B44?

thanks again,
Pádraig
