Re: horrible utf-8 performace in wc

From: Pádraig Brady
Subject: Re: horrible utf-8 performace in wc
Date: Thu, 8 May 2008 02:06:08 +0100
User-agent: Thunderbird (X11/20071008)

Bo Borgerson wrote:
> Pádraig Brady wrote:
>> In the first 65535 code points there are also 404 chars which are
>> not classed as combining in the unicode database, but are classed
>> as zero width in the glibc locale data at least (zero-width space
>> being one of them like you mentioned). I determined this with the
>> attached progs:
>> ./zw | python unidata.py | grep " 0 " | wc -l
> Hi Pádraig,
> Wow, I knew there were some stand-alone zero-width characters, but I had
> no idea there were so many!

I'm not sure whether many of those should be counted anyway.
But the combining class is all we have to go on.

> I poked around a little in gnulib and found a function for determining
> the combining class of a Unicode character.
> I think the attached patch does what you were intending to do, and it
> also counts all of the stand-alone zero-width characters you found:

cool, thanks.
Could you optimize it, though, by doing the following,
since you've already calculated wcwidth()?

  if (!width && uc_combining_class(wide_char))
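
A minimal sketch of that short-circuit, for illustration only. It assumes
the width from wcwidth() is already in hand, and it uses a hypothetical
combining_class_stub() in place of gnulib's uc_combining_class() so that it
compiles without gnulib. The helper name is_zero_width_combining() is also
made up here and is not taken from the patch.

```c
#include <stdbool.h>
#include <wchar.h>

/* Hypothetical stand-in for gnulib's uc_combining_class().  It only knows
   a few code points; the real function consults the Unicode database.  */
static int
combining_class_stub (wchar_t wc)
{
  switch (wc)
    {
    case 0x0301: return 230;  /* COMBINING ACUTE ACCENT */
    case 0x1B44: return 9;    /* BALINESE ADEG ADEG */
    default:     return 0;    /* e.g. U+200B ZERO WIDTH SPACE */
    }
}

/* WIDTH is the value already returned by wcwidth() for WC.  Because of
   the && short-circuit, the combining class lookup only happens for
   characters that are zero width.  */
static bool
is_zero_width_combining (int width, wchar_t wc)
{
  return !width && combining_class_stub (wc);
}
```

With this ordering, a stand-alone zero-width character such as U+200B
fails the test because its combining class is 0, and a character whose
reported width is 1 never reaches the lookup at all.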

I did notice that wcwidth(0x1B44) returns 1, but I think that is because
this combining char is new in Unicode version 5.0, and my locale tables
are probably not up to date. Search for "adeg adeg" here:
I also noticed the gnulib/uniwidth/ functions, which might be more up to date.
Do they calculate the width of 0x1B44 correctly as 0?

thanks again,
