Re: horrible utf-8 performance in wc

From: Bo Borgerson
Subject: Re: horrible utf-8 performance in wc
Date: Wed, 07 May 2008 18:29:22 -0400
User-agent: Thunderbird (X11/20080227)

Pádraig Brady wrote:
> In the first 65535 code points there are also 404 chars which are
> not classed as combining in the unicode database, but are classed
> as zero width in the glibc locale data at least (zero-width space
> being one of them like you mentioned). I determined this with the
> attached progs:
> ./zw | python unidata.py | grep " 0 " | wc -l

Hi Pádraig,

Wow, I knew there were some stand-alone zero-width characters, but I had
no idea there were so many!

I poked around a little in gnulib and found a function for determining
the combining class of a Unicode character.

I think the attached patch does what you were intending to do, and it
also counts all of the stand-alone zero-width characters you found:

$ ./zw | python unidata.py | grep " 0 " | perl packu.pl | src/wc -m

$ src/wc -m 2char
2 2char
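
For anyone who wants to try the idea without rebuilding coreutils, here is a rough sketch in Python. It is not the patch itself, only an approximation of its counting rule: every code point is counted unless its canonical combining class is nonzero. The function name count_chars is made up for illustration, and the sketch ignores the printable-character check that surrounds the patched code in wc.c. It relies on unicodedata.combining() from the Python standard library, whose data may differ from gnulib's depending on the Unicode version.

```python
import unicodedata


def count_chars(text: str) -> int:
    """Count code points, skipping those with a nonzero combining class.

    Hypothetical helper: approximates the rule in the patch, not wc itself.
    """
    return sum(1 for ch in text if unicodedata.combining(ch) == 0)


# "e" followed by U+0301 COMBINING ACUTE ACCENT (combining class 230):
# two code points, but only the base letter is counted.
print(count_chars("e\u0301"))  # 1

# U+200B ZERO WIDTH SPACE has combining class 0, so it is still counted
# even though it occupies no columns.
print(count_chars("a\u200bb"))  # 3
```

The second call shows why the stand-alone zero-width characters still show up in the count: they have zero width but are not combining characters, so a test on the combining class leaves them alone.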

Please note that this requires a re-run of `./bootstrap', since it needs
to bring some extra stuff in from gnulib.

Hope that helps.

diff --git a/bootstrap.conf b/bootstrap.conf
index 8bde0ad..ef5a328 100644
--- a/bootstrap.conf
+++ b/bootstrap.conf
@@ -82,6 +82,7 @@ gnulib_modules="
        strpbrk strtoimax strtoumax strverscmp sys_stat timespec tzset
+       unictype/combining-class
        unicodeio unistd-safer unlink-busy unlinkdir unlocked-io
diff --git a/src/wc.c b/src/wc.c
index 61ab485..ed6630c 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -32,6 +32,8 @@
 #include "readtokens0.h"
 #include "safe-read.h"
+#include "unictype.h"
 #if !defined iswspace && !HAVE_ISWSPACE
 # define iswspace(wc) \
     ((wc) == to_uchar (wc) && isspace (to_uchar (wc)))
@@ -368,6 +370,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
                            linepos += width;
                          if (iswspace (wide_char))
                            goto mb_word_separator;
+                         else if (uc_combining_class (wide_char) != 0)
+                           chars--; /* don't count combining chars */
                          in_word = true;

Attachment: packu.pl
Description: Perl program

