Re: horrible utf-8 performace in wc

From: Pádraig Brady
Subject: Re: horrible utf-8 performace in wc
Date: Wed, 7 May 2008 12:11:34 +0100
User-agent: Thunderbird (X11/20071008)

Jan Engelhardt wrote:
> https://bugzilla.novell.com/show_bug.cgi?id=381873
> Forwarding this because it is a GNU issue, not specifically a Novell one.
> I reproduced this myself with the latest coreutils from git
> (BTW: You might want to repack that repo, "counting objects" during the
> clone was rather slow in the initial counting.)
> Could it be a libiconv problem?

Accounting for multibyte characters is what's taking the time:

~/git/coreutils/src$ time ./wc -m long_lines.txt
13357046 long_lines.txt
real    0m1.860s

~/git/coreutils/src$ time ./wc -c long_lines.txt
13538735 long_lines.txt
real    0m0.002s

Now that is a _lot_ of extra time. libiconv could probably be
made more efficient, though I've never actually looked at it.
However, wc calls mbrtowc() once for each multibyte character.
It would probably be a lot more efficient to use mbstowcs()
to convert the whole read buffer in one call.

Note that mbstowcs() stops at the first NUL, so one would
need to find any embedded NULs first and iterate over each
substring, as I did in my previously mentioned version of uniq.
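
A minimal sketch of that buffer-at-a-time idea, not the actual wc code:
split the buffer at embedded NULs and let mbstowcs() count each NUL-free
substring in one call. The name count_chars_in_buffer is hypothetical.
The caller is assumed to have stored a NUL sentinel at buf[len], and a
multibyte sequence split across two read buffers is not handled here.
Passing a NULL destination to mbstowcs() to obtain only the count is a
POSIX extension to ISO C.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: count the characters in buf[0..len-1] using the
   current locale's encoding.  Assumes buf[len] == '\0' (a sentinel the
   caller must provide).  Each embedded NUL is counted as one character.
   Returns -1 if an invalid multibyte sequence is found.  */
long long
count_chars_in_buffer (const char *buf, size_t len)
{
  const char *p = buf;
  const char *end = buf + len;
  long long total = 0;

  assert (buf != NULL);
  assert (buf[len] == '\0');    /* the sentinel keeps strlen in bounds */

  while (p < end)
    {
      size_t seg_len = strlen (p);          /* bytes up to the next NUL */
      size_t n = mbstowcs (NULL, p, 0);     /* count only, store nothing */

      if (n == (size_t) -1)
        return -1;                          /* invalid multibyte sequence */

      total += (long long) n;
      p += seg_len;

      if (p < end)
        {
          total++;                          /* the embedded NUL itself */
          p++;
        }
    }

  return total;
}
```

After setlocale (LC_ALL, ""), a caller would read a chunk, set
buf[nread] = '\0', and add count_chars_in_buffer (buf, nread) to the
running total.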

Also, mbstowcs() doesn't canonicalize equivalent multibyte sequences,
so in this regard it behaves the same as our current
processing of each wide character separately.
This could actually be considered a bug, i.e. should -m give
the number of wide chars, or the number of multibyte chars?
With the attached patch, `wc -m` gives 23 chars for both these lines.

canonically équivalent
canonically équivalent
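
As a standalone illustration of what the attached patch aims for (a
sketch only, not the wc.c code path): convert one character at a time
with mbrtowc() and leave printable characters whose wcwidth() is 0 out
of the count, so a base letter followed by a combining accent counts
once. The name count_chars_skip_zero_width is hypothetical, and the
result depends on the locale selected with setlocale().

```c
#ifndef _XOPEN_SOURCE
#define _XOPEN_SOURCE 700       /* for wcwidth() */
#endif
#include <assert.h>
#include <stddef.h>
#include <string.h>
#include <wchar.h>
#include <wctype.h>

/* Hypothetical helper: count the characters in buf[0..len-1], skipping
   printable characters of display width 0 (e.g. combining accents).
   Returns -1 on an invalid or truncated multibyte sequence.  */
long
count_chars_skip_zero_width (const char *buf, size_t len)
{
  mbstate_t state;
  size_t pos = 0;
  long chars = 0;

  assert (buf != NULL || len == 0);
  memset (&state, 0, sizeof state);     /* initial conversion state */

  while (pos < len)
    {
      wchar_t wc;
      size_t n = mbrtowc (&wc, buf + pos, len - pos, &state);

      if (n == (size_t) -1 || n == (size_t) -2)
        return -1;
      if (n == 0)
        n = 1;                          /* an embedded NUL is one byte */

      /* Count everything except printable zero-width characters.  */
      if (!(iswprint ((wint_t) wc) && wcwidth (wc) == 0))
        chars++;

      pos += n;
    }

  return chars;
}
```

In a UTF-8 locale this is intended to give the same count for the
precomposed and the decomposed spelling of a line, newline included.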


P.S. I notice that gnome-terminal still doesn't handle
combining characters correctly, and my mail client, Thunderbird,
is putting the accent on the q rather than the e, sigh.

diff --git a/src/wc.c b/src/wc.c
index 61ab485..f7f7109 100644
--- a/src/wc.c
+++ b/src/wc.c
@@ -368,6 +368,8 @@ wc (int fd, char const *file_x, struct fstatus *fstatus)
                            linepos += width;
                          if (iswspace (wide_char))
                            goto mb_word_separator;
+                         else if (width == 0)
+                           chars--; /* don't count combining chars */
                          in_word = true;
