coreutils

Re: performance bug of `wc -m`


From: Assaf Gordon
Subject: Re: performance bug of `wc -m`
Date: Sun, 13 May 2018 18:21:09 -0600
User-agent: NeoMutt/20170113 (1.7.2)

Hello,

On Sun, May 13, 2018 at 11:05:11PM +0100, Philip Rowlands wrote:
> On Sun, 13 May 2018, at 02:55, Peng Yu wrote:
> > The following example shows that `wc -m` is even slower than the
> > equivalent Python code. Can this performance bug be fixed?
> 
> I can reproduce the slow wc behaviour with UTF-8 enabled locales.

As this thread expands, it is important to be as precise as
possible as to what is observed, and in which environments.

> $ seq 1000000 | time -p wc -c
> $ seq 1000000 | time -p wc -m
> $ seq 1000000 | LANG=C time -p wc -m
[...]
> In the slow case, wc is spending most of its time in iswprint / wcwidth / 
> iswspace. 

So far we have observed the following when using GNU coreutils' wc:
1. Running "wc -m" in a multibyte locale will always be slower than "wc -c".
2. Running "wc -c" should take more or less the same time as "LC_ALL=C wc -m".
3. Under GNU/Linux in a multibyte locale, "wc -m" is faster than the attached
Python script (wcm.py).
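
For readers without the attachment: wcm.py is not reproduced in this
message. As a rough stand-in only, and assuming the script simply decodes
its input as UTF-8 and counts the resulting characters, something like the
following could be used for comparisons. The name count_chars, the chunk
size, and the choice of errors="replace" are my own assumptions, not taken
from the original script, and invalid input may be counted differently
than by "wc -m".

```python
import codecs
import io


def count_chars(stream, encoding="utf-8", chunk_size=1 << 16):
    """Count decoded characters in a binary stream, reading it in chunks.

    Hypothetical stand-in for wcm.py. An incremental decoder is used so
    that a multibyte character split across two chunks is counted once.
    """
    decoder = codecs.getincrementaldecoder(encoding)(errors="replace")
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(decoder.decode(chunk))
    # Flush any incomplete trailing sequence held by the decoder.
    total += len(decoder.decode(b"", final=True))
    return total


# Small in-memory demonstration: 6 characters encoded as 7 bytes.
demo = io.BytesIO("héllo\n".encode("utf-8"))
print(count_chars(demo))  # prints 6
```

To time it against "wc -m", the function can be called with
sys.stdin.buffer and fed the same "seq 1000000" input as above.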

What Peng Yu reported is that on Mac OS X, in a multibyte locale,
the Python script is faster than GNU's "wc -m".

I currently do not have access to a Mac OS X machine.

Testing on FreeBSD (which should be similar enough),
I still cannot reproduce this issue (i.e. I find GNU's "wc" is faster
than "wcm.py" in all circumstances).

Phil,
When you write "slow", do you mean that "wc -m" was slower than running
the Python script, or slower than "wc -c"?
If the Python script, can you provide more information about your environment
(OS, Python version, wc --version, locale)?

> Perhaps wc could learn a faster method of counting utf-8 
> (https://stackoverflow.com/a/7298149); this may be worthwhile as the trend to 
> utf-8 
> everywhere marches on.

There are many UTF-8-specific optimizations, and gnulib has many
of them implemented. But using the POSIX standard multibyte and
wide-character functions (e.g. iswprint/wcwidth) ensures 'wc' works not
only in UTF-8 but in all multibyte locales.

There is always a possibility of adding yet more code for UTF-8-specific
inputs - there are pros and cons to that approach.
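
To make the trade-off concrete, here is a minimal sketch of the kind of
UTF-8-only shortcut being discussed: count every byte that is not a
continuation byte (10xxxxxx). This is an illustration in Python, not code
from coreutils or gnulib, and the names CONTINUATION_BYTES and
utf8_char_count are made up for this example. It assumes the input is
valid UTF-8, and it gives wrong answers in any other multibyte encoding,
which is exactly the "con" of such an approach.

```python
# Bytes of the form 10xxxxxx (0x80..0xBF) are UTF-8 continuation bytes.
CONTINUATION_BYTES = bytes(range(0x80, 0xC0))


def utf8_char_count(data: bytes) -> int:
    """Count characters in valid UTF-8 data without decoding it.

    Each character contributes exactly one non-continuation byte, so
    deleting all continuation bytes leaves one byte per character.
    """
    return len(data.translate(None, CONTINUATION_BYTES))


sample = "naïve café\n".encode("utf-8")
print(len(sample), utf8_char_count(sample))  # prints 13 11
```

No per-character classification (iswprint, wcwidth, iswspace) happens
here, which is where the speed would come from, and also why the result
is only meaningful when the input is known to be UTF-8.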

regards,
 - assaf


