Re: performance bug of `wc -m`
From: Assaf Gordon
Subject: Re: performance bug of `wc -m`
Date: Sun, 13 May 2018 18:21:09 -0600
User-agent: NeoMutt/20170113 (1.7.2)
Hello,
On Sun, May 13, 2018 at 11:05:11PM +0100, Philip Rowlands wrote:
> On Sun, 13 May 2018, at 02:55, Peng Yu wrote:
> > The following example shows that `wc -m` is even slower than the
> > equivalent Python code. Can this performance bug be fixed?
>
> I can reproduce the slow wc behaviour with UTF-8 enabled locales.
As this thread expands, it is important to be as precise as
possible about what is observed, and in which environment.
> $ seq 1000000 | time -p wc -c
> $ seq 1000000 | time -p wc -m
> $ seq 1000000 | LANG=C time -p wc -m
[...]
> In the slow case, wc is spending most of its time in iswprint / wcwidth /
> iswspace.
So far we have observed the following when using GNU coreutils' wc:
1. running "wc -m" in a multibyte locale will always be slower than "wc -c".
2. running "wc -c" should take more or less the same time as "LC_ALL=C wc -m".
3. Under GNU/Linux in a multibyte locale, "wc -m" is faster than the attached
Python script (wcm.py).
What Peng Yu reported is that on Mac OS X with a multibyte locale
the Python script is faster than GNU's "wc -m".
I currently do not have access to a Mac OS X machine.
Testing on FreeBSD (which should be similar enough),
I still cannot reproduce this issue (i.e. I find GNU's "wc" is faster
than "wcm.py" in all circumstances).
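The attached wcm.py is not reproduced in this message. For anyone who wants to repeat the comparison without it, a minimal stand-in might look like the sketch below. It assumes the script simply decodes its input as UTF-8 and counts code points; the names count_chars and open_text, the chunk size, and the choice to replace invalid bytes are all assumptions, not taken from the original attachment.

```python
import io
import sys


def count_chars(stream, chunk_size=1 << 16):
    """Count decoded characters (code points) in a text stream."""
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(chunk)
    return total


def open_text(name):
    """Open a file (or stdin, for "-") as UTF-8 text.

    newline="" keeps "\r\n" as two characters instead of translating it.
    Invalid bytes are replaced rather than rejected (an assumption).
    """
    raw = sys.stdin.buffer if name == "-" else open(name, "rb")
    return io.TextIOWrapper(raw, encoding="utf-8", errors="replace", newline="")


if __name__ == "__main__":
    # Hypothetical usage: seq 1000000 | python3 wcm_sketch.py -
    for name in sys.argv[1:]:
        with open_text(name) as stream:
            print(count_chars(stream), name)
```

Timing this against "wc -m" on the same input and in the same locale gives one data point per environment; it says nothing about how the original wcm.py was written.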
Phil,
When you write "slow", do you mean that "wc -m" was slower than running
the Python script, or slower than "wc -c"?
If the Python script, can you provide more information about your environment
(OS, Python version, wc --version, locale)?
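One way to gather those four details in a single step is sketched below. The function name describe_environment and the dictionary keys are hypothetical; the sketch assumes "wc" is on the PATH and treats a failing "wc --version" (as a non-GNU wc would produce) as "unavailable".

```python
import locale
import platform
import subprocess
import sys


def describe_environment():
    """Collect OS, Python version, active locale and wc version as strings."""
    try:
        # Adopt the locale from the environment, as wc itself does.
        locale.setlocale(locale.LC_ALL, "")
    except locale.Error:
        pass  # fall back to whatever locale is already active
    # With no second argument, setlocale() only reports the current setting.
    ctype = locale.setlocale(locale.LC_CTYPE)

    wc_version = "unavailable"
    try:
        proc = subprocess.run(
            ["wc", "--version"],
            stdin=subprocess.DEVNULL,
            capture_output=True,
            text=True,
            errors="replace",
        )
        if proc.returncode == 0 and proc.stdout.strip():
            wc_version = proc.stdout.splitlines()[0]
    except OSError:
        pass  # wc not found or not executable

    return {
        "os": platform.platform(),
        "python": sys.version.split()[0],
        "locale": ctype,
        "wc": wc_version,
    }


if __name__ == "__main__":
    for key, value in describe_environment().items():
        print(f"{key}: {value}")
```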
> Perhaps wc could learn a faster method of counting utf-8
> (https://stackoverflow.com/a/7298149); this may be worthwhile as the trend to
> utf-8 everywhere marches on.
There are many UTF-8-specific optimizations, and gnulib has many
of them implemented. But using the standard POSIX
multibyte and wide-character functions (e.g. iswprint/wcwidth) ensures
that 'wc' works not only in UTF-8 but in all multibyte locales.
There is always the possibility of adding yet more code for UTF-8-specific
inputs; there are pros and cons to that approach.
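To make the trade-off concrete, here is a minimal sketch of the kind of UTF-8-specific shortcut being discussed: in valid UTF-8 every character has exactly one byte that is not a continuation byte (continuation bytes have the bit pattern 10xxxxxx), so counting the non-continuation bytes counts the characters. This is an illustration only, not code from wc or gnulib; the function name count_utf8_chars is hypothetical, and the input is assumed to be valid UTF-8.

```python
def count_utf8_chars(data: bytes) -> int:
    """Count characters in valid UTF-8 by skipping continuation bytes.

    A continuation byte has its top two bits set to 10, i.e.
    (byte & 0xC0) == 0x80. Every other byte starts a new character.
    """
    return sum(1 for byte in data if (byte & 0xC0) != 0x80)


if __name__ == "__main__":
    sample = "héllo€".encode("utf-8")
    # Compare the shortcut with a full decode of the same bytes.
    print(count_utf8_chars(sample), len(sample.decode("utf-8")))
```

The shortcut needs no locale machinery at all, which is where the speed comes from, but it is only meaningful for UTF-8: in other multibyte encodings (Shift_JIS or EUC-JP, say) the same bit test does not mark character boundaries, and malformed input is not detected.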
regards,
- assaf