coreutils

Re: performance bug of `wc -m`


From: Philip Rowlands
Subject: Re: performance bug of `wc -m`
Date: Sun, 13 May 2018 23:05:11 +0100

On Sun, 13 May 2018, at 02:55, Peng Yu wrote:
> Hi,
> 
> The following example shows that `wc -m` is even slower than the
> equivalent Python code. Can this performance bug be fixed?

I can reproduce the slow wc behaviour with UTF-8 enabled locales.

$ echo $LANG
en_GB.UTF-8

$ seq 1000000 | time -p wc -c
6888896
real 0.05
user 0.00
sys 0.02

$ seq 1000000 | time -p wc -m
6888896
real 0.60
user 0.58
sys 0.00

$ seq 1000000 | LANG=C time -p wc -m
6888896
real 0.05
user 0.00
sys 0.02

In the slow case, wc spends most of its time in iswprint / wcwidth / 
iswspace. Perhaps wc could learn a faster method of counting UTF-8 
characters (https://stackoverflow.com/a/7298149); this may be worthwhile as 
the trend towards UTF-8 everywhere marches on.
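
For illustration, here is a minimal sketch of one well-known fast approach: count 
every byte that is not a UTF-8 continuation byte (10xxxxxx). The function name 
count_utf8_code_points is made up for this example and is not part of wc. The 
sketch assumes the input is valid UTF-8 and does no validation, so it is not a 
drop-in replacement for what wc -m has to handle.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper, not taken from coreutils.
   Counts UTF-8 code points in buf[0..len) by counting every byte that is
   NOT a continuation byte. Continuation bytes have the bit pattern
   10xxxxxx, i.e. (byte & 0xC0) == 0x80.
   Assumption: the input is valid UTF-8; malformed sequences are not
   detected and are simply counted by their lead-looking bytes. */
size_t count_utf8_code_points(const char *buf, size_t len)
{
    assert(buf != NULL || len == 0);

    size_t count = 0;
    for (size_t i = 0; i < len; i++) {
        unsigned char b = (unsigned char)buf[i];
        if ((b & 0xC0) != 0x80)
            count++;            /* ASCII byte or lead byte of a sequence */
    }
    return count;
}
```

Because this needs only a mask and a compare per byte, with no per-character 
calls into the wide-character classification functions, a loop like this is 
also easy for a compiler to vectorise.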

I can't explain without more digging why Python's string decode('utf-8') is 
better optimised for length calculations.

Cheers,
Phil
