Re: performance bug of `wc -m`
From: Assaf Gordon
Subject: Re: performance bug of `wc -m`
Date: Sun, 13 May 2018 00:18:36 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0
Hello,
On 12/05/18 07:55 PM, Peng Yu wrote:
> The following example shows that `wc -m` is even slower than the
> equivalent Python code. Can this performance bug be fixed?
I'm unable to reproduce the performance issue,
and suspect other issues are at play.
First:
import sys
l = 0
for line in sys.stdin:
    l += len(line.rstrip('\n').decode('utf-8'))
print l
This code is not identical to "wc -m" - it does not count the newlines
as characters. Example:
$ seq 10 | wc -m
21
$ seq 10 | ./wcm.py
11
> $ time ./wcm.py < 1.txt
> 6796930
> $ time wc -m < 1.txt
> 6796930
The fact that you are getting the exact same results indicates that your
input file (1.txt) does not have newlines at all:
$ seq 10 | tr -d '\n' | ./wcm.py
11
$ seq 10 | tr -d '\n' | wc -m
11
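
The difference between the two counts can also be reproduced in a few lines. This is only a sketch in Python 3 (the script in this thread is Python 2), and the two function names are made up for the illustration:

```python
def count_like_script(data: bytes) -> int:
    """Mimic the script above: drop the newline of each line, then count."""
    # Splitting on b"\n" discards every newline, which has the same effect
    # as calling rstrip('\n') on each line before counting.
    return sum(len(part.decode("utf-8")) for part in data.split(b"\n"))


def count_like_wc(data: bytes) -> int:
    """Count every decoded character, newlines included."""
    return len(data.decode("utf-8"))


# Same bytes as the output of `seq 10`.
seq10 = "".join("%d\n" % i for i in range(1, 11)).encode("utf-8")
print(count_like_script(seq10))  # 11
print(count_like_wc(seq10))      # 21

# Same bytes as `seq 10 | tr -d '\n'`: with no newlines the counts agree.
flat = seq10.replace(b"\n", b"")
print(count_like_script(flat), count_like_wc(flat))  # 11 11
```

The two functions only agree when the input has no newlines, which is the situation described above for 1.txt.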
Second:
I suspect the OS's file caching plays a big role in the skewed results.
It would be better to clear the cache and then time it:
$ seq 1000000 | tr -d '\n' > 1.txt
$ ls -lhog 1.txt
-rw-r--r-- 1 5.7M May 13 00:05 1.txt
$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
$ time wc -m < 1.txt
5888896
real 0m0.136s
user 0m0.104s
sys 0m0.004s
versus:
$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
$ time ./wcm.py < 1.txt
5888896
real 0m0.215s
user 0m0.040s
sys 0m0.012s
In my measurements python is roughly 1.6 times slower in wall-clock time
(0.215s versus 0.136s, for input with no newlines).
But the file is so small (5.7MB) that measurements can vary a lot.
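
Since single runs on a small file are noisy, repeating each measurement and looking at the spread can help. The following is a rough Python 3 sketch, not something I used for the numbers above; `time_command` is a hypothetical helper, and it does not drop the page cache, so every run after the first reads from a warm cache:

```python
import statistics
import subprocess
import sys
import time


def time_command(argv, input_path=None, repeats=5):
    """Run argv `repeats` times; return wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        # Feed the file on stdin, like `cmd < file`; otherwise use an empty stdin.
        stdin = open(input_path, "rb") if input_path else subprocess.DEVNULL
        try:
            start = time.perf_counter()
            subprocess.run(argv, stdin=stdin, stdout=subprocess.DEVNULL, check=True)
            durations.append(time.perf_counter() - start)
        finally:
            if input_path:
                stdin.close()
    return durations


# Demo with a trivial command so the sketch runs anywhere; to compare the
# programs from this thread, substitute ["wc", "-m"] and input_path="1.txt".
runs = time_command([sys.executable, "-c", "pass"], repeats=5)
print("min %.3fs  median %.3fs  max %.3fs"
      % (min(runs), statistics.median(runs), max(runs)))
```

Comparing the minimum or median of several runs is usually more informative than a single `time` invocation.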
Third:
If the file does have newlines (as is more common in typical text
files), then python becomes almost an order of magnitude slower
(1.260s versus 0.158s below):
$ seq 1000000 > 2.txt
$ ls -lhog 2.txt
-rw-r--r-- 1 6.6M May 13 00:08 2.txt
$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
$ time wc -m < 2.txt
6888896
real 0m0.158s
user 0m0.132s
sys 0m0.000s
$ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
$ time ./wcm.py < 2.txt
5888896
real 0m1.260s
user 0m1.104s
sys 0m0.016s
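
A likely (but here unmeasured) reason is that the script does a strip, a decode and a length call for each of the one million lines. A Python 3 sketch that avoids the per-line work by decoding fixed-size chunks is shown below; `count_chars_chunked` is a made-up name, the chunk size is arbitrary, and I have not timed it against `wc -m`:

```python
import codecs
import io


def count_chars_chunked(stream, chunk_size=1 << 16):
    """Count UTF-8 characters (newlines included) in a binary stream."""
    # The incremental decoder buffers a multibyte sequence that is split
    # across two chunks, so chunk boundaries do not corrupt the count.
    decoder = codecs.getincrementaldecoder("utf-8")()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(decoder.decode(chunk))
    # Flush; raises UnicodeDecodeError if the input ends mid-sequence.
    total += len(decoder.decode(b"", final=True))
    return total


print(count_chars_chunked(io.BytesIO(b"1\n2\n3\n")))  # 6
```

To read standard input as the original script does, pass `sys.stdin.buffer` as the stream.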
Fourth:
Unless you are certain your input files are valid UTF-8,
using python2 + utf8 is very fragile. For example:
$ printf '\xEEabc\n' | ./wcm.py
Traceback (most recent call last):
  File "./wcm.py", line 5, in <module>
    l += len(line.rstrip('\n').decode('utf-8'))
  File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
    return codecs.utf_8_decode(input, errors, True)
UnicodeDecodeError: 'utf8' codec can't decode byte 0xee in position 0: invalid continuation byte
'wc -m', by contrast, continues and does not crash:
$ printf '\xEEabc\n' | wc -m
4
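
If a Python counter has to survive such input, one option is to decode with a lenient error handler. This is a Python 3 sketch with a made-up function name; `errors="ignore"` drops undecodable bytes, which gives the same count as 'wc -m' for this particular input, but I have not checked that the two agree on every invalid sequence:

```python
def count_chars_tolerant(data: bytes) -> int:
    """Count characters, silently skipping bytes that are not valid UTF-8."""
    return len(data.decode("utf-8", errors="ignore"))


sample = b"\xEEabc\n"  # same bytes as printf '\xEEabc\n'

print(count_chars_tolerant(sample))  # 4

# The default strict handler fails on the same bytes, as in the traceback.
try:
    sample.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding fails:", exc.reason)
```

Using `errors="replace"` instead would count each invalid byte as one replacement character, so the totals would differ.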
I hope this resolves the issue.
If you still think this is a bug, please provide more details
and a reproducible example.
regards,
- assaf