Re: performance bug of `wc -m`
From: Peng Yu
Subject: Re: performance bug of `wc -m`
Date: Sun, 13 May 2018 09:05:47 -0400
I am on a Mac, not on Linux. On Linux, I can confirm that `wc -m` is much
faster than `wcm.py`.
Here is the output on the Mac.
$ seq 1000000 > num.txt
$ time wc -m < num.txt
6888896
real 0m2.751s
user 0m2.622s
sys 0m0.042s
$ time ./wcm.py < num.txt
6888896
real 0m1.401s
user 0m1.234s
sys 0m0.051s
$ cat wcm.py
#!/usr/bin/env python
# vim: set noexpandtab tabstop=2 shiftwidth=2 softtabstop=-1 fileencoding=utf-8:
import sys
l = 0
for line in sys.stdin:
	l += len(line.decode('utf-8'))
print l
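
For comparison on Python 3, the same count can be sketched as below. This is
a minimal sketch, not the script timed above: the function name `count_chars`,
the 64 KiB chunk size, and the `errors="ignore"` policy are assumptions made
for illustration. It reads bytes in fixed-size chunks and feeds them to an
incremental UTF-8 decoder, so a multi-byte character split across two chunks
is still counted once, and newlines are counted as characters.

```python
import codecs


def count_chars(stream, chunk_size=1 << 16, errors="ignore"):
    """Count decoded UTF-8 characters in a binary stream, newlines included.

    `stream` is any object with a bytes-returning read(n) method.
    `errors` is passed to the decoder: "ignore" drops undecodable bytes,
    "strict" raises UnicodeDecodeError on the first one.
    """
    decoder = codecs.getincrementaldecoder("utf-8")(errors=errors)
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        # The decoder buffers an incomplete trailing sequence until the
        # next chunk arrives, so chunk boundaries do not affect the count.
        total += len(decoder.decode(chunk))
    # Flush whatever is still buffered at end of input.
    total += len(decoder.decode(b"", final=True))
    return total
```

Reading from standard input would look like `count_chars(sys.stdin.buffer)`
after `import sys`. With `errors="ignore"` an undecodable byte is dropped
instead of raising, so the bytes `\xEEabc\n` count as 4 characters; passing
`errors="strict"` raises `UnicodeDecodeError` on that input.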
On Sun, May 13, 2018 at 2:18 AM, Assaf Gordon <address@hidden> wrote:
> Hello,
>
> On 12/05/18 07:55 PM, Peng Yu wrote:
>>
>> The following example shows that `wc -m` is even slower than the
>> equivalent Python code. Can this performance bug be fixed?
>
>
> I'm unable to reproduce the performance issue,
> and suspect other issues are at play.
>
> First:
>>
>> import sys
>> l = 0
>> for line in sys.stdin:
>> 	l += len(line.rstrip('\n').decode('utf-8'))
>> print l
>
>
> This code is not identical to "wc -m" - it does not count the newlines
> as characters. Example:
>
> $ seq 10 | wc -m
> 21
> $ seq 10 | ./wcm.py
> 11
>
>> $ time ./wcm.py < 1.txt
>> 6786930
>> $ time wc -m < 1.txt
>> 6796930
>
>
> The fact that you are getting the exact same results indicates that your
> input file (1.txt) does not have newlines at all:
>
> $ seq 10 | tr -d '\n' | ./wcm.py
> 11
> $ seq 10 | tr -d '\n' | wc -m
> 11
>
>
> Second:
> I suspect the OS's file caching plays a big role in the skewed results.
> It would be better to clear the cache and then time it:
>
> $ seq 1000000 | tr -d '\n' > 1.txt
> $ ls -lhog 1.txt
> -rw-r--r-- 1 5.7M May 13 00:05 1.txt
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time wc -m < 1.txt
> 5888896
>
> real 0m0.136s
> user 0m0.104s
> sys 0m0.004s
>
> versus:
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time ./wcm.py < 1.txt
> 5888896
>
> real 0m0.215s
> user 0m0.040s
> sys 0m0.012s
>
> In my measurements Python is roughly 1.6 times slower in wall-clock time
> (0.215s vs. 0.136s real) for input with no newlines.
> But the file is so small (5.7MB) that measurements can vary a lot.
>
>
> Third:
> If the file does have newlines (as is more common in typical text
> files), then Python becomes almost an order of magnitude slower:
>
> $ seq 1000000 > 2.txt
> $ ls -lhog 2.txt
> -rw-r--r-- 1 6.6M May 13 00:08 2.txt
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time wc -m < 2.txt
> 6888896
>
> real 0m0.158s
> user 0m0.132s
> sys 0m0.000s
>
> $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
> $ time ./wcm.py < 2.txt
> 5888896
>
> real 0m1.260s
> user 0m1.104s
> sys 0m0.016s
>
>
>
> Fourth:
> Unless you are certain your input files are valid UTF-8,
> using python2 + utf8 is very fragile. For example:
>
> $ printf '\xEEabc\n' | ./wcm.py
> Traceback (most recent call last):
>   File "./wcm.py", line 5, in <module>
>     l += len(line.rstrip('\n').decode('utf-8'))
>   File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
>     return codecs.utf_8_decode(input, errors, True)
> UnicodeDecodeError: 'utf8' codec can't decode byte 0xee in position 0: invalid continuation byte
>
> While 'wc -m' will continue and not crash:
>
> $ printf '\xEEabc\n' | wc -m
> 4
>
>
>
> I hope this resolves the issue.
> If you still think this is a bug, please provide more details
> and a reproducible example.
>
> regards,
> - assaf
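
The character counts quoted in this thread can be checked with simple
arithmetic: `seq N` prints each integer from 1 to N followed by a newline, so
the total with newlines exceeds the digits-only total by exactly N. A small
sketch in Python 3 (the helper name `seq_char_counts` is hypothetical):

```python
def seq_char_counts(n):
    """Return (chars_with_newlines, chars_without_newlines) for `seq n`."""
    # Every integer 1..n contributes its decimal digits.
    digits = sum(len(str(i)) for i in range(1, n + 1))
    # `seq` terminates each of the n lines with one newline character.
    return digits + n, digits


print(seq_char_counts(10))       # (21, 11)
print(seq_char_counts(1000000))  # (6888896, 5888896)
```

The first pair matches the `seq 10` example (21 with newlines, 11 without),
and the second matches the 6888896 and 5888896 figures reported for the
one-million-line files.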
--
Regards,
Peng