Re: performance bug of `wc -m`


From: Assaf Gordon
Subject: Re: performance bug of `wc -m`
Date: Sun, 13 May 2018 00:18:36 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0

Hello,

On 12/05/18 07:55 PM, Peng Yu wrote:
> The following example shows that `wc -m` is even slower than the
> equivalent Python code. Can this performance bug be fixed?

I'm unable to reproduce the performance issue,
and suspect other issues are at play.

First:
> import sys
> l = 0
> for line in sys.stdin:
>     l += len(line.rstrip('\n').decode('utf-8'))
> print l

This code is not identical to "wc -m" - it does not count the newlines
as characters. Example:

  $ seq 10 | wc -m
  21
  $ seq 10 | ./wcm.py
  11

> $ time ./wcm.py < 1.txt
> 6796930
> $ time wc -m < 1.txt
> 6796930

The fact that you are getting the exact same results indicates that your
input file (1.txt) does not have newlines at all:

  $ seq 10 | tr -d '\n' | ./wcm.py
  11
  $ seq 10 | tr -d '\n' | wc -m
  11
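
To make the difference concrete without depending on any particular file, here is a minimal Python 3 sketch (not the Python 2 script from the report). The helper names `seq_bytes`, `count_like_wc_m` and `count_like_wcm_py` are made up for illustration, and the input is assumed to be valid UTF-8:

```python
import io


def seq_bytes(n):
    """Emulate the output of `seq n`: the integers 1..n, one per line."""
    return "".join(f"{i}\n" for i in range(1, n + 1)).encode("utf-8")


def count_like_wc_m(data):
    """Count every decoded character, newlines included."""
    return len(data.decode("utf-8"))


def count_like_wcm_py(data):
    """Mirror the quoted script: strip the newline from each line, then count."""
    total = 0
    for line in io.BytesIO(data):  # iterates line by line, splitting on b"\n"
        total += len(line.rstrip(b"\n").decode("utf-8"))
    return total


with_newlines = seq_bytes(10)
without_newlines = with_newlines.replace(b"\n", b"")

print(count_like_wc_m(with_newlines), count_like_wcm_py(with_newlines))        # 21 11
print(count_like_wc_m(without_newlines), count_like_wcm_py(without_newlines))  # 11 11
```

The two counts agree only when the input contains no newlines at all.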


Second:
I suspect the OS's file caching plays a big role in the skewed results.
It would be better to clear the cache and then time it:

  $ seq 1000000 | tr -d '\n' > 1.txt
  $ ls -lhog 1.txt
  -rw-r--r-- 1 5.7M May 13 00:05 1.txt

  $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
  $ time wc -m < 1.txt
  5888896

  real    0m0.136s
  user    0m0.104s
  sys     0m0.004s

versus:

  $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
  $ time ./wcm.py < 1.txt
  5888896

  real    0m0.215s
  user    0m0.040s
  sys     0m0.012s

In my measurements Python is roughly 1.6 times slower in wall-clock time
(0.215s versus 0.136s, for input with no newlines).
But the file is so small (5.7MB) that measurements can vary a lot.
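
Because single runs on such a small input are noisy, one option is to repeat the measurement and look at the spread. The following is only a rough Python 3 sketch: `time_repeated` and `count_chars` are hypothetical names, the workload is an in-memory stand-in of about 5.9 MB of ASCII digits, and it times in-process decoding only (it does not drop the page cache or measure process startup):

```python
import statistics
import time


def time_repeated(func, repeats=5):
    """Call func() `repeats` times; return the wall-clock durations in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func()
        durations.append(time.perf_counter() - start)
    return durations


def count_chars():
    # Stand-in workload (an assumption): decode ~5.9 MB of ASCII digits.
    data = b"0123456789" * 588_890
    return len(data.decode("utf-8"))


durations = time_repeated(count_chars)
print(f"min {min(durations):.4f}s  median {statistics.median(durations):.4f}s")
```

Comparing the minimum and the median across runs gives a better idea of how much of a difference is just noise.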


Third:
If the file does have newlines (as is more common in typical text
files), then Python becomes almost an order of magnitude slower:

  $ seq 1000000 > 2.txt
  $ ls -lhog 2.txt
  -rw-r--r-- 1 6.6M May 13 00:08 2.txt

  $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
  $ time wc -m < 2.txt
  6888896

  real    0m0.158s
  user    0m0.132s
  sys     0m0.000s

  $ sudo sh -c "sync; echo 3 > /proc/sys/vm/drop_caches"
  $ time ./wcm.py < 2.txt
  5888896

  real    0m1.260s
  user    0m1.104s
  sys     0m0.016s
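
One plausible explanation (an assumption, I did not profile it) is the per-line overhead of the loop: one `rstrip` and one `decode` call for each of the million lines. A sketch of a chunk-based alternative in Python 3 that avoids per-line work and, like `wc -m`, counts newlines too. The function name `count_chars_chunked` and the chunk size are arbitrary choices for illustration, and the input is assumed to be valid UTF-8:

```python
import codecs
import io


def count_chars_chunked(stream, chunk_size=1 << 16):
    """Count UTF-8 characters in a binary stream, reading fixed-size chunks."""
    # The incremental decoder buffers a multi-byte sequence that is split
    # across two chunks, so the chunk boundaries do not affect the count.
    decoder = codecs.getincrementaldecoder("utf-8")()
    total = 0
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:
            break
        total += len(decoder.decode(chunk))
    # Flush the decoder; raises UnicodeDecodeError on a truncated sequence.
    total += len(decoder.decode(b"", final=True))
    return total


# Same bytes as `seq 10` would produce.
sample = "".join(f"{i}\n" for i in range(1, 11)).encode("utf-8")
print(count_chars_chunked(io.BytesIO(sample)))  # 21
```

To read from standard input, pass `sys.stdin.buffer` instead of the `io.BytesIO` object. I have not timed this against `wc -m`.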



Fourth:
Unless you are certain your input files are valid,
using python2 + utf8 is very fragile. For example:

  $ printf '\xEEabc\n' | ./wcm.py
  Traceback (most recent call last):
    File "./wcm.py", line 5, in <module>
      l += len(line.rstrip('\n').decode('utf-8'))
    File "/usr/lib/python2.7/encodings/utf_8.py", line 16, in decode
      return codecs.utf_8_decode(input, errors, True)
  UnicodeDecodeError: 'utf8' codec can't decode byte 0xee in position 0:
  invalid continuation byte

While 'wc -m' will continue and not crash:

  $ printf '\xEEabc\n' | wc -m
  4
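
If a script has to survive such input, the decoder can be told how to handle bad bytes. A minimal Python 3 sketch using the standard `errors` argument of `bytes.decode`; note that agreeing with `wc -m` on this one input does not mean the two agree on every kind of invalid input:

```python
data = b"\xeeabc\n"

# Strict decoding (the default) fails, as in the traceback above.
try:
    data.decode("utf-8")
except UnicodeDecodeError as exc:
    print("strict decoding failed:", exc.reason)

# 'ignore' drops the undecodable byte; 'replace' substitutes U+FFFD for it.
ignored = data.decode("utf-8", errors="ignore")
replaced = data.decode("utf-8", errors="replace")

print(len(ignored))   # 4
print(len(replaced))  # 5
```

With `errors="ignore"` the count for this input is 4, the same as the `wc -m` result above; with `errors="replace"` the bad byte is counted as one character.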



I hope this resolves the issue.
If you still think this is a bug, please provide more details
and a reproducible example.

regards,
 - assaf


