Re: [coreutils] added ability in sort to skip n number of lines for each
From: Pádraig Brady
Subject: Re: [coreutils] added ability in sort to skip n number of lines for each file
Date: Tue, 23 Nov 2010 16:21:07 +0000
User-agent: Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.1.8) Gecko/20100227 Thunderbird/3.0.3
On 23/11/10 15:57, Jim Hester wrote:
> Below I have an updated proper patch, it is quite a bit larger than my
> first, but should address all of the concerns from Assaf and Pádraig.
>
> My main motivation here is not just to make this common operation less
> annoying, it was mostly for increased performance. I made a test
> dataset of 10 files with 3 header lines each and 500,000 lines to sort,
> then ran sort by using head and tail as Pádraig suggests, and then again
> using my implemented header skip on an 8 core machine. Larger files
> seem to show similar speed up as well. I believe this speedup comes
> from the fact that the multithreaded sort is trying to read from the
> buffer faster than tail can write to the buffer.
>
>>time { (head -q -n 3 test[0-9] | head -n 3; tail -q -n+4 test[0-9] |
> ./sort -n ) > out2; }
>
> real 0m51.660s
> user 2m0.324s
> sys 0m4.115s
>
>>time ./sort -n -l 3 test[0-9] > out
>
> real 0m31.834s
> user 2m17.775s
> sys 0m3.981s
>>diff out out2
The user time for the head; tail | sort pipeline
is lower than for `sort -l`, which suggests that
the first invocation was mostly waiting on disk.
Could you please repeat the test using precached data?
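
A precached rerun could look roughly like the sketch below. It is only an illustration: the tiny generated files stand in for the real test[0-9] inputs, and it uses the stock `sort` rather than the patched `./sort -l`. The point is the `cat ... > /dev/null` step, which reads every input once so that later runs are served from the page cache instead of the disk.

```shell
#!/bin/sh
# Sketch only: small stand-in inputs named like the benchmark files.
set -e
dir=$(mktemp -d)
cd "$dir"
for i in 0 1 2 3 4 5 6 7 8 9; do
  # 3 header lines, then 10 numeric lines per file
  { printf 'h1\nh2\nh3\n'; seq "$((i + 1))" 10 100; } > "test$i"
done

# Warm the page cache: read every input once and discard the data.
cat test[0-9] > /dev/null

# Header from the first file, then the numerically sorted bodies.
run_pipeline() {
  { head -q -n 3 test[0-9] | head -n 3
    tail -q -n +4 test[0-9] | sort -n; } > out2
}
run_pipeline
```

With the real data, each timed command would be run after the `cat` step, ideally more than once, so that both variants are measured against cached input.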
Currently the threads in `sort` are passed data that is read
sequentially from the input files. Otherwise `sort`
would have to start worrying about device ids,
/sys/block/<blockdev>/queue/rotational, etc.,
so as not to thrash disk heads. That kind of
logic is probably always best outside of `sort`.
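
As a rough illustration of what such outside logic would need to look at, the sketch below lists the rotational flag for each block device on a Linux system. The function name `list_rotational` is made up for this example, and mapping an input file to its block device is deliberately left out.

```shell
#!/bin/sh
# Sketch only: print "<blockdev> <flag>" for each block device,
# where flag is 1 for rotational media and 0 otherwise.
# Linux-specific; prints nothing if /sys/block is unavailable.
list_rotational() {
  for f in /sys/block/*/queue/rotational; do
    [ -r "$f" ] || continue        # skip the unexpanded glob or unreadable entries
    dev=${f#/sys/block/}           # strip the leading path
    dev=${dev%%/*}                 # keep only the device name
    printf '%s %s\n' "$dev" "$(cat "$f")"
  done
}
list_rotational
```

A wrapper script could use this kind of check to decide whether reading several inputs in parallel is safe before handing the data to `sort`.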
cheers,
Pádraig.