--- Begin Message ---
Subject: |
Re: bug#7182: sort -R slow |
Date: |
Sun, 07 Aug 2011 22:42:52 +0200 |
Davide Brini wrote:
> On Sat, 9 Oct 2010 14:52:41 +0200 Ole Tange <address@hidden> wrote:
>
>> I recently needed to randomize some lines. So I tried using 'sort -R'.
>> I was astonished how slow that was. So I tested how slow a competing
>> strategies are. GNU sort is two magnitudes slower than unsort and more
>> than one magnitude slower than perl:
>>
>> $ time unsort file
>> real 0m1.388s
>>
>> $ unsort --version
>> unsort 1.1.2
>>
>> $ time perl -e 'print sort { rand() <=> rand() } <>' file
>> real 0m6.621s
>>
>> $ time sort -R file
>> real 4m8.403s
>>
>> $ sort --version
>> sort (GNU coreutils) 8.5
>>
>> What is even scarier: sort without -R is faster than sort -R:
>>
>> $ time sort file
>> real 0m53.553s
>>
>> I would expect sort -R to be faster than sort and faster than Perl if
>> not as fast as unsort.
>
> On my system, locale settings seem to impact the runtime significantly:
>
> $ wc -l bigfile
> 1000000 bigfile
>
> $ time LC_ALL=en_US.utf8 sort -R bigfile > /dev/null
>
> real 1m29.302s
> user 1m21.009s
> sys 0m0.155s
>
> $ time LC_ALL=C sort -R bigfile > /dev/null
>
> real 0m38.881s
> user 0m35.276s
> sys 0m0.118s
>
>
> However, shuf is much faster, and seems mostly unaffected by the locale
> used:
>
> $ time shuf bigfile > /dev/null
>
> real 0m1.044s
> user 0m0.833s
> sys 0m0.042s
Thanks for the report.
I think the performance of sort -R will often be worse
than that of shuf (by design, since it accesses each byte of each line
once more, to compute the hash), except when the input size is larger
than available memory.
The info documentation for sort -R does refer to "shuf".
Any suggestions for improvements are welcome.
I'm closing this.
You're welcome to reopen or file a new report.
--- End Message ---