bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

sort --random-sort


From: Frederik Eaton
Subject: sort --random-sort
Date: Sat, 26 Nov 2005 07:03:31 +0000
User-agent: mutt-ng/devel-r472 (Debian)

Hi all,

This is my second (or third? or fourth?) attempt at a patch to sort to
introduce shuffling behavior. However, I encourage people to take a
good look at it because it's much more "polished" than the others. 
I've included everything I think should be included - documentation,
ChangeLog entries, etc., so hopefully there won't have to be many more
changes.

I'm sorry it's taken me so long to return to this - I've moved
countries, been occupied with grad school, etc.

There are two patches:

- shred-isaac-split.patch
    splits ISAAC code out of shred.c into new files isaac.c, isaac.h

- sort-with-isaac.patch
    adds --sort-random etc. to sort.c, using ISAAC RNG.

They should be applied in this order.

Notes:

- The randomization is "cryptographically secure" due to use of ISAAC,
as has been requested by Paul Eggert. Earlier versions used a separate
random number implementation but it was pointed out that shred.c
already contained the necessary code (based on a library from Bob
Jenkins called ISAAC), so the main change in this version has been to
separate out our version of ISAAC from shred.c and use it instead. 
Ultimately I hope this can be moved to glibc, so that making things
cryptographically secure doesn't require any extra effort, but that's
a separate issue.

- The second patch changes a constant ISAAC_LOG in isaac.h from 8 to
3. The code explicitly says that this can be done, so, going by that
I'm assuming that the change is safe. Theoretically this could make
"shred" slower but I haven't noticed any effect, if there is one it is
surely less than 1%. The reason for the change is that the code I
added to 'sort' has to copy the ISAAC state twice with every
comparison, so when the ISAAC state is too large it slows things down
a bit. The change speeds "sort -R" up by a factor of 14 on a 1e5-line
test file.

$ wc -l /usr/share/dict/words
96274 /usr/share/dict/words
$ time ./inst-log-3/bin/sort -R < /usr/share/dict/words >/dev/null
./inst-log-3/bin/sort -R < /usr/share/dict/words > /dev/null  1.00s user 0.01s 
system 99% cpu 1.016 total
$ time ./inst-log-8/bin/sort -R < /usr/share/dict/words >/dev/null
./inst-log-8/bin/sort -R < /usr/share/dict/words > /dev/null  13.98s user 0.01s 
system 99% cpu 14.010 total

Now the speed of random-sort is comparable to normal sort, it is only
about 50% slower (in my earlier patch it was about 5 times slower).

$ time ./inst-log-3/bin/sort < /usr/share/dict/words >/dev/null
./inst-log-3/bin/sort < /usr/share/dict/words > /dev/null  0.68s user 0.01s 
system 99% cpu 0.695 total

- I recall some argument about randomness vs. pseudorandomness. I do
not mention pseudorandomness in the documentation, because I maintain
that it is not part of the interface. It is not mentioned in the
"shred" documentation either (only the word "random" is used), as I
believe is proper.

- I'm using "--random-sort" as the long option name, rather than
"--random" as in the previous versions, to be consistent with
"--numeric-sort", "--month-sort", etc.

Thanks to everyone who participated in the discussions regarding this
feature, and who critiqued the earlier versions.

Frederik

Attachment: shred-isaac-split.patch
Description: Text document

Attachment: sort-with-isaac.patch
Description: Text document


reply via email to

[Prev in Thread] Current Thread [Next in Thread]