coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

shuf implementation - should/could it be "reservoir sampling "


From: Cook, Malcolm
Subject: shuf implementation - should/could it be "reservoir sampling "
Date: Tue, 27 Nov 2012 10:04:47 -0600

There is a discussion at http://news.ycombinator.com/item?id=4833546 comparing 
shuf with dimsum 
(http://blog.noblemail.ca/2012/11/how-to-get-random-lines-out-of-file-or.html)

dimsum uses reservoir sampling so it should have a constant memory footprint 
proportional to -n (though due to a bug with fix in progress, it currently does 
not)

I note that shuf does not have a constant memory footprint.   This can be 
observed, i.e. , with

yes | shuf -n 5

 and running top to watch the memory grow seeming boundlessly.

1)  what is the underlying algorithm ??
2) should/could it be changed to use "reservoir sampling" to obtain benefits of 
constant memory footprint?

Thanks for your consideration!

Malcolm Cook
Computational Biology - Stowers Institute for Medical Research





reply via email to

[Prev in Thread] Current Thread [Next in Thread]