[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
shuf implementation - should/could it be "reservoir sampling "
From: |
Cook, Malcolm |
Subject: |
shuf implementation - should/could it be "reservoir sampling " |
Date: |
Tue, 27 Nov 2012 10:04:47 -0600 |
There is a discussion at http://news.ycombinator.com/item?id=4833546 comparing
shuf with dimsum
(http://blog.noblemail.ca/2012/11/how-to-get-random-lines-out-of-file-or.html)
dimsum uses reservoir sampling so it should have a constant memory footprint
proportional to -n (though due to a bug with fix in progress, it currently does
not)
I note that shuf does not have a constant memory footprint. This can be
observed, i.e. , with
yes | shuf -n 5
and running top to watch the memory grow seeming boundlessly.
1) what is the underlying algorithm ??
2) should/could it be changed to use "reservoir sampling" to obtain benefits of
constant memory footprint?
Thanks for your consideration!
Malcolm Cook
Computational Biology - Stowers Institute for Medical Research
[Prev in Thread] |
Current Thread |
[Next in Thread] |
- shuf implementation - should/could it be "reservoir sampling ",
Cook, Malcolm <=