[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[PATCH] shuf: use reservoir-sampling when possible
From: |
Assaf Gordon |
Subject: |
[PATCH] shuf: use reservoir-sampling when possible |
Date: |
Wed, 06 Mar 2013 18:50:30 -0500 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:10.0.4) Gecko/20120510 Icedove/10.0.4 |
Hello,
Attached is a suggestion to implement reservoir-sampling in shuf:
When the expected output of lines is known, it will not load the entire file
into memory - allowing shuffling very large inputs.
I've seen this mentioned once:
http://lists.gnu.org/archive/html/coreutils/2012-11/msg00079.html
but no follow-up discussion.
There is no change in the usage of shuf (barring unexpected bugs...).
Example (with debug messages):
===
$ seq 10000 | ./src/shuf ---debug -n 5
--reservoir_sampling--
filling reservoir, input line 1 of 5: '1'
filling reservoir, input line 2 of 5: '2'
filling reservoir, input line 3 of 5: '3'
filling reservoir, input line 4 of 5: '4'
filling reservoir, input line 5 of 5: '5'
Replacing reservoir sample 4 with line 7 '7'
Replacing reservoir sample 4 with line 8 '8'
Replacing reservoir sample 3 with line 9 '9'
Replacing reservoir sample 2 with line 10 '10'
Replacing reservoir sample 4 with line 11 '11'
Replacing reservoir sample 4 with line 16 '16'
Replacing reservoir sample 4 with line 17 '17'
Replacing reservoir sample 4 with line 20 '20'
Replacing reservoir sample 2 with line 22 '22'
Replacing reservoir sample 0 with line 31 '31'
Replacing reservoir sample 1 with line 52 '52'
Replacing reservoir sample 4 with line 55 '55'
Replacing reservoir sample 3 with line 61 '61'
Replacing reservoir sample 4 with line 76 '76'
Replacing reservoir sample 2 with line 169 '169'
Replacing reservoir sample 2 with line 187 '187'
Replacing reservoir sample 0 with line 216 '216'
Replacing reservoir sample 1 with line 340 '340'
Replacing reservoir sample 4 with line 431 '431'
Replacing reservoir sample 1 with line 524 '524'
Replacing reservoir sample 2 with line 942 '942'
Replacing reservoir sample 1 with line 1096 '1096'
Replacing reservoir sample 2 with line 1627 '1627'
Replacing reservoir sample 4 with line 1763 '1763'
Replacing reservoir sample 2 with line 2679 '2679'
Replacing reservoir sample 3 with line 4382 '4382'
Replacing reservoir sample 2 with line 4439 '4439'
Replacing reservoir sample 3 with line 7748 '7748'
Replacing reservoir sample 2 with line 9902 '9902'
-- reservoir lines (begin)--
216
1096
9902
7748
1763
-- reservoir lines (end)--
216
1763
7748
1096
9902
===
The last 5 lines are the final output (the rest is STDERR debug messages).
After the input is read completely, the lines are still re-permuted (using the
existing shuf code), to accommodate cases like:
===
$ seq 6 | ./src/shuf ---debug -n 5
--reservoir_sampling--
filling reservoir, input line 1 of 5: '1'
filling reservoir, input line 2 of 5: '2'
filling reservoir, input line 3 of 5: '3'
filling reservoir, input line 4 of 5: '4'
filling reservoir, input line 5 of 5: '5'
Replacing reservoir sample 2 with line 6 '6'
-- reservoir lines (begin)--
1
2
6
4
5
-- reservoir lines (end)--
4
2
1
6
5
===
Comments are welcomed,
-gordon
0001-shuf-use-reservoir-sampling-when-possible.patch
Description: Text Data
- [PATCH] shuf: use reservoir-sampling when possible,
Assaf Gordon <=
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/06
- Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/07
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/07
- Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/11
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/25
- Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/25
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/25