[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] shuf: use reservoir-sampling when possible
From: |
Assaf Gordon |
Subject: |
Re: [PATCH] shuf: use reservoir-sampling when possible |
Date: |
Thu, 07 Mar 2013 14:32:25 -0500 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:10.0.4) Gecko/20120510 Icedove/10.0.4 |
Hello,
Attached is an updated version.
Pádraig Brady wrote, On 03/06/2013 08:24 PM:
> On 03/06/2013 11:50 PM, Assaf Gordon wrote:
>> Attached is a suggestion to implement reservoir-sampling in shuf:
>> When the expected output of lines is known, it will not load the entire file
>> into memory - allowing shuffling very large inputs.
>>
Regarding comments:
>> {"-debug", no_argument, NULL, DEV_DEBUG_OPTION},
> no need to keep this, for final commit.
Yes, I'll remove this once the code is acceptable.
>> prepare_shuf_lines (struct linebuffer *in_lines, size_t n, char ***out_lines,
>
> I've not looked into the details, but it would
> be nice to avoid the memcpy/conversion here
>
I've removed the conversion function, and instead added a new function to
output the lines directly.
>> static size_t
>> read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, size_t
>> k,
>> struct randint_source *s)
<...>
>> struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init
>> reservoir*/
>
> Since this change is mainly about efficient mem usage we should probably
> handle
> the case where we have small inputs but large k. This will allocate (and
> zero)
> memory up front. The zeroing will defeat any memory overcommit configured on
> the
> system, but it's probably better to avoid the large initial commit and realloc
> as required (not per line, but per 1K lines maybe).
>
I'm not quite sure about this:
The "reservoir-sampling" path can only be used when the user explicitly ask to
limit output lines.
I would naively assume that if a user explicitly asked to limit the output to
1,000,000 lines, he/she expects large input as well.
And so the (edge?) case of asking for a large number of output lines, but
supplying very small number of input lines is rare.
Wouldn't you agree? or is there a different typical usage case?
Also, the allocation only allocates an array of "struct linebuffer" (on 64bit
systems, 24 bytes).
So even asking for 1M lines will allocate 24MB of RAM - not "too much" on
modern machines.
----
The second attached patch is experimental - it tries to assess the randomness
of 'shuf' output by running it 1,000 times and checking if the output is (very
roughly) uniformly distributed.
I don't know if there were attempts in the past to unit-test randomness (and
then decided not to do it) - or if this was just never considered worth-while
(or too error prone).
Comments are welcomed,
-gordon
0001-shuf-add-expensive-test-for-randomness.patch
Description: Text Data
0002-shuf-use-reservoir-sampling-when-possible.patch
Description: Text Data
- [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/06
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/06
- Re: [PATCH] shuf: use reservoir-sampling when possible,
Assaf Gordon <=
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/07
- Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/11
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/25
- Re: [PATCH] shuf: use reservoir-sampling when possible, Assaf Gordon, 2013/03/25
- Re: [PATCH] shuf: use reservoir-sampling when possible, Pádraig Brady, 2013/03/25