[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [PATCH] shuf: use reservoir-sampling when possible
From: |
Pádraig Brady |
Subject: |
Re: [PATCH] shuf: use reservoir-sampling when possible |
Date: |
Thu, 07 Mar 2013 23:26:03 +0000 |
User-agent: |
Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2 |
On 03/07/2013 07:32 PM, Assaf Gordon wrote:
> Hello,
>
> Attached is an updated version.
>
> Pádraig Brady wrote, On 03/06/2013 08:24 PM:
>> On 03/06/2013 11:50 PM, Assaf Gordon wrote:
>>> Attached is a suggestion to implement reservoir-sampling in shuf:
>>> When the expected output of lines is known, it will not load the entire
>>> file into memory - allowing shuffling very large inputs.
>>>
>
> Regarding comments:
>
>>> {"-debug", no_argument, NULL, DEV_DEBUG_OPTION},
>> no need to keep this, for final commit.
>
> Yes, I'll remove this once the code is acceptable.
>
>
>>> prepare_shuf_lines (struct linebuffer *in_lines, size_t n, char
>>> ***out_lines,
>>
>> I've not looked into the details, but it would
>> be nice to avoid the memcpy/conversion here
>>
>
> I've removed the conversion function, and instead added a new function to
> output the lines directly.
>
>>> static size_t
>>> read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline,
>>> size_t k,
>>> struct randint_source *s)
> <...>
>>> struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init
>>> reservoir*/
>>
>> Since this change is mainly about efficient mem usage we should probably
>> handle
>> the case where we have small inputs but large k. This will allocate (and
>> zero)
>> memory up front. The zeroing will defeat any memory overcommit configured on
>> the
>> system, but it's probably better to avoid the large initial commit and
>> realloc
>> as required (not per line, but per 1K lines maybe).
>>
>
> I'm not quite sure about this:
> The "reservoir-sampling" path can only be used when the user explicitly ask
> to limit output lines.
> I would naively assume that if a user explicitly asked to limit the output to
> 1,000,000 lines, he/she expects large input as well.
> And so the (edge?) case of asking for a large number of output lines, but
> supplying very small number of input lines is rare.
> Wouldn't you agree? or is there a different typical usage case?
>
> Also, the allocation only allocates an array of "struct linebuffer" (on 64bit
> systems, 24 bytes).
> So even asking for 1M lines will allocate 24MB of RAM - not "too much" on
> modern machines.
This is a fair point, but we've got a class of
bug reports recently where we shouldn't fail with small inputs,
which is also valid. One could argue that we should
fail up front as we could never scale up to the requested limit.
On the other hand the input might only scale up as memory
becomes available in future, in which case it would be
better to fail only when we need memory and it's not available.
It shouldn't be too hard to reallocate the array as required.
>
> ----
>
> The second attached patch is experimental - it tries to assess the randomness
> of 'shuf' output by running it 1,000 times and checking if the output is
> (very roughly) uniformly distributed.
> I don't know if there were attempts in the past to unit-test randomness (and
> then decided not to do it) - or if this was just never considered worth-while
> (or too error prone).
Cool, I was considering testing with rngtest or something,
so it'll be good to have something independent.
thanks,
Pádraig.