coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] shuf: use reservoir-sampling when possible


From: Pádraig Brady
Subject: Re: [PATCH] shuf: use reservoir-sampling when possible
Date: Mon, 25 Mar 2013 03:45:11 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2

On 03/11/2013 09:10 PM, Assaf Gordon wrote:
> Hello,
> 
> Pádraig Brady wrote, On 03/07/2013 06:26 PM:
>> On 03/07/2013 07:32 PM, Assaf Gordon wrote:
>>> Pádraig Brady wrote, On 03/06/2013 08:24 PM:
>>>> On 03/06/2013 11:50 PM, Assaf Gordon wrote:
>>>>> Attached is a suggestion to implement reservoir-sampling in shuf:
>>>>> When the expected output of lines is known, it will not load the entire 
>>>>> file into memory - allowing shuffling very large inputs.
>>>>>
>>>
>>>
>>>>> static size_t
>>>>> read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline, 
>>>>> size_t k,
>>>>>                                struct randint_source *s)
>>> <...>
>>>>>   struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init 
>>>>> reservoir*/
>>>>
>>>> Since this change is mainly about efficient mem usage we should probably 
>>>> handle
>>>> the case where we have small inputs but large k.  This will allocate (and 
>>>> zero)
>>>> memory up front. The zeroing will defeat any memory overcommit configured 
>>>> on the
>>>> system, but it's probably better to avoid the large initial commit and 
>>>> realloc
>>>> as required (not per line, but per 1K lines maybe).
>>>>
>>>
> 
> Attached is an updated version (mostly a re-write of the memory allocation 
> part), as per the comment above.
> Also includes a very_expensive valgrind test to exercise the new code.

I've attached 9 patches to adjust things a bit.

There are a couple of interesting ones.

One is to handle various overflows correctly,
and to support UINTMAX_MAX input lines rather than SIZE_MAX
(which is smaller on 32 bit systems).

Another is to not enable reservoir sampling for files < 8 MiB,
as otherwise you'd incur CPU overhead as shown here:

# Create some input files
$ for p in $(seq 7); do src/seq $((10**$p)) > 10p$p.in; done

$ for p in $(seq 7); do time shuf-old -n10 10p$p.in >/dev/null; done

real    0m0.003s
real    0m0.003s
real    0m0.003s
real    0m0.004s
real    0m0.013s
real    0m0.053s
real    0m1.993s

$ for p in $(seq 7); do time shuf-new -n10 10p$p.in >/dev/null; done

real    0m0.006s
real    0m0.004s
real    0m0.004s
real    0m0.005s
real    0m0.019s
real    0m0.082s
real    0m0.859s

So the above shows that reservoir sampling introduces CPU overhead
for <= 1 million lines of input, but once mem usage is significantly
reduced at 10 million lines, we get CPU/cache benefits there too.

Drilling in a bit to see the CPU overhead introduced for smaller inputs
(10p6.in in this case)...

$ for p in 1 10 12 15 20; do time shuf-old 10p6.in -n $((2**$p)) > /dev/null; 
done

real    0m0.048s
real    0m0.044s
real    0m0.055s
real    0m0.137s
real    0m0.246s

$ for p in 1 10 12 15 20; do time shuf-new 10p6.in -n $((2**$p)) > /dev/null; 
done

real    0m0.096s
real    0m0.084s
real    0m0.088s
real    0m0.103s
real    0m0.509s

So we can see nearly a doubling of the runtime.
I suppose we might be more dynamic here and not bother with stating files,
instead switching to reservoir only when we reach 1 million lines or so.
That would have the benefit of appropriately handling all inputs including 
pipes?
If it's simple enough I'll do that tomorrow instead.

I'll squash all these to a single commit, and merge tomorrow hopefully.

> (and the other patch is the uniform-distribution randomness test).

I'll keep this separate for the moment.

thanks,
Pádraig,

Attachment: shuf-reservoir-fixes.patch
Description: Text Data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]