Re: [PATCH] shuf: use reservoir-sampling when possible
From: Pádraig Brady
Subject: Re: [PATCH] shuf: use reservoir-sampling when possible
Date: Mon, 25 Mar 2013 03:45:11 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130110 Thunderbird/17.0.2
On 03/11/2013 09:10 PM, Assaf Gordon wrote:
> Hello,
>
> Pádraig Brady wrote, On 03/07/2013 06:26 PM:
>> On 03/07/2013 07:32 PM, Assaf Gordon wrote:
>>> Pádraig Brady wrote, On 03/06/2013 08:24 PM:
>>>> On 03/06/2013 11:50 PM, Assaf Gordon wrote:
>>>>> Attached is a suggestion to implement reservoir sampling in shuf:
>>>>> when the expected number of output lines is known, it does not load
>>>>> the entire file into memory, allowing very large inputs to be shuffled.
>>>>>
>>>
>>>
>>>>> static size_t
>>>>> read_input_reservoir_sampling (FILE *in, char eolbyte, char ***pline,
>>>>>                                size_t k, struct randint_source *s)
>>> <...>
>>>>> struct linebuffer *rsrv = XCALLOC (k, struct linebuffer); /* init reservoir */
>>>>
>>>> Since this change is mainly about efficient mem usage we should
>>>> probably handle the case where we have small inputs but large k.
>>>> This will allocate (and zero) memory up front. The zeroing will
>>>> defeat any memory overcommit configured on the system, but it's
>>>> probably better to avoid the large initial commit and realloc as
>>>> required (not per line, but per 1K lines maybe).
>>>>
>>>
>
> Attached is an updated version (mostly a re-write of the memory allocation
> part), as per the comment above.
> Also includes a very_expensive valgrind test to exercise the new code.
I've attached 9 patches to adjust things a bit.
There are a couple of interesting ones.
One is to handle various overflows correctly,
and to support UINTMAX_MAX input lines rather than SIZE_MAX
(which is smaller on 32-bit systems).
Another is to not enable reservoir sampling for files < 8 MiB,
as otherwise you'd incur CPU overhead as shown here:
# Create some input files
$ for p in $(seq 7); do src/seq $((10**$p)) > 10p$p.in; done
$ for p in $(seq 7); do time shuf-old -n10 10p$p.in >/dev/null; done
real 0m0.003s
real 0m0.003s
real 0m0.003s
real 0m0.004s
real 0m0.013s
real 0m0.053s
real 0m1.993s
$ for p in $(seq 7); do time shuf-new -n10 10p$p.in >/dev/null; done
real 0m0.006s
real 0m0.004s
real 0m0.004s
real 0m0.005s
real 0m0.019s
real 0m0.082s
real 0m0.859s
So the above shows that reservoir sampling introduces CPU overhead
for <= 1 million lines of input, but once mem usage is significantly
reduced at 10 million lines, we get CPU/cache benefits there too.
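The size-based cutoff could be as simple as the following sketch. The 8 MiB
constant comes from the timings above; the helper name and the treatment of
non-regular files (assume a pipe may be large) are assumptions for
illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <sys/stat.h>

/* Inputs at least this large get reservoir sampling (from the timings).  */
enum { RESERVOIR_MIN_INPUT = 8 * 1024 * 1024 };   /* 8 MiB */

static bool
use_reservoir_sampling (FILE *in)
{
  struct stat st;
  if (fstat (fileno (in), &st) != 0 || !S_ISREG (st.st_mode))
    return true;   /* size unknown (e.g. a pipe): assume it may be large */
  return st.st_size >= RESERVOIR_MIN_INPUT;
}
```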
Drilling in a bit to see the CPU overhead introduced for smaller inputs
(10p6.in in this case)...
$ for p in 1 10 12 15 20; do time shuf-old 10p6.in -n $((2**$p)) >/dev/null; done
real 0m0.048s
real 0m0.044s
real 0m0.055s
real 0m0.137s
real 0m0.246s
$ for p in 1 10 12 15 20; do time shuf-new 10p6.in -n $((2**$p)) >/dev/null; done
real 0m0.096s
real 0m0.084s
real 0m0.088s
real 0m0.103s
real 0m0.509s
So we can see nearly a doubling of the runtime.
I suppose we might be more dynamic here and not bother with stat()ing
files, instead switching to reservoir sampling only when we reach
1 million lines or so. That would have the benefit of appropriately
handling all inputs, including pipes.
If it's simple enough I'll do that tomorrow instead.
I'll squash all these to a single commit, and merge tomorrow hopefully.
> (and the other patch is the uniform-distribution randomness test).
I'll keep this separate for the moment.
thanks,
Pádraig
shuf-reservoir-fixes.patch
Description: Text Data