coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Shuf reservoir sampling


From: Bernhard Voelker
Subject: Re: Shuf reservoir sampling
Date: Sat, 28 Dec 2013 18:18:27 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:24.0) Gecko/20100101 Thunderbird/24.2.0

Hi Jyothis,

On 12/28/2013 03:36 PM, Jyothis V wrote:
> On Sat, 28 Dec 2013 13:49:53 +0100
> Bernhard Voelker <address@hidden> wrote:
> 
>> On 12/28/2013 10:53 AM, Jyothis V wrote:
>>> I would like to know why shuf.c is using reservoir sampling +
>>> write_permuted_output_reservoir rather than just using an
>>> inside-out version Fisher-Yates shuffle.
>>
>> Reservoir sampling has been added to shuf in the youngest coreutils
>> release 8.22 via this commit:
>>
>>   http://git.sv.gnu.org/cgit/coreutils.git/commit/?id=20d7bce0

CCing Assaf who is the author of that commit.

> Hi, thanks for the reply. I understand why something like reservoir
> sampling is needed. But in shuf.c, shuffling is done in two steps:
> 1) using reservoir sampling, an array of length head_length is obtained.
> At this stage, the array is not completely shuffled because the
> first head_length elements were copied verbatim. 2) This array is
> then randomly permuted while writing. My question is whether these
> two steps could be clubbed together, just as shown in the second
> algorithm given in the wikipedia page you mentioned.

I didn't have a look into the Maths behind yet, nor was I involved
during that last improvement. Further improvement is maybe possible,
and the best way to push this is providing code. Are you willing
to propose such a patch?

Thanks & have a nice day,
Berny



reply via email to

[Prev in Thread] Current Thread [Next in Thread]