[Top][All Lists]

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Shuf reservoir sampling

From: Jyothis V
Subject: Re: Shuf reservoir sampling
Date: Sat, 28 Dec 2013 22:43:56 +0530

Hi, thanks for the reply. I understand why something like reservoir sampling is 
needed. But in shuf.c, shuffling is done in two steps: 1) using reservoir 
sampling, an array of length head_length is obtained. At this stage, the array 
is not completely shuffled because the first head_length elements were copied 
verbatim. 2) This array is then randomly permuted while writing. My question is 
whether these two steps could be clubbed together, just as shown in the second 
algorithm given in the wikipedia page you mentioned.


On Sat, 28 Dec 2013 13:49:53 +0100
Bernhard Voelker <address@hidden> wrote:

> On 12/28/2013 10:53 AM, Jyothis V wrote:
> > I would like to know why shuf.c is using reservoir sampling +
> > write_permuted_output_reservoir rather than just using an
> > inside-out version Fisher-Yates shuffle.
> Reservoir sampling has been added to shuf in the youngest coreutils
> release 8.22 via this commit:
> The NEWS file tells us why this is more efficient than before:
>   shuf outputs subsets of large inputs much more efficiently.
>   Reservoir sampling is used to limit memory usage based on the number of
>   outputs, rather than the number of inputs.
> This advantage is also mentioned on Wikipedia:
>   [...]
>   Note that although the rest of the cards are shuffled, only the top k are
>   important in the present context. Therefore, the array a need only track
>   the cards in the top k positions while performing the shuffle, reducing
>   the amount of memory needed. Truncating a to length k, the algorithm is
>   modified accordingly:
>   [...]
> Have a nice day,
> Berny

Jyothis V <address@hidden>

reply via email to

[Prev in Thread] Current Thread [Next in Thread]