[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: new coreutil? shuffle - randomize file contents
From: |
Davis Houlton |
Subject: |
Re: new coreutil? shuffle - randomize file contents |
Date: |
Fri, 3 Jun 2005 06:27:18 +0000 |
User-agent: |
KMail/1.7.2 |
> One possibility for an efficient random permutation of a large file is
> a program which scans the file once and records the location of each
> line in an index, then applies a uniformly distributed random
> permutation to this line index by stepping through it in order and
> swapping the current entry with an entry chosen at random from the set
> of later entries, and finally steps through the index again and
> dereferences each line entry and prints out the corresponding line in
> the original file.
Interesting points Frederik! Glossing over some trivial special case details
and taking a few syntax liberties, a 'back-of-napkin' style shuffle algorithm
supporting N sized data sets might look like....
struct index
index is a structure that allows us to fseek and read a given segement of data
from the given file into memory
file
start
len
calc_lines(temp_file_num): Lets assume indexes are stored not in memory, but
in temp files 0...X, and each temp file has a maximum amount of index
structures INDEX_MAX. This function, given a temp file counter, calculates
the number of indexes in all previous temp files. I'm using this here
primarily as a convention to simplify the psuedo-code below.
random(data_size): Makes data_size/RAND_MAX calls to rand to make sure our
random number generator covers the entire data set.
then the shuffle routine:
while input_size>=0
random_line=random(input_size)
//Figure out which temp files have our data
input_size_file=input_size/INDEX_MAX
random_line_file=random_line/INDEX_MAX
//Figure out how many actual indexes we need to skip forward
//For example, if we can store 10 indexes per temp file, and we
//want index 18, we'd only skip over 7 lines in file 1.
input_size_pos =
sizeof(struct index) *
(input_size-calc_lines(input_size_file-1))
random_line_pos=
sizeof(struct index)*
(random_line-calc_lines(random_line_file-1))
//Grab last index
open file input_size_file
seek forward input_size_pos
read in struct index last_line_index
close input_size_file
//Grab random line index, and replace
//With last index
open file random_line_file
seek forward random_line_pos
read in struct index random_line_index
seek backward sizeof(struct index)
write last_line_index
close random_line_file
//Write out random line
open file index.source
seek index.begin
print (read in index.len bytes)
close file index.source
input_size--
The primary advantage above is only two passes are made; one to create the
indexes, and one to randomize the output. Even for very large files, I don't
know what the end efficiency is (I suspect not much), but I guess every
little bit helps.
Obviously, there are some gross simplifications...for example, most numbers
would be composed of two longs; I'm assuming we'll have to deal with line
counts > than LONG_MAX, so each line position would be composed of one long
for a page offset (for files > LONG_MAX in data-set size) and one for the
actual offset (when on the current page). That may be overkill, and if it is,
using just longs would greatly simplify implementation. Even then, we can't
handle infinitely large files--is that a true requirement?
Thanks,
Davis
- Re: new coreutil? shuffle - randomize file contents, (continued)
- Re: new coreutil? shuffle - randomize file contents, Philip Rowlands, 2005/06/02
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Paul Eggert, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Paul Eggert, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents, Paul Eggert, 2005/06/04
- Re: new coreutil? shuffle - randomize file contents, David Feuer, 2005/06/04
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/04
- Re: new coreutil? shuffle - randomize file contents, James Youngman, 2005/06/03
- Re: new coreutil? shuffle - randomize file contents,
Davis Houlton <=
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/04
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/05
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/05
- Re: new coreutil? shuffle - randomize file contents, Frederik Eaton, 2005/06/06
- Re: new coreutil? shuffle - randomize file contents, Jim Meyering, 2005/06/07
Re: new coreutil? shuffle - randomize file contents, Jim Meyering, 2005/06/02