bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: new coreutil? shuffle - randomize file contents


From: Frederik Eaton
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Thu, 2 Jun 2005 14:16:10 -0700
User-agent: Mutt/1.5.9i

James> I think the consensus is that the functionality belongs in "sort".
James> Beyond that things are a bit less clear.  However, Paul put forward a
James> proposed usage which adapts the current -k option (see
James> http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00179.html).
James> Nobody made any comments suggesting that that wasn't a good way of
James> doing things.

I liked my proposal for co-opting "-s":

http://lists.gnu.org/archive/html/bug-coreutils/2005-05/msg00220.html

In other words, rather than having a "random" column, you'd have a
"row number" column, which would turn into a random column under the
hash operation. The documentation for '-s' would be changed to suggest
that it was enabling a comparison based on input row number.

James> Last time this was discussed on this list there was some talk about
James> selecting permutations with a seed.  This makes me uncomfortable since
James> though doubtless it would be convenient.  The problem is that the
James> number of permutations of the records of most files is larger than
James> 2^64, and so representing the seed and making use of it would be very
James> tricky.  I then had a look at Fascicle 2 of volume 4 of Knuth, but it
James> concentrates on generating all permutations, and so wasn't directly
James> much help.

This seems to be a common point of confusion which I think needs to be
cleared up. The mapping from random seed to sample space is rarely
surjective - in fact, whenever you sample more than a few numbers from
a random number generator, it isn't. It only matters that the number
of possible seeds is much larger than the number of samplings we would
ever choose to perform, and if not 2^64 then definitely 2^128 will be
sufficient for all eternity.

(Maybe some theoretical common ground is needed. A random variable is a
function X from a measurable space with probability measure P to
another measurable space such that X^{-1}(B) is measurable for every
measurable B in the range of X. We can define a probability measure on
the range of X by Q(B) = P(X^{-1}(B)), called the distribution of X.

Usually the domain of X is taken to be unspecified and uncountable,
but since in practice we only ever sample a finite subset of it - and
since we cannot calculate the inverse of X at arbitrary points - we
can replace the domain with a suitably large finite set with discrete
probability measure and still get good sample coverage, independently
of the size of the range of X. This new domain is the seed domain.)

Phil> Paul's statement:
Phil> > It's not as easy to implement as it sounds, I'm afraid; even if you
Phil> > can assume /dev/random or /dev/urandom.
Phil> 
Phil> is a slight concern, but I imagine the problems are applicable to *any*
Phil> shuffling utility. Is it that the app must guarantee all lines of a
Phil> non-seekable stdin must have an equal chance of any sort order?

See my comment to James above. I think one need not make this
guarantee, since only a tiny fraction of possible sort orders will be
able to be tried by the user. However, I think it would be true for my
proposal (and the one that James suggested to Davis) if one took the
random number generator or hash function to be variable or unspecified
during analysis of the algorithm. The fact that we ship with a
particular RNG or hash function (of all uncountable possibilities,
some unimplementable) hard-coded is not a problem since its resources
are never exhausted.

Jim> It looks like there are some desirable features that can be
Jim> provided only by a shuffle-enabled program that is key-aware.
Jim> Key specification and the comparison code are already part of sort.
Jim> Obviously, duplicating all of that in a separate program is not
Jim> an option.  I don't relish the idea of factoring out sort's line-
Jim> and key-handling code either, but it might be feasible.
Jim> 
Jim> However, I do like the idea of a new program that simply outputs
Jim> a random permutation of its input records, and that does it well,
Jim> and repeatably.  The Unix tool philosophy certainly does encourage
Jim> the `perform one task and do it well' approach.  Since doing it
Jim> well includes handling input larger than available virtual memory,
Jim> this is not trivial -- and it is well suited to the coreutils,
Jim> i.e., it's not easily implementable as a script.
Jim> 
Jim> Initially, I was inclined to say that adding both the new program
Jim> (no key support) and related functionality to sort was desirable.
Jim> Thinking of the limits of robustness of such a new program, I
Jim> realized that if the input is sufficiently large and not seekable
Jim> (e.g., from a pipe), then the program will have to resort to writing
Jim> temporary files, much as sort already does.  More duplicated effort,
Jim> determining how much memory to use (like sort's --buffer-size=SIZE
Jim> option), managing the temporary files, ensuring that they're removed
Jim> upon interrupt, etc.  But maybe not prohibitive.  The new program
Jim> would also have to have an option like sort's -z, --zero-terminated
Jim> option, and --temporary-directory=DIR, and --output=FILE.  In effect,
Jim> it would need all of sort's options that don't relate to sorting.

I did have some sympathy for Davis' point that a specialized 'shuffle'
can be implemented much more efficiently than one based on an
algorithm tuned to mergesort... Although I don't see a problem with
two utilities if someone wants to write the second one, and it is
correct, I would tend to agree with you and prefer a key-aware version
as a priority.

However, I just realized that the algorithm James suggested to Davis
is wrong for the same reason that the original random patch I
responded to is wrong:

http://lists.gnu.org/archive/html/bug-coreutils/2005-01/msg00145.html

I'm sorry for not noticing this earlier.

One possibility for an efficient random permutation of a large file is
a program which scans the file once and records the location of each
line in an index, then applies a uniformly distributed random
permutation to this line index by stepping through it in order and
swapping the current entry with an entry chosen at random from the set
of later entries, and finally steps through the index again and
dereferences each line entry and prints out the corresponding line in
the original file.

Note that there is a randomize-lines package in Debian by Arthur de
Jong, I've cc'ed him. It's much faster than 'sort' on small files, but
doesn't handle large files at all.

Jim> So implementing a robust shuffle program, even one without key
Jim> handling capabilities, would require much of the infrastructure
Jim> already present in sort.c.
Jim>
Jim> It sure sounds like shuffle and sort should share a lot of code,
Jim> one way or another, so why not have them share the line- and key-
Jim> handling code, too?  I won't rule out adding a new program, like
Jim> shuffle, but I confess I'm less inclined now than when I started
Jim> typing this message.

Well, it would only be line-handling code, and whatever else, but no
key-handling code since Davis' shuffle wouldn't be key-aware, unless
you're talking about my proposal. And maybe common things could be put
in a library, I don't know. I am against duplication but not against
change.

Frederik




reply via email to

[Prev in Thread] Current Thread [Next in Thread]