bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [PATCH] Makes sort create random order


From: Paul Eggert
Subject: Re: [PATCH] Makes sort create random order
Date: Wed, 01 Sep 2004 23:47:43 -0700
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

Thomas Habets <address@hidden> writes:

>>   sort: Add an ordering option -R that causes 'sort' to sort according
>>     to a random permutation of the correct sort order.
>
> This means that two different files, that happen to sort to the same output,
> should give the same output when randomized with the same SEED. Is that
> right? [*]

Sort of, but not quite.  The idea was that if the input is this:

1 3 2 3.0 4 3.000 5

and we are using sort -n so the correct sort order is this:

1 2 3     4 5
    3.0
    3.000

(where I've put "3", "3.0", and "3.000" atop each other to indicate
that these three values are all tied), then sort -R would permute this
order randomly, e.g. to:

3     1 4 5 2
3.0
3.000

and then sort as if that were the correct order.  Part of the idea is
that all the "3", "3.0", and "3.000" values will sort together, even
when sorting "randomly", since they are all the same number.

> I'm not a sorting-expert, but this would mean running through the
> input twice, would it not?

Perhaps, perhaps not.  Let's figure out the proper semantics first
before we worry about implementation details.

> Is there a good reason for wanting this?

By "this" do you mean "a fairly-formal definition", or "this
particular definition of random sorting"?  If the former, we need to
define what "sort randomly" means, in a way that would satisfy someone
trying to specify it fairly formally, and so that others can implement
it according to our spec.  If the latter, then because we want sort -R
to have the usual properties that people expect from "sort", e.g.,
"sort -rR" should output in the reverse order of "sort -R".

>>     if you sort a permutation of the same input file
>>     with the same --random-seed=SEED option twice, you'll get the same
>>     output. [**]
>
> Here however it does not explicitly say what I said above about two different
> files.

If two  files sort  to the same  output, then they're  permutations of
each other.  So  [**] implies [*].  (The converse  does not hold.  See
what I mean about the logic being tricky here?...)

> So what is "random"? Do we mean "arbitrary" or "unpredictable"?

That's a deep subject.  I suggest Knuth Volume 2.
http://www-cs-faculty.stanford.edu/~knuth/taocp.html

> Also, should "deteministic" here mean that it should now, and forever, and on
> all architectures, work the same?

Yes.




reply via email to

[Prev in Thread] Current Thread [Next in Thread]