Re: new coreutil? shuffle - randomize file contents
From: Frederik Eaton
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Mon, 30 May 2005 16:02:35 -0700
User-agent: Mutt/1.5.9i
On Mon, May 30, 2005 at 09:25:45AM +0000, Davis Houlton wrote:
> Hi Frederik! I guess we're both a little confused :) My question is why would
> I sort AND shuffle in the same command? Are we talking sort the whole data
> set and shuffle a subset? I guess I'm having a hard time thinking why I would
> randomize via key--not saying that there aren't reasons, I'm just not sure
> what they are!
This is covered in the previous thread. The canonical example is
playing songs with albums shuffled, but with songs on each album
played together and in order.
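
A minimal sketch of that use case, assuming the records are (album, track, title) tuples and that the input already lists each album's songs in track order. The names `shuffle_by_key`, `salt`, and the sample data are hypothetical, chosen only for illustration; this is not code from sort or from any proposed patch. The idea is to sort on a salted hash of the key (here, the album), so that equal keys stay adjacent while the order of the groups depends on the salt.

```python
import hashlib


def shuffle_by_key(records, key, salt):
    """Return records ordered by a salted hash of key(record).

    Records with the same key end up adjacent; because sorted() is stable,
    they also keep their original relative order.
    """
    def hashed(record):
        k = key(record)
        digest = hashlib.sha256(salt + b"\0" + k.encode("utf-8")).digest()
        # The raw key is a tie-breaker so that two different keys can never
        # interleave, even in the (astronomically unlikely) case of a collision.
        return (digest, k)

    return sorted(records, key=hashed)


# Hypothetical sample data: (album, track number, title).
songs = [
    ("Album A", 1, "a-one"), ("Album A", 2, "a-two"),
    ("Album B", 1, "b-one"), ("Album B", 2, "b-two"),
    ("Album C", 1, "c-one"), ("Album C", 2, "c-two"),
]

if __name__ == "__main__":
    # A different salt gives a different ordering of the albums.
    for album, track, title in shuffle_by_key(songs, key=lambda s: s[0], salt=b"run-1"):
        print(album, track, title)
```

Changing the salt on each run plays the role of the random seed: the albums come out in a different order, but each album's tracks are still together and in sequence.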
> My premise is that shuffle is organized pretty differently than sort--the code
> I have (in addition to the code I imagine we'll need for large files) looks
> radically different than sort, if only because shuffling is vastly simpler.
>
> While we could graft a shuffle into sort--I must admit to having only taken a
> cursory glance at the sort source--I think we can gain greater efficiencies
> by keeping the logic paths separate. My assumption is thus the shuffling
> code will be its own entity, whether it is in sort or shuffle.
It is true that shuffling can generally be done more efficiently than
sorting. I don't know if efficiency is a primary concern - I think
that the *ability* to handle multi-gigabyte files is important, but
since they come up so rarely, especially when the task is to shuffle
and not to sort, whether they are done in a minute or 30 minutes seems
inconsequential. But if you are already writing something which will
be able to handle large files well, I guess I personally don't see a
problem with including it in coreutils. The only thing is that what
you describe won't be able to handle all of the use cases that I had
in mind. I would still like to see 'sort' have an option to sort based
on a hash of keys since this would cover those.
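
To illustrate the efficiency point, here is a sketch of a plain in-memory shuffle using the Fisher-Yates algorithm, which makes one linear pass instead of the O(n log n) comparisons a sort needs. It is an assumption about what a dedicated shuffle might look like, not Davis's code; the function name and the `rng` parameter are hypothetical, and it does not address files too large for memory.

```python
import random


def fisher_yates_shuffle(lines, rng=None):
    """Return a uniformly shuffled copy of lines in O(n) time."""
    rng = rng or random.Random()
    result = list(lines)
    # Walk from the last slot down, swapping each slot with a randomly
    # chosen slot at or before it (randint is inclusive on both ends).
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)
        result[i], result[j] = result[j], result[i]
    return result


if __name__ == "__main__":
    print(fisher_yates_shuffle(["one", "two", "three", "four"]))
```

Note that this treats every line as an independent item. It has no notion of a key, so it cannot keep related lines together, which is the case a sort on a hash of keys would cover.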
> Looking at it a different way, let's take a look at the usage of sort and
> shuffle as a card metaphor. The way I sort a deck of cards--and my rather
> simple method is far from optimum--is to first spread the cards face up out
> on a table, look for some high cards of each suit, start a pile of the four
> suits, and then as I pull additional cards, place them in the proper order in
> each suit pile. When I'm done sometime later, I'm left with the four stacks
> of cards, each suit in the proper order.
>
> When I shuffle the resulting deck, however, I use a different process.
> Granted, I could spread all the cards on the table, mix them up "domino"
> style, and then place them randomly into one, or even four stacks. That
> would be acceptable. But what I do (following the grand tradition of card
> shark wannabes everywhere) is split the deck in half. I take each deck, and
> attempt to randomly merge them together like we've all seen those Las Vegas
> dealers do on tv, and voila--I have now (in theory) randomized the deck. It's
> quicker and just as effective as the table spread method.
>
> If we are willing to ignore the imperfections of the analogy--that Vegas
> dealers shuffle their cards 7 times, that I have a tendency to mangle cards
> with improper shuffling technique, etc--my thinking is that it makes sense to
> have sort and shuffle remain separate on an intuitive level. And I admit, it
> is true, it is not hard to train a user in sort and shuffle commands. Had
> sort --random already existed, there would be no need to propose any
> separation. But if we accept as a given that the code will follow two
> different logic paths, I personally don't see maintenance gains from
> combining the two.
I hope that you aren't proposing an algorithm which is similar to
card-shuffling. That would be exactly like merge-sorting on a key hash
- i.e. no more efficient.
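
A sketch of why that is, assuming the usual random-cut, random-interleave model of a riffle shuffle (the function names here are hypothetical). Each pass is a linear merge of two halves, just like one merge step in a merge sort. A single pass leaves the input as at most two interleaved runs that are still in their original order, so several passes are needed, and the number of passes grows with the logarithm of the input size.

```python
import random


def riffle_once(deck, rng):
    """One riffle pass: cut the deck at a random point, then merge the halves."""
    # Binomial cut: each card independently lands in the left half with p = 1/2.
    cut = sum(rng.random() < 0.5 for _ in deck)
    left, right = deck[:cut], deck[cut:]
    out = []
    i = j = 0
    while i < len(left) or j < len(right):
        remaining_left = len(left) - i
        remaining_right = len(right) - j
        # Drop the next card from a half with probability proportional to
        # how many cards that half still holds.
        if rng.random() * (remaining_left + remaining_right) < remaining_left:
            out.append(left[i])
            i += 1
        else:
            out.append(right[j])
            j += 1
    return out


def riffle_shuffle(deck, passes, rng=None):
    """Apply several riffle passes; each pass costs O(n)."""
    rng = rng or random.Random()
    deck = list(deck)
    for _ in range(passes):
        deck = riffle_once(deck, rng)
    return deck


if __name__ == "__main__":
    print(riffle_shuffle(range(10), passes=7))
```

With a pass count proportional to log n, the total work is O(n log n), the same order as merge-sorting on a hashed key.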
> I took a quick scan of the archive and it seemed like the conclusion
> was it is a good idea to keep shuffle functionality separate?
I believe it was concluded that two functionalities were needed - I
don't know what you mean by "separate".
Frederik