bug-coreutils

Re: new coreutil? shuffle - randomize file contents


From: Frederik Eaton
Subject: Re: new coreutil? shuffle - randomize file contents
Date: Mon, 30 May 2005 16:02:35 -0700
User-agent: Mutt/1.5.9i

On Mon, May 30, 2005 at 09:25:45AM +0000, Davis Houlton wrote:
> Hi Frederik! I guess we're both a little confused :) My question is why would 
> I sort AND shuffle in the same command? Are we talking sort the whole data 
> set and shuffle a subset? I guess I'm having a hard time thinking why I would 
> randomize via key--not saying that there aren't reasons, I'm just not sure 
> what they are! 

This is covered in the previous thread. The canonical example is
playing songs with albums shuffled, but with songs on each album
played together and in order.

> My premise is that shuffle is organized pretty differently than sort--the
> code I have (in addition to the code I imagine we'll need for large files)
> looks radically different than sort, if only because shuffling is vastly
> simpler.
> 
> While we could graft a shuffle into sort--I must admit to have only taken a 
> cursory glance at the sort source--I think we can gain greater efficiencies 
> by keeping the logic paths separate.  My assumption is thus the shuffling 
> code will be its own entity, whether it is in sort or shuffle.

It is true that shuffling can generally be done more efficiently than
sorting. I don't know if efficiency is a primary concern - I think
that the *ability* to handle multi-gigabyte files is important, but
since they come up so rarely, especially when the task is to shuffle
and not to sort, whether they are done in a minute or 30 minutes seems
inconsequential. But if you are already writing something which will
be able to handle large files well, I guess I personally don't see a
problem with including it in coreutils. The only catch is that what
you describe won't be able to handle all of the use cases that I had
in mind. I would still like to see 'sort' have an option to sort based
on a hash of keys, since that would cover the remaining cases.
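
To make the "sort on a hash of keys" idea concrete, here is a minimal
sketch in Python rather than a proposal for the actual 'sort' code. The
function name, the salt parameter, the choice of SHA-256 and the sample
playlist are all assumptions made up for illustration. Lines that share
a key hash to the same value, so they stay together, and a stable sort
keeps them in their input order. That is the album case: albums come
out in random order, songs within an album stay in sequence.

```python
import hashlib
import os


def sort_by_key_hash(lines, key, salt=None):
    """Return lines ordered by a salted hash of key(line).

    Lines sharing a key get the same hash, so they stay adjacent, and
    because sorted() is stable they keep their input order.
    """
    if salt is None:
        salt = os.urandom(16)  # a fresh salt gives a fresh ordering each run

    def hashed(line):
        return hashlib.sha256(salt + key(line).encode("utf-8")).digest()

    return sorted(lines, key=hashed)


def album(line):
    """Key function: the first tab-separated field (hypothetical layout)."""
    return line.split("\t")[0]


# Hypothetical playlist, "album<TAB>track<TAB>title", already in track order.
playlist = [
    "Album A\t1\tIntro",
    "Album A\t2\tSecond",
    "Album B\t1\tOpener",
    "Album B\t2\tCloser",
    "Album C\t1\tOnly Track",
]

if __name__ == "__main__":
    for line in sort_by_key_hash(playlist, album):
        print(line)
```

Using the whole line as the key turns this into an ordinary shuffle,
except that identical lines end up next to each other.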

> Looking at it a different way, let's take a look at the usage of sort and
> shuffle as a card metaphor.  The way I sort a deck of cards--and my rather 
> simple method is far from optimum--is to first spread the cards face up out 
> on a table, look for some high cards of each suit, start a pile of the four 
> suits, and then as I pull additional cards, place them in the proper order in 
> each suit pile. When I'm done sometime later, I'm left with the four stacks 
> of cards, each suit in the proper order.  
> 
> When I shuffle the resulting deck, however, I use a different process. 
> Granted, I could spread all the cards on the table, mix them up "domino" 
> style, and then place them randomly into one, or even four stacks.  That 
> would be acceptable.  But what I do (following the grand tradition of card 
> shark wannabes everywhere) is split the deck in half.  I take each deck, and 
> attempt to randomly merge them together like we've all seen those Las Vegas 
> dealers do on tv, and voila--I have now (in theory) randomized the deck. It's 
> quicker and just as effective as the table spread method.
> 
> If we are willing to ignore the imperfections of the analogy--that Vegas 
> dealers shuffle their cards 7 times, that I have a tendency to mangle cards 
> with improper shuffling technique, etc--my thinking is that it makes sense to 
> have sort and shuffle remain separate on an intuitive level.  And I admit, it 
> is true, it is not hard to train a user in sort and shuffle commands.  Had 
> sort --random already existed, there would be no need to propose any 
> separation. But if we accept as a given that the code will follow two 
> different logic paths, I personally don't see maintenance gains from 
> combining the two.  

I hope that you aren't proposing an algorithm which is similar to
card-shuffling. That would be exactly like merge-sorting on a key hash
- i.e. no more efficient.
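
For contrast, here is a sketch of what a shuffle that really is cheaper
than sorting could look like when the data fits in memory. It uses the
Fisher-Yates algorithm, which nobody has named in this thread, so treat
it as my assumption about what "more efficient" would mean here, not as
a description of the proposed code. It makes a single pass with one
swap per element, whereas a riffle needs repeated merge-like passes.

```python
import random


def fisher_yates(items, rng=random):
    """Return a uniformly shuffled copy of items in O(n) time."""
    result = list(items)
    # Walk from the last slot down, swapping each slot with a randomly
    # chosen slot at or before it.
    for i in range(len(result) - 1, 0, -1):
        j = rng.randint(0, i)  # randint is inclusive on both ends
        result[i], result[j] = result[j], result[i]
    return result


if __name__ == "__main__":
    print(fisher_yates(["a", "b", "c", "d", "e"]))
```

This needs random access to every line, so it does not by itself
address files too large for memory.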

> I took a quick scan of the archive and it seemed like the conclusion
> was it is a good idea to keep shuffle functionality separate?

I believe it was concluded that two functionalities were needed - I
don't know what you mean by "separate".

Frederik



