bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: sort suggestion : --head/tail n option


From: Philippe De Muyter
Subject: Re: sort suggestion : --head/tail n option
Date: Fri, 23 Apr 2004 12:07:31 +0200 (CEST)

Paul Eggert wrote :
> "Philippe De Muyter" <address@hidden> writes:
> 
> > Do you think that this new option would fit easily in the current `sort'
> > implementation, and how ?
> 
> It could be done and would I think be useful, though there would be a
> few issues.
> 
> 1.  You need to handle the case of "ties" correctly.  A tie occurs if
> two lines compare equal: this can happen even if their byte-for-byte
> contents differ.  In this case, you want the output of "sort --tail
> 1000" to be equivalent to "sort | tail -n 1000".  As you're sorting if
> a new "tie" comes in for the first line in the tail, you need to
> discard that first line and insert the "tie" in the correct position
> somewhere in the "tail".  Ties for "sort --head" are easier, as you
> can simply discard new ties for the last line in the head.

What's the current output order for "ties" ? Input order ? or something else ?
Does the 'compare' function take care of that ?

> 
> 2.  Have you looked at the literature for the best data structure to
> solve your problem?  For example, the people doing simulations have
> worked hard on algorithms to implement priority queues (where the most
> important operations are "delete min" and "insert"), and this seems to
> be close to what you need.  Splay heaps seem to be one of the current
> champions in that area.

I had already implemented my ad-hoc test program using a binary heap.
I will look at splay heaps it you think that could be better.

> 
> 3 (optional).  There should be no arbitrary limits, so ideally the
> performance should be good even if N is quite large, so large that the
> number of saved lines doesn't fit into memory.  However, this might
> require adding major structural complexity to your algorithm so
> perhaps it isn't worth doing.

If there's a portable way to detect we overflow a reasonable memory usage,
then we could revert to the normal algorithm and output only the first/last
N lines.  Handling the '-u' option should also happen that way.

> 
> Anyway, thanks for looking into the problem.
> 

Thanks for your interest

Philippe




reply via email to

[Prev in Thread] Current Thread [Next in Thread]