[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: sort suggestion : --head/tail n option
From: |
Philippe De Muyter |
Subject: |
Re: sort suggestion : --head/tail n option |
Date: |
Fri, 23 Apr 2004 12:07:31 +0200 (CEST) |
Paul Eggert wrote :
> "Philippe De Muyter" <address@hidden> writes:
>
> > Do you think that this new option would fit easily in the current `sort'
> > implementation, and how ?
>
> It could be done and would I think be useful, though there would be a
> few issues.
>
> 1. You need to handle the case of "ties" correctly. A tie occurs if
> two lines compare equal: this can happen even if their byte-for-byte
> contents differ. In this case, you want the output of "sort --tail
> 1000" to be equivalent to "sort | tail -n 1000". As you're sorting if
> a new "tie" comes in for the first line in the tail, you need to
> discard that first line and insert the "tie" in the correct position
> somewhere in the "tail". Ties for "sort --head" are easier, as you
> can simply discard new ties for the last line in the head.
What's the current output order for "ties" ? Input order ? or something else ?
Does the 'compare' function take care of that ?
>
> 2. Have you looked at the literature for the best data structure to
> solve your problem? For example, the people doing simulations have
> worked hard on algorithms to implement priority queues (where the most
> important operations are "delete min" and "insert"), and this seems to
> be close to what you need. Splay heaps seem to be one of the current
> champions in that area.
I had already implemented my ad-hoc test program using a binary heap.
I will look at splay heaps it you think that could be better.
>
> 3 (optional). There should be no arbitrary limits, so ideally the
> performance should be good even if N is quite large, so large that the
> number of saved lines doesn't fit into memory. However, this might
> require adding major structural complexity to your algorithm so
> perhaps it isn't worth doing.
If there's a portable way to detect we overflow a reasonable memory usage,
then we could revert to the normal algorithm and output only the first/last
N lines. Handling the '-u' option should also happen that way.
>
> Anyway, thanks for looking into the problem.
>
Thanks for your interest
Philippe