coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: potential feature addition to coreutils' sort.c: print at most N lin


From: James Dowdell
Subject: Re: potential feature addition to coreutils' sort.c: print at most N lines
Date: Tue, 5 Mar 2013 17:26:56 -0800

Pádraig,

Thanks for the pointers.  I am slowly making my way through sort.c,
and preparing a patch; I estimate I'll have something ready in a few
days or weeks.

One final question before I go make this thing, regarding the thread
conversation you included from 2004, in which Philippe De Muyter
suggested he already had a functional patch involving a heap.  Is
there a code change I should be starting from, or has that been lost /
never submitted?

Originally I also thought that the 'optimal' solution would involve
heaps in some way, as everyone in that thread suspected, but I've
surprised myself by discovering an n-log-k algorithm based off of
merge sort, which I think would be much easier to dovetail with the
existing code in sort.c and may actually work better too.  If there is
no patch to begin from, I'll be preparing my patch based on that
unless someone objects.

-James

On Sun, Mar 3, 2013 at 4:47 PM, Pádraig Brady <address@hidden> wrote:
>
> On 03/03/2013 05:32 PM, James Dowdell wrote:
> > I'm considering writing a patch for sort.c to add a new feature, related to 
> > a stackoverflow inquiry I wrote 
> > (http://stackoverflow.com/questions/14882897/what-standard-commands-can-i-use-to-print-just-the-first-few-lines-of-sorted-out).
> >
> > This would be my first patch, and this is my first time messaging a gnu 
> > list; apologies if I'm "doing it wrong."
> >
> > I use GNU sort a lot, and routinely find myself in the situation of 
> > executing, e.g.:
> >
> > $ sort ... | head -n 1000
> >
> > This can be very unnecessarily slow when the input is huge, because sort 
> > does a lot of work that head throws away.
> >
> > I propose a new parameter, "-H, --head=NLINES", which has sort only print 
> > at most NLINES of output.  More than just a filter at the end like | head, 
> > it would avoid unnecessary sorting on more than NLINES of output.
> >
> > I want to know the procedure for submitting a patch, and the likelihood 
> > that such a patch would even be considered, before I spend time to parse 
> > the whole sort.c file and propose a complete and rigorous solution (which 
> > would be analogous to submitting the patch).  From a quick glance at the 
> > source, my current strategy would be to alter the merge nodes when this 
> > parameter is set so that the number of lines listed per node is clamped to 
> > NLINES.  While less efficient than an ideal solution, it would be more 
> > efficient than what's currently in place, and has the benefits of minimal 
> > code edits and negligible negative performance impact on mainstream use 
> > when the parameter is not passed.
> >
> > All feedback welcome, thank you.
>
> There is general agreement that this is worthwhile.
>
> Please read these first:
>   http://lists.gnu.org/archive/html/bug-coreutils/2004-04/msg00157.html
>   http://lists.gnu.org/archive/html/bug-coreutils/2009-07/msg00019.html
>
> As for contributing the patch, it would be much appreciated.
>
> For contribution details, please see the HACKING file:
> http://git.sv.gnu.org/cgit/coreutils.git/plain/HACKING
>
> In summary you would submit a patch against the latest git tree,
> to address@hidden.  Also for a patch of this significance,
> you would need to follow the copyright assignment procedure.
>
> thanks!
> Pádraig.



reply via email to

[Prev in Thread] Current Thread [Next in Thread]