Re: [coreutils] added ability in sort to skip n number of lines for each


From: Assaf Gordon
Subject: Re: [coreutils] added ability in sort to skip n number of lines for each file
Date: Mon, 22 Nov 2010 15:20:07 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101030 Icedove/3.0.10

Hello Jim and all,


On 11/18/2010 11:36 AM, Jim Hester wrote:

> A common problem when sorting files stems from the file containing 1
> or more header lines, which should not be sorted.

I'm also very much interested in a "header-aware" sort operation.
However, I've found that "sort" requires a slightly more complicated solution to get a reliable header-aware sort, because internally it splits large inputs into temporary files and merges them later.



> I have made a simple patch to
> implement this feature, which I have attached to this email.

At the very least, I think that the following lines in your patch:
+        case 'l':
+          specify_nline_skip(oi,c,optarg);
+          break;

Should be changed to:
+        case 'l':
+          nline_skip = specify_nline_skip(oi,c,optarg);
+          break;

Otherwise the "nline_skip" variable stays at 0 and no lines are skipped.


But, your patch works only as long as all the sorting is done in-memory, and never goes into the temporary files + merging flow.

Here's a demonstration of the problem:

1. Create a file containing the numbers from 1 to 1M, three times, with a header line:
   $ (echo "42_header" ; seq 1 1000000 ; seq 1 1000000 ; seq 1 1000000) > input_with_header.txt

2. Sort with regular (unpatched) sort, all is well (obviously, the header line will be sorted as a number, not appear as the first line):
   $ sort -n input_with_header.txt | head -n 5
   1
   1
   1
   2
   2

3. Sort with regular sort, limiting memory to 5M (forcing sort to use temporary files), all is still well:
   $ sort -S 5M -n input_with_header.txt | head -n 5
   1
   1
   1
   2
   2

4. Sort with your patched sort, sorting done in-memory (because the file is about 20MB and the default buffer is 500MB, IIRC) - all is well, the header line is maintained as first line:

   $ sort-header -l 1 -n input_with_header.txt | head -n 5
   42_header
   1
   1
   1
   2

5. But sort with your patched sort while limiting memory to 5MB (forcing temporary files + merging), and the output is incorrect (after the header, the first lines should be 1, 1, 1, 2 as in step 4):
   42_header
   1
   2
   3
   4
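For comparison, the expected result can be produced with standard tools when the input is a regular file. This is only a sketch: the function name expected_with_header is made up for illustration, it assumes exactly one header line, and it is not part of the patch under discussion.

```shell
# Hypothetical helper (not part of the patch): print the first line of FILE
# unchanged, then the remaining lines sorted with the given sort options.
# Assumes FILE is a regular file with exactly one header line.
expected_with_header() {
    file=$1; shift
    head -n 1 -- "$file"                # header line, passed through as-is
    tail -n +2 -- "$file" | sort "$@"   # everything after the header, sorted
}
```

Running "expected_with_header input_with_header.txt -S 5M -n | head -n 5" on the file from step 1 should print the header followed by 1, 1, 1, 2, regardless of whether sort spills to temporary files.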

----

I do not mean to discourage you, as I find the header sorting (and joining) to be much needed. But I suspect a correct implementation will be more complicated.

As a work-around, we're using a shell script that accepts most (not all) of sort's options, "steals" the first couple of header lines, then passes the rest of the input to sort. Unlike Padraig's suggested solution, this script supports sorting from a pipe/STDIN.

This is the script:
http://cancan.cshl.edu/labmembers/gordon/files/sort-header

It's far from complete, and if anyone has suggestions or comments about it, they are welcome. (It also assumes the input is tab-delimited, not white-space delimited, which is fine for my purposes.)
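To illustrate the general idea (this is NOT the linked sort-header script, just a minimal sketch), the header lines can be consumed with the shell's "read" builtin before sort takes over the same STDIN. The function name sort_with_header and its argument order are assumptions made up for this example.

```shell
# Hypothetical sketch: copy the first N lines of STDIN to STDOUT unchanged,
# then let sort process whatever is left on the same STDIN.
# Usage: sort_with_header N [sort options...] < input
sort_with_header() {
    n=$1; shift
    i=0
    # "read" consumes exactly one line at a time, so the remaining input
    # is still available to sort afterwards, even when STDIN is a pipe.
    while [ "$i" -lt "$n" ] && IFS= read -r line; do
        printf '%s\n' "$line"
        i=$((i + 1))
    done
    sort "$@"
}
```

For example, "cat input_with_header.txt | sort_with_header 1 -S 5M -n" keeps the header first and leaves the temporary-file handling entirely to the unpatched sort. Unlike a real wrapper, this sketch takes no input file arguments and does no option parsing of its own.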

regards,
 -gordon





