Re: [coreutils] added ability in sort to skip n number of lines for each

From:	Assaf Gordon
Subject:	Re: [coreutils] added ability in sort to skip n number of lines for each file
Date:	Mon, 22 Nov 2010 15:20:07 -0500
User-agent:	Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.15) Gecko/20101030 Icedove/3.0.10

Hello Jim and all,


On 11/18/2010 11:36 AM, Jim Hester wrote:
> A common problem when sorting files stems from the file containing 1
> or more header lines, which should not be sorted.

I'm also very much interested in a "header-aware" sort operation.

However, I've found "sort" to require a slightly more complicated
solution to have a stable "sort" operation, due to its internal
implementation of splitting files and merging them later.

> I have made a simple patch to
> implement this feature, which I have attached to this email.

At the very least, I think that the following lines in your patch:
+        case 'l':
+          specify_nline_skip(oi,c,optarg);
+          break;

should be changed to:
+        case 'l':
+          nline_skip = specify_nline_skip(oi,c,optarg);
+          break;

Otherwise the "nline_skip" variable stays at 0 and no lines are skipped.

But your patch works only as long as all the sorting is done in-memory,
and never goes into the temporary files + merging flow.


Here's a demonstration of the problem:

1. Create a file containing the numbers from 1 to 1M, three times, with
a header line:

   $ (echo "42_header" ; seq 1 1000000 ; seq 1 1000000 ; seq 1 1000000) > input_with_header.txt

2. Sort with regular (unpatched) sort, all is well (obviously, the
header line will be sorted as a number, not appear as the first line):

   $ sort -n input_with_header.txt | head -n 5
   1
   1
   1
   2
   2

3. Sort with regular sort, limit memory to 5M (forcing sort to use
temporary files), all is still well:

   $ sort -S 5M -n input_with_header.txt | head -n 5
   1
   1
   1
   2
   2

4. Sort with your patched sort, sorting done in-memory (because the file
is about 20MB and the default buffer is 500MB, IIRC) - all is well, the
header line is maintained as first line:


   $ sort-header -l 1 -n input_with_header.txt | head -n 5
   42_header
   1
   1
   1
   2

5. But sort with your patched sort, limit memory to 5MB (forcing
temporary files + merging), and the output is incorrect (each number
should appear three times, as in step 4):

   42_header
   1
   2
   3
   4
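
A check like the one in steps 4 and 5 could be automated roughly as
follows. This is only a sketch: check_header_sort is a made-up helper
name, it assumes the command under test reads from STDIN, and it
compares checksums rather than full outputs to keep memory use low.

```shell
# check_header_sort SIZE CMD [ARGS...]   (hypothetical helper)
# Feeds the demonstration input (header line + 1..SIZE three times) to CMD
# on STDIN and succeeds only if the output is the header followed by the
# numerically sorted data lines.
check_header_sort() {
    size=$1
    shift
    # Known-good result: header first, then every value three times in order.
    expected=$( { echo "42_header"
                  seq 1 "$size" | awk '{ print; print; print }'; } | cksum )
    # Output of the command under test on the same kind of input as step 1.
    actual=$( { echo "42_header"
                seq 1 "$size"; seq 1 "$size"; seq 1 "$size"; } | "$@" | cksum )
    [ "$actual" = "$expected" ]
}
```

Assuming the patched binary is invoked as in step 4, running
"check_header_sort 1000000 sort-header -l 1 -n" and then the same
command with "-S 5M" added would exercise both the in-memory path and
the temporary files + merging path.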

----

I do not mean to discourage you, as I find the header sorting (and
joining) to be much needed. But I suspect a correct implementation will
be more complicated.

As a work-around, we're using a shell script that accepts most (not all)
of sort's options, "steals" the first couple of header lines, then
passes the remaining lines to sort.
Unlike Padraig's suggested solution, this script supports sorting from a
pipe/STDIN.


This is the script:
http://cancan.cshl.edu/labmembers/gordon/files/sort-header

It's far from complete, and if anyone has suggestions or comments about
it - they are welcome.
(It also assumes the input is tab-delimited, not white-space delimited,
which is fine for my purposes).
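
The general idea of the work-around can be sketched in a few lines of
shell. This is not the sort-header script linked above, only a minimal
illustration; sort_with_header is a made-up name, and the sketch reads
from STDIN only (no file arguments).

```shell
# sort_with_header N [SORT-OPTIONS...]   (hypothetical helper)
# Copies the first N lines of STDIN through unchanged, then hands the
# remaining lines to sort together with the given options.
sort_with_header() {
    n=$1
    shift
    i=0
    # "read" consumes exactly one line at a time, so the rest of STDIN
    # is still available to sort afterwards.
    while [ "$i" -lt "$n" ] && IFS= read -r line; do
        printf '%s\n' "$line"
        i=$((i + 1))
    done
    # sort inherits STDIN, positioned just after the header lines.
    sort "$@"
}
```

For example, "(echo 42_header ; seq 3 -1 1) | sort_with_header 1 -n"
prints the header followed by 1, 2, 3. Because sort only ever sees the
data lines, it does not matter whether it sorts in memory or falls back
to temporary files + merging.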


regards,
 -gordon
