Re: [coreutils] added ability in sort to skip n number of lines for each file
Mon, 22 Nov 2010 15:25:39 -0500
Sorry, the command for step 5 was missing:
$ sort-header -S 5M -l 1 -n input_with_header.txt | head -n 5
(Where "sort-header" is sort from coreutils version 8.7 patched with Jim's patch.)
On 11/22/2010 03:20 PM, Assaf Gordon wrote:
Hello Jim and all,
On 11/18/2010 11:36 AM, Jim Hester wrote:
A common problem when sorting files stems from the file containing 1
or more header lines, which should not be sorted.
I'm also very much interested in a "header-aware" sort operation.
However, I've found that "sort" requires a slightly more complicated
solution to get a stable header-aware operation, because internally it
splits the input into temporary files and merges them later.
I have made a simple patch to
implement this feature, which I have attached to this email.
At the very least, I think that the following lines in your patch:
+ case 'l':
Should be changed to:
+ case 'l':
+ nline_skip = specify_nline_skip(oi,c,optarg);
Otherwise the "nline_skip" variable stays at 0 and no lines are skipped.
However, your patch works only as long as all the sorting is done
in memory and never goes through the temporary files + merging flow.
Here's a demonstration of the problem:
1. Create a file containing the numbers 1 to 1M, three times, with a
header line:
$ (echo "42_header" ; seq 1 1000000 ; seq 1 1000000 ; seq 1 1000000) > input_with_header.txt
2. Sort with regular (unpatched) sort, all is well (obviously, the
header line will be sorted as a number, not appear as the first line):
$ sort -n input_with_header.txt | head -n 5
3. Sort with regular sort, limiting memory to 5MB (forcing sort to use
temporary files), all is still well:
$ sort -S 5M -n input_with_header.txt | head -n 5
4. Sort with your patched sort, sorting done in-memory (because the
file is about 20MB and the default buffer is 500MB, IIRC) - all is
well, the header line is maintained as first line:
$ sort-header -l 1 -n input_with_header.txt | head -n 5
5. But sort with your patched sort, limiting memory to 5MB (forcing
temporary files + merging), and the output is incorrect:
I do not mean to discourage you, as I find the header sorting (and
joining) to be much needed. But I suspect a correct implementation
will be more complicated.
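
To check the output of steps 4 and 5 mechanically rather than by eye,
something like the following could be used. It is only a sketch:
"check_header_sorted" is a hypothetical helper name (it is not part of
coreutils or of the patch), and it assumes a single header line and a
numeric sort.

```shell
#!/bin/sh
# check_header_sorted FILE HEADER
# Hypothetical helper (an assumption, not part of coreutils or the patch):
# succeeds only if the first line of FILE equals HEADER and all the
# remaining lines are in numeric order.
check_header_sorted() {
    file=$1
    header=$2
    # The header must still be the first line of the output.
    [ "$(head -n 1 "$file")" = "$header" ] || return 1
    # "sort -c" only checks the ordering; it exits non-zero at the
    # first out-of-order line and writes nothing to stdout.
    tail -n +2 "$file" | sort -n -c 2>/dev/null
}
```

For example:
$ sort-header -S 5M -l 1 -n input_with_header.txt > out.txt
$ check_header_sorted out.txt 42_header && echo OK || echo BROKEN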
As a work-around, we're using a shell script that accepts most (not
all) of sort's options, "steals" the first couple of header lines,
then passes the rest of the input to sort.
Unlike Padraig's suggested solution, this script supports sorting from
This is the script:
It's far from complete, and if anyone has suggestions or comments about
it, they are welcome.
(It also assumes the input is tab-delimited, not white-space
delimited, which is fine for my purposes).
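
For readers who only want the general idea, here is a minimal sketch of
the "steal the header, sort the rest" approach. It is not the script
referred to above: "sort_with_header" is a hypothetical name, it reads
standard input only, and it simply forwards every remaining argument to
sort without validating it.

```shell
#!/bin/sh
# sort_with_header N [SORT_OPTIONS...]
# Hypothetical sketch (an assumption, not the script from this message):
# copy the first N lines of stdin through unchanged, then sort the rest.
sort_with_header() {
    nlines=$1
    shift
    i=0
    # Read the header one line at a time with the shell's own "read",
    # so that no buffering utility consumes input meant for sort.
    while [ "$i" -lt "$nlines" ] && IFS= read -r line; do
        printf '%s\n' "$line"
        i=$((i + 1))
    done
    # Whatever is left on stdin is the body; all remaining arguments
    # are passed to sort as-is.
    sort "$@"
}
```

For example:
$ sort_with_header 1 -S 5M -n < input_with_header.txt | head -n 5
Because sort itself never sees the header line, the temporary files +
merging flow should not be able to move it.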