bug#14988: sort enhancement request


From: Eric Blake
Subject: bug#14988: sort enhancement request
Date: Wed, 31 Jul 2013 07:59:23 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:17.0) Gecko/20130625 Thunderbird/17.0.7

tag 14988 notabug
thanks

[re-adding the list; and please don't top-post on technical lists]

On 07/31/2013 07:19 AM, Danny Nicholas wrote:
> Thank you Eric.  We have two sorts on our system.  Our /usr/bin/sort does not 
> support the -s option,

Makes sense - the '-s' option is a GNU extension, and your /usr/bin/sort
is probably not GNU sort.  If you want stable sorting using only POSIX
features, then you have to supply enough sort keys so that no two lines
ever compare equal (since POSIX has no way to disable the full-line sort
of last resort).  Depending on the input to be sorted, this may
indeed require a pre-filter run that adds line numbering (by the way,
sed's '=' command can do this much more efficiently than a Python
script), then sorting, then a post-filter run that removes the line number.
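
For illustration, here is a minimal sketch of that number/sort/strip
pipeline using only POSIX features.  The function name is hypothetical,
and the choice of key (the first blank-separated field) is an assumption
for the example; substitute whatever key you actually need:

```shell
#!/bin/sh
# stable_sort_first_field: sort stdin on its first blank-separated field,
# keeping lines with equal keys in their original input order.
# Hypothetical helper for illustration; uses no GNU extensions.
stable_sort_first_field() {
    # 1. Pre-filter: sed '=' prints each line number on its own line;
    #    the second sed joins each number/text pair with a single space.
    sed = | sed 'N;s/\n/ /' |
    # 2. Sort: the original first field is now field 2; the numeric line
    #    number in field 1 breaks ties, so no two lines ever compare equal.
    LC_ALL=C sort -k 2,2 -k 1,1n |
    # 3. Post-filter: drop the line number and the space after it.
    cut -d ' ' -f 2-
}
```

Because the line number is unique, the full-line comparison of last
resort never gets a chance to reorder anything.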

> but our /usr/local/bin/sort does.

Indeed - life is simpler if you can write your script to ensure that it
always sets PATH to use the full power of the GNU tools.

>  Unfortunately, that did not resolve the issue. Here is a portion of the file 
> I'm trying to sort

Thank you - THIS makes much more sense for understanding your problem.

> 010_000001_0000731_00001_200000081610_<Customer>
> 010_000001_0000731_00002_200000081610_     <CCODEPAGE>4102 LANGUAGE EN</CCODEPAGE>
> 010_000001_0000731_00003_200000081610_     <FirstCopy>YES</FirstCopy>
> 010_000001_0000731_00003_200000081610_     <eapprovetype>010</eapprovetype>
> 010_000001_0000731_00003_200000081610_     <lastpaymentdate>06/12/2013</lastpaymentdate>
> 010_000001_0000731_00003_200000081610_     <lastpaymentamount>           277.59</lastpaymentamount>
> 010_000001_0000731_00003_200000081610_     <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
> 010_000001_0000731_00003_200000081610_     <CPAGENAME>PAGE1</CPAGENAME>
> 010_000001_0000731_00004_200000081610_     <DG_BILL_LAYOUT>REGULAR</DG_BILL_LAYOUT>
> 010_000001_0000731_00005_200000081610_     <DC-DEVICE>PRINTER</DC-DEVICE>
> 010_000001_0000731_00006_200000081610_     <DC-RDI>S</DC-RDI>
> 010_000001_0000731_00007_200000081610_     <DC-SENDTYPE>PRINTER</DC-SENDTYPE>
> 010_000001_0000731_00008_200000081610_     <DSY-SYSID>R3P</DSY-SYSID>
> 
> What I am executing is /usr/local/bin/sort -k 1,36 -s file -o file2

So, with "-k1,36" you asked sort to treat as its sort key the portion of
the line ranging from the first field to the 36th field.  I only see 2
fields in most of the lines (a few have more, but none of them with 36
fields), so you are basically sorting by the entire line.  You didn't
provide any other keys, but since your first key is already botched as
the ENTIRE line, there were no lines that compared equal for -s to make
any difference.  Again, sort --debug makes this clear (using a subset of
just two lines of your input):

>> $ printf '010_000001_0000731_00003_200000081610_     <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>\n010_000001_0000731_00003_200000081610_     <CPAGENAME>PAGE1</CPAGENAME>\n' \
>>    | LC_ALL=C sort --debug -k1,36 -s
>> sort: using simple byte comparison
>> 010_000001_0000731_00003_200000081610_     <CPAGENAME>PAGE1</CPAGENAME>
>> _______________________________________________________________________
>> 010_000001_0000731_00003_200000081610_     <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
>> ________________________________________________________________________________________________________
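
If --debug is not available, a rough way to check how many fields your
lines really have is to count them with awk.  This is only a sketch: the
function name is hypothetical, and awk's default splitting on runs of
blanks is an approximation of how sort finds fields (sort keeps the
leading blanks as part of each field), which is good enough for counting:

```shell
#!/bin/sh
# max_fields: print the largest number of blank-separated fields found
# on any line of stdin.  Hypothetical helper for illustration.
max_fields() {
    awk '{ if (NF > max) max = NF } END { print max + 0 }'
}
```

If the number printed is smaller than the end field of your key (36
here), then a key such as -k1,36 runs to the end of every line.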

But it appears that what you WANTED was to sort on just the first 36
bytes, with a stable sort of the results.  If so, then ASK for that, by
using the correct -k option:

>> $ printf '010_000001_0000731_00003_200000081610_     <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>\n010_000001_0000731_00003_200000081610_     <CPAGENAME>PAGE1</CPAGENAME>\n' \
>>    | LC_ALL=C sort --debug -k1,1.36 -s
>> sort: using simple byte comparison
>> 010_000001_0000731_00003_200000081610_     <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
>> ____________________________________
>> 010_000001_0000731_00003_200000081610_     <CPAGENAME>PAGE1</CPAGENAME>
>> ____________________________________
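
If you want this in a script, a small wrapper along the following lines
might do.  It is a sketch only: the function name and its width argument
are hypothetical, and it assumes the sort found first in PATH is GNU sort
(since -s is a GNU extension):

```shell
#!/bin/sh
# sort_by_prefix N: stable sort of stdin, keyed on the first N bytes of
# the first field.  Hypothetical helper; assumes GNU sort for -s.
sort_by_prefix() {
    # LC_ALL=C gives plain byte comparison, as in the --debug runs above.
    LC_ALL=C sort -s -k "1,1.$1"
}
```

For example, 'sort_by_prefix 36 < file > file2' would be the counterpart
of the command you were running.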

Note how I asked for a sort key -k1,1.36, which says to start in the
first field, and end 36 bytes into the first field (hmm, it looks like
you actually want 38 bytes - but I'll leave that for you to decide).
Also note that -s now makes a difference: the content of that first
sort key is identical in both lines, so when -s is not used the
last-resort full-line comparison swaps them:

>> $ printf '010_000001_0000731_00003_200000081610_     <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>\n010_000001_0000731_00003_200000081610_     <CPAGENAME>PAGE1</CPAGENAME>\n' \
>>    | LC_ALL=C sort --debug -k1,1.36
>> sort: using simple byte comparison
>> 010_000001_0000731_00003_200000081610_     <CPAGENAME>PAGE1</CPAGENAME>
>> ____________________________________
>> _______________________________________________________________________
>> 010_000001_0000731_00003_200000081610_     <SuppressOutBadVariableCopies></SuppressOutBadVariableCopies>
>> ____________________________________
>> ________________________________________________________________________________________________________
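
On the 36 versus 38 question, one way to settle it is to measure the
width of the first field rather than count by eye.  A minimal sketch,
with a hypothetical function name, assuming the first line of the input
is representative of the rest:

```shell
#!/bin/sh
# first_field_width: print the length of the first blank-separated field
# of the first line of stdin.  Hypothetical helper for illustration.
first_field_width() {
    awk 'NR == 1 { print length($1); exit }'
}
```

The number it prints is the value to use after the dot in -k1,1.N if you
want the key to cover the whole of that leading field.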

As this is a case of you not passing the correct command line arguments,
rather than a bug in sort, I am marking this bug as closed.  However,
feel free to continue to comment on the topic (preferably on-list) if
you have more questions.

-- 
Eric Blake   eblake redhat com    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
