From: Eric Blake
Subject: bug#10985: sort -k behavior possible problem: field span across the boundaries
Date: Fri, 09 Mar 2012 13:20:48 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20120216 Thunderbird/10.0.1
tag 10985 notabug
thanks
On 03/09/2012 12:46 PM, Oleg Moskalenko wrote:
> Hi
>
> While testing different GNU coreutils sort versions on different platforms
> (Linux and FreeBSD) I found that some behavior is probably not what a utility
> user expects.
Thanks for the report. However, you probably found behavior that is
required by POSIX.
>
> Let's, say, we have to sort (numerically stable) just two lines:
>
> $ sort -t "|" -ns -k2.3,2.7 <<!
> 1|234
> 1|2|34
> !
Let's use 'sort --debug' to see what really happened:
$ LC_ALL=C sort --debug -t\| -ns -k2.3,2.7 <<a
> 1|234
> 1|2|34
> a
sort: using simple byte comparison
1|234
    _
1|2|34
    __
So sort located the start of the second field ("234" on one line,
"2|34" on the other), then began the key at the 3rd byte counting from
that location, even where that byte lies in the next field. The keys
are therefore "4" and "34", and since 4 < 34 numerically, "1|234"
sorts first.
This behavior is required by POSIX:
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html
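As a rough illustration of that rule (a model only, not how sort is
implemented), the bytes selected by -k2.3,2.7 can be reproduced with a
few lines of awk. The helper name show_key is made up for this sketch,
and it hard-codes the "|" separator and the 2.3,2.7 key from the report:

```shell
# Hypothetical helper: print the bytes that -t'|' -k2.3,2.7 selects from each line.
show_key() {
  LC_ALL=C awk '{
    start = index($0, "|") + 1               # 1-based offset of the first byte of field 2
    if (start == 1) start = length($0) + 1   # no separator: field 2 lies beyond end of line
    # Take bytes 3..7 counted from the field start; substr stops at end of line
    # and does not care whether it crosses another separator.
    printf "%s -> [%s]\n", $0, substr($0, start + 2, 5)
  }'
}

printf '1|234\n1|2|34\n' | show_key
```

This should print "1|234 -> [4]" and "1|2|34 -> [34]", matching the
underlined spans in the --debug output.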
>
> The correct output (from my point of view) must be:
>
> 1|2|34
> 1|234
Sorry, but that interpretation does not match POSIX.
>
> My reasoning is that applying the key specs "-k2.3,2.7" to string "1|234" we
> obtain the key "4", and applying the same key to the string "1|2|34" we must
> obtain "" (empty string),
That's where you are wrong. POSIX states:
>> The notation:
>>
>> -k field_start[type][,field_end[type]]
>>
>> shall define a key field that begins at field_start and ends at field_end
>> inclusive, unless field_start falls beyond the end of the line or after
>> field_end, in which case the key field is empty. A missing field_end shall
>> mean the last character of the line.
>>
>> A field comprises a maximal sequence of non-separating characters and, in
>> the absence of option -t, any preceding field separator.
>>
>> The field_start portion of the keydef option-argument shall have the form:
>>
>> field_number[.first_character]
>>
>> Fields and characters within fields shall be numbered starting with 1. The
>> field_number and first_character pieces, interpreted as positive decimal
>> integers, shall specify the first character to be used as part of a sort
>> key. If .first_character is omitted, it shall refer to the first character
>> of the field.
That is, the field_start 2.3 means to start at the third character
counting from the start of the second field, regardless of whether any
field separators intervene, and _only_ the end of a line (not another
field separator) can result in an empty key field.
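A small check of the end-of-line case, assuming GNU sort and the C
locale; the third input line "5|2" is added here purely for
illustration and is not from the original report. Its second field is
a single byte and nothing follows it, so field_start 2.3 falls beyond
the end of the line and the key is empty, which -n treats as zero:

```shell
# Assumes GNU sort. Keys selected by -k2.3,2.7:
#   9|234  -> "4"
#   1|2|34 -> "34"   (crosses the separator)
#   5|2    -> ""     (start is past end of line; -n compares it as 0)
sorted=$(printf '9|234\n1|2|34\n5|2\n' | LC_ALL=C sort -t'|' -ns -k2.3,2.7)
printf '%s\n' "$sorted"
```

With those keys the expected order is "5|2", "9|234", "1|2|34".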
>
> I do not know whether this is an intended behavior or a bug,
Intended and mandated by the standards.
> but this is definitely non-intuitive and not what a reasonable user would
> expect.
Perhaps so, but if you want it changed, you need to file a bug report
against POSIX. As such, I'm going to close out this coreutils bug.
--
Eric Blake address@hidden +1-919-301-3266
Libvirt virtualization library http://libvirt.org