bug-coreutils

From: Eric Blake
Subject: bug#10985: sort -k behavior possible problem: field span across the boundaries
Date: Fri, 09 Mar 2012 13:20:48 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:10.0.1) Gecko/20120216 Thunderbird/10.0.1

tag 10985 notabug
thanks

On 03/09/2012 12:46 PM, Oleg Moskalenko wrote:
> Hi
> 
> While testing different GNU coreutils sort versions on different platforms 
> (Linux and FreeBSD) I found that some behavior is probably not what a utility 
> user expects.

Thanks for the report.  However, you probably found behavior that is
required by POSIX.

> 
> Let's, say, we have to sort (numerically stable) just two lines:
> 
> $ sort -t "|" -ns -k2.3,2.7 <<!
> 1|234
> 1|2|34
> !

Let's use 'sort --debug' to see what really happened:

$ LC_ALL=C sort --debug -t\| -ns -k2.3,2.7 <<a
> 1|234
> 1|2|34
> a
sort: using simple byte comparison
1|234
    _
1|2|34
    __

So sort located the start of the second field (at "234" in one line,
and at "2|34" in the other), then began the key at the 3rd byte counting
from that location, even when that byte lies in a later field.
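The same selection can be approximated outside of sort.  This is only a
sketch, assuming the '|' separator and the -k2.3,2.7 key from the report;
the helper name show_key is made up for illustration and is not part of
coreutils:

```shell
# Hypothetical helper: print the bytes that -t'|' -k2.3,2.7 would select
# from each input line.  A model of the rule, not the real sort code.
show_key() {
  awk '{
    sep = index($0, "|")               # position of the first separator
    if (sep == 0) { print ""; next }   # no second field: empty key
    start = sep + 1 + 2                # field 2 begins after the separator; .3 adds 2
    # substr stops at the end of the line, so the key is clipped there
    # and never at a later "|".
    print substr($0, start, 5)         # 2.3 through 2.7 inclusive is 5 bytes
  }'
}

printf '1|234\n1|2|34\n' | show_key
```

This prints "4" and "34", the same bytes that --debug underlines above.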

This behavior is required by POSIX:

http://pubs.opengroup.org/onlinepubs/9699919799/utilities/sort.html

> 
> The correct output (from my point of view) must be:
> 
> 1|2|34
> 1|234

Sorry, but that interpretation does not match POSIX.

> 
> My reasoning is that applying the key specs "-k2.3,2.7" to string "1|234" we 
> obtain the key "4", and applying the same key to the string "1|2|34" we must 
> obtain "" (empty string),

That's where you are wrong.  POSIX states:

>> The notation:
>> 
>> -k field_start[type][,field_end[type]]
>> 
>> shall define a key field that begins at field_start and ends at field_end 
>> inclusive, unless field_start falls beyond the end of the line or after 
>> field_end, in which case the key field is empty. A missing field_end shall 
>> mean the last character of the line.
>> 
>> A field comprises a maximal sequence of non-separating characters and, in 
>> the absence of option -t, any preceding field separator.
>> 
>> The field_start portion of the keydef option-argument shall have the form:
>> 
>> field_number[.first_character]
>> 
>> Fields and characters within fields shall be numbered starting with 1. The 
>> field_number and first_character pieces, interpreted as positive decimal 
>> integers, shall specify the first character to be used as part of a sort 
>> key. If .first_character is omitted, it shall refer to the first character 
>> of the field.

That is, a field_start of 2.3 means to start at the third character
counting from the start of the second field, regardless of whether any
field separators intervene, and _only_ running past the end of the line
or past field_end (and not crossing another field separator) can result
in an empty key field.
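One way to see the end-of-line case is to add a line that is too short to
reach position 2.3.  A sketch, assuming a sort that accepts -s (GNU sort
does); the third input line and the function name demo_sort are made up
for illustration:

```shell
# Hypothetical demo: "1|2" ends before character 2.3, so its key is empty.
# With -n an empty key compares as zero, so that line sorts first.
demo_sort() {
  printf '1|234\n1|2|34\n1|2\n' | LC_ALL=C sort -t'|' -ns -k2.3,2.7
}

demo_sort
```

The keys here are "4", "34" and the empty string, so the output order is
1|2, then 1|234, then 1|2|34.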

> 
> I do not know whether this is an intended behavior or a bug,

Intended and mandated by the standards.

> but this is definitely non-intuitive and not what a reasonable user would 
> expect.

Perhaps so, but if you want it changed, you need to file a bug report
against POSIX.  As such, I'm going to close out this coreutils bug.

-- 
Eric Blake   address@hidden    +1-919-301-3266
Libvirt virtualization library http://libvirt.org
