bug-coreutils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

coreutils "sort -b -kSTART,END.ENDCHAR" incompatibility with POSIX


From: Paul Eggert
Subject: coreutils "sort -b -kSTART,END.ENDCHAR" incompatibility with POSIX
Date: Sun, 25 Apr 2004 01:02:46 -0700
User-agent: Gnus/5.1006 (Gnus v5.10.6) Emacs/21.3 (gnu/linux)

Yesterday's email exchange with Stuart Allsop caused me to discover a
longstanding bug in coreutils "sort".  This bug causes "sort" to be
incompatible with both POSIX and traditional Unix sort.  Here's a
proposed patch.

2004-04-25  Paul Eggert  <address@hidden>

        Fix POSIX-conformance bug: "sort -k 3,3.5b" is supposed to skip
        leading blanks when computing the location of the field end;
        it is not supposed to skip trailing blanks.  Solaris 8 "sort"
        does conform to POSIX.  Also fix the documentation to clarify
        this and related issues.

        * doc/coreutils.texi (sort invocation): Mention -k earlier, so
        that the options are in alphabetical order.  Describe how -b works
        more-accurately; this involves fixing some examples, too.  Mention
        what happens if the start field falls after an end field or after
        a line end.  Warn about using -k without -b, -g, -M, -n, or -t.
        Add an example of how to sort IPv4 addresses and Apache Common
        Log Format dates.  Remove a duplicate example.
        (Putting the tools together): Use separate options rather
        than agglomerating them.
        * src/sort.c (limfield): Use skipeblanks, not skipsblanks, to
        decode whether to skip leading blanks.
        (trailing_blanks): Remove.
        (fillbuf, getmonth, keycompare): Don't trim trailing blanks.

        * tests/pr/Test.pm: Fix typo in env_default comment.
        * tests/sort/Test.pm: Likewise.
        (18c, 18d): Reverse the order of output lines, so that the
        test cases conform to POSIX.

Index: doc/coreutils.texi
===================================================================
RCS file: /home/meyering/coreutils/cu/doc/coreutils.texi,v
retrieving revision 1.178
diff -p -u -r1.178 coreutils.texi
--- doc/coreutils.texi  12 Apr 2004 09:22:02 -0000      1.178
+++ doc/coreutils.texi  25 Apr 2004 06:39:28 -0000
@@ -3248,6 +3248,17 @@ Other options are:
 
 @table @samp
 
address@hidden -k @var{pos1}[,@var{pos2}]
address@hidden address@hidden,@var{pos2}]
address@hidden -k
address@hidden --key
address@hidden sort field
+Specify a sort field that consists of the part of the line between
address@hidden and @var{pos2} (or the end of the line, if @var{pos2} is
+omitted), @emph{inclusive}.  Fields and character positions are numbered
+starting with 1.  So to sort on the second field, you'd use
address@hidden,2} (@option{-k 2,2}).  See below for more examples.
+
 @item -o @var{output-file}
 @itemx address@hidden
 @opindex -o
@@ -3313,8 +3324,10 @@ string between a non-blank character and
 That is, given the input line @address@hidden foo bar}}, @command{sort} breaks 
it
 into fields @address@hidden foo}} and @address@hidden bar}}.  The field 
separator is
 not considered to be part of either the field preceding or the field
-following.  But note that sort fields that extend to the end of the line,
-as @option{-k 2}, or sort fields consisting of a range, as @option{-k 2,3},
+following, so with @samp{sort @w{-t " "}} the same input line has
+three fields: an empty field, @samp{foo}, and @samp{bar}.
+However, fields that extend to the end of the line,
+as @option{-k 2}, or fields consisting of a range, as @option{-k 2,3},
 retain the field separators present between the endpoints of the range.
 
 To specify a zero byte (@acronym{ASCII} @sc{nul} (Null) character) as
@@ -3344,17 +3357,6 @@ Normally, output only the first of a seq
 equal.  For the @option{--check} (@option{-c}) option,
 check that no pair of consecutive lines compares equal.
 
address@hidden -k @var{pos1}[,@var{pos2}]
address@hidden address@hidden,@var{pos2}]
address@hidden -k
address@hidden --key
address@hidden sort field
-Specify a sort field that consists of the part of the line between
address@hidden and @var{pos2} (or the end of the line, if @var{pos2} is
-omitted), @emph{inclusive}.  Fields and character positions are numbered
-starting with 1.  So to sort on the second field, you'd use
address@hidden,2} (@option{-k 2,2}).  See below for more examples.
-
 @item -z
 @itemx --zero-terminated
 @opindex -z
@@ -3385,7 +3387,8 @@ of the field to use and @var{c} is the n
 from the beginning of the field.  In a start position, an omitted
 @address@hidden stands for the field's first character.  In an end
 position, an omitted or zero @address@hidden stands for the field's
-last character.  If the
+last character.  If the start field falls after the end of the line
+or after the end field, the field is empty.  If the
 @option{-b} option was specified, the @address@hidden part of a field
 specification is counted from the first nonblank character of the field.
 
@@ -3395,7 +3398,12 @@ for that particular field.  The @option{
 attached to either or both of the start and
 end positions of a field specification, and if it is inherited
 from the global options it will be attached to both.
-Keys may span multiple fields.
+If input lines can contain leading or adjacent blanks and @option{-t}
+is not used, then @option{-k} is typically combined with @option{-b},
address@hidden, @option{-M}, or @option{-n}; otherwise the varying
+numbers of leading blanks in fields can cause confusing results.
+
+Keys can span multiple fields.
 
 On older systems, @command{sort} supports an obsolete origin-zero
 syntax @address@hidden address@hidden for specifying sort keys.
@@ -3410,16 +3418,18 @@ Here are some examples to illustrate var
 Sort in descending (reverse) numeric order.
 
 @example
-sort -nr
+sort -n -r
 @end example
 
 @item
-Sort alphabetically, omitting the first and second fields.
+Sort alphabetically, omitting the first and second fields
+and the blanks at the start of the third field.
 This uses a single key composed of the characters beginning
-at the start of field three and extending to the end of each line.
+at the start of the first nonblank character in field three
+and extending to the end of each line.
 
 @example
-sort -k 3
+sort -k 3b
 @end example
 
 @item
@@ -3431,7 +3441,7 @@ Use @samp{:} as the field delimiter.
 sort -t : -k 2,2n -k 5.3,5.4
 @end example
 
-Note that if you had written @option{-k 2} instead of @option{-k 2,2}
+Note that if you had written @option{-k 2n} instead of @option{-k 2,2n}
 @command{sort} would have used all characters beginning in the second field
 and extending to the end of the line as the primary @emph{numeric}
 key.  For the large majority of applications, treating keys spanning
@@ -3447,18 +3457,58 @@ field-end part of the key specifier.
 @item
 Sort the password file on the fifth field and ignore any
 leading blanks.  Sort lines with equal values in field five
-on the numeric user ID in field three.
+on the numeric user ID in field three.  Fields are separated
+by @samp{:}.
 
 @example
 sort -t : -k 5b,5 -k 3,3n /etc/passwd
address@hidden example
-
-An alternative is to use the global numeric modifier @option{-n}.
-
address@hidden
 sort -t : -n -k 5b,5 -k 3,3 /etc/passwd
+sort -t : -b -k 5,5 -k 3,3n /etc/passwd
 @end example
 
+These three commands have equivalent effect.  The first specifies that
+the first key's start position ignores leading blanks and the second
+key is sorted numerically.  The other two commands rely on global
+options being inherited by sort keys that lack modifiers.  The inheritance
+works in this case because @option{-k 5b,5b} and @option{-k 5b,5} are
+equivalent, as the location of a field-end lacking a @address@hidden
+character position is not affected by whether initial blanks are
+skipped.
+
address@hidden
+Sort a set of log files, primarily by IPv4 address and secondarily by
+time stamp.  If two lines' primary and secondary keys are identical,
+output the lines in the same order that they were input.  The log
+files contain lines that look like this:
+
address@hidden
+4.150.156.3 - - [01/Apr/2004:06:31:51 +0000] message 1
+211.24.3.231 - - [24/Apr/2004:20:17:39 +0000] message 2
address@hidden example
+
+Fields are separated by exactly one space.  Sort IPv4 addresses
+lexicographically, e.g., 212.61.52.2 sorts before 212.129.233.201
+because 61 is less than 129.
+
address@hidden
+sort -s -t ' ' -k 4.9n -k 4.5M -k 4.2n -k 4.14,4.21 file*.log |
+sort -s -t '.' -k 1,1n -k 2,2n -k 3,3n -k 4,4n
address@hidden example
+
+This example cannot be done with a single @command{sort} invocation,
+since IPv4 address components are separated by @samp{.} while dates
+come just after a space.  So it is broken down into two invocations of
address@hidden: the first sorts by time stamp and the second by IPv4
+address.  The time stamp is sorted by year, then month, then day, and
+finally by hour-minute-second field, using @option{-k} to isolate each
+field.  Except for hour-minute-second there's no need to specify the
+end of each key field, since the @samp{n} and @samp{M} modifiers sort
+based on leading prefixes that cannot cross field boundaries.  The
+IPv4 addresses are sorted lexicographically.  The second sort uses
address@hidden so that ties in the primary key are broken by the secondary
+key; the first sort uses @samp{-s} so that the combination of the two
+sorts is stable.
+
 @item
 Generate a tags file in case-insensitive sorted order.
 
@@ -3470,21 +3520,6 @@ The use of @option{-print0}, @option{-z}
 that pathnames that contain Line Feed characters will not get broken up
 by the sort operation.
 
-Finally, to ignore both leading and trailing blanks, you
-could have applied the @samp{b} modifier to the field-end specifier
-for the first key,
-
address@hidden
-sort -t : -n -k 5b,5b -k 3,3 /etc/passwd
address@hidden example
-
-or by using the global @option{-b} modifier instead of @option{-n}
-and an explicit @samp{n} with the second key specifier.
-
address@hidden
-sort -t : -b -k 5,5 -k 3,3n /etc/passwd
address@hidden example
-
 @c This example is a bit contrived and needs more explanation.
 @c @item
 @c Sort records separated by an arbitrary string by using a pipe to convert
@@ -12972,7 +13007,7 @@ The final pipeline looks like this:
 
 @smallexample
 $ tr '[A-Z]' '[a-z]' < whats.gnu | tr -cd '[A-Za-z0-9_ \012]' |
-> tr -s '[ ]' '\012' | sort | uniq -c | sort -nr
+> tr -s '[ ]' '\012' | sort | uniq -c | sort -n -r
 @print{}  156 the
 @print{}   60 a
 @print{}   58 to
Index: src/sort.c
===================================================================
RCS file: /home/meyering/coreutils/cu/src/sort.c,v
retrieving revision 1.282
diff -p -u -r1.282 sort.c
--- src/sort.c  20 Apr 2004 15:08:57 -0000      1.282
+++ src/sort.c  25 Apr 2004 04:48:00 -0000
@@ -148,8 +148,8 @@ struct keyfield
   size_t echar;                        /* Additional characters in field. */
   bool const *ignore;          /* Boolean array of characters to ignore. */
   char const *translate;       /* Translation applied to characters. */
-  bool skipsblanks;            /* Skip leading blanks at start. */
-  bool skipeblanks;            /* Skip trailing blanks at finish. */
+  bool skipsblanks;            /* Skip leading blanks when finding start.  */
+  bool skipeblanks;            /* Skip leading blanks when finding end.  */
   bool numeric;                        /* Flag for numeric comparison.  Handle
                                   strings of digits with optional decimal
                                   point, but no exponential notation. */
@@ -912,7 +912,7 @@ limfield (const struct line *line, const
 
   /* If we're skipping leading blanks, don't start counting characters
      until after skipping past any leading blanks.  */
-  if (key->skipsblanks)
+  if (key->skipeblanks)
     while (ptr < lim && blanks[UCHAR (*ptr)])
       ++ptr;
 
@@ -926,17 +926,6 @@ limfield (const struct line *line, const
   return ptr;
 }
 
-/* Return the number of trailing blanks in FIELD, with LEN bytes.  */
-
-static size_t
-trailing_blanks (char const *field, size_t len)
-{
-  size_t i;
-  for (i = len; 0 < i && blanks[UCHAR (field[i - 1])]; i--)
-    continue;
-  return len - i;
-}
-
 /* Fill BUF reading from FP, moving buf->left bytes from the end
    of buf->buf to the beginning first.  If EOF is reached and the
    file wasn't terminated by a newline, supply one.  Set up BUF's line
@@ -1023,11 +1012,6 @@ fillbuf (struct buffer *buf, register FI
                          line_start++;
                      line->keybeg = line_start;
                    }
-                 if (key->skipeblanks)
-                   {
-                     size_t keylen = line->keylim - line->keybeg;
-                     line->keylim -= trailing_blanks (line->keybeg, keylen);
-                   }
                }
 
              line_start = ptr;
@@ -1312,7 +1296,6 @@ getmonth (const char *s, size_t len)
   month = alloca (len + 1);
   for (i = 0; i < len; ++i)
     month[i] = fold_toupper[UCHAR (s[i])];
-  len -= trailing_blanks (month, len);
   month[len] = '\0';
 
   do
@@ -1357,12 +1340,6 @@ keycompare (const struct line *a, const 
       /* Find the lengths. */
       size_t lena = lima <= texta ? 0 : lima - texta;
       size_t lenb = limb <= textb ? 0 : limb - textb;
-
-      if (key->skipeblanks)
-       {
-         lena -= trailing_blanks (texta, lena);
-         lenb -= trailing_blanks (textb, lenb);
-       }
 
       /* Actually compare the fields. */
       if (key->numeric | key->general_numeric)
Index: tests/pr/Test.pm
===================================================================
RCS file: /home/meyering/coreutils/cu/tests/pr/Test.pm,v
retrieving revision 1.16
diff -p -u -r1.16 Test.pm
--- tests/pr/Test.pm    9 Aug 2002 22:13:37 -0000       1.16
+++ tests/pr/Test.pm    25 Apr 2004 06:24:50 -0000
@@ -3,7 +3,7 @@ package Test;
 require 5.002;
 use strict;
 
-# Tell head to accept old-style options like `-1'.
+# Tell pr to accept old-style options like operand-less `-S'.
 $Test::env_default = ['_POSIX2_VERSION=199209'];
 
 my @tv = (
Index: tests/sort/Test.pm
===================================================================
RCS file: /home/meyering/coreutils/cu/tests/sort/Test.pm,v
retrieving revision 1.27
diff -p -u -r1.27 Test.pm
--- tests/sort/Test.pm  4 Sep 2003 22:11:16 -0000       1.27
+++ tests/sort/Test.pm  25 Apr 2004 05:00:58 -0000
@@ -3,7 +3,7 @@ package Test;
 require 5.002;
 use strict;
 
-# Tell head to accept old-style options like `-1'.
+# Tell sort to accept old-style options like `-1'.
 $Test::env_default = ['_POSIX2_VERSION=199209'];
 
 my @tv = (
@@ -162,15 +162,16 @@ my @tv = (
 # key specifier when a key-specific option (`n' in this case) is used.
 ["18b", '-b -k1.1,1.2n', " 901\n100\n", " 901\n100\n", 0],
 
-# No change from above because the `b' on the key-end part of the
-# key specifier makes sort ignore only trailing blanks
-["18c", '-k1.1,1.2nb', " 901\n100\n", " 901\n100\n", 0],
-
-# Here we're comparing `90' and `10', because the `b' on the key-start
-# specifier makes sort ignore *leading* blanks on that key.
-["18d", '-k1.1b,1.2n', " 901\n100\n", "100\n 901\n", 0],
+# Here we're comparing ` 90' and `10', because the `b' on the key-end specifier
+# makes sort ignore leading blanks when determining that key's *end*.
+["18c", '-k1.1,1.2nb', " 901\n100\n", "100\n 901\n", 0],
+
+# Here we're comparing `9' and `10', because the `b' on the key-start specifier
+# makes sort ignore leading blanks when determining that key's *start*.
+["18d", '-k1.1b,1.2n', " 901\n100\n", " 901\n100\n", 0],
 
-# Equivalent to above, except it ignores both leading and trailing blanks.
+# This compares `90' and `10', as it ignores leading blanks for both
+# key start and key end.
 ["18e", '-nb -k1.1,1.2', " 901\n100\n", "100\n 901\n", 0],
 
 # This looks odd, but works properly -- 2nd keyspec is never




reply via email to

[Prev in Thread] Current Thread [Next in Thread]