bug-coreutils

Re: sort on multicolumn files


From: Paul Eggert
Subject: Re: sort on multicolumn files
Date: Mon, 20 Feb 2006 18:58:20 -0800
User-agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)

P Kensche <address@hidden> writes:

> is sorted by "sort -k 1 a"

In general, that's not correct, since "sort -k 1" uses a key that runs
from field 1 through the end of the line, whereas 'join' compares only
field 1, ignoring leading blanks.  You need to use "sort -k 1b,1"
instead.  So, as far as I can tell, you haven't found a bug.
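
A minimal sketch of the difference, assuming the C locale and made-up
sample data (the names a.txt and b.txt are hypothetical and are not the
files from the original report).  The leading blank before "b" is what
makes the two sort keys disagree:

```shell
# Hypothetical sample data; a.txt and b.txt are illustrative names only.
export LC_ALL=C                        # same locale for sort and join
tmp=$(mktemp -d)
printf ' b 2\na 1\n' > "$tmp/a.txt"    # note the leading blank before "b"
printf 'a x\nb y\n'  > "$tmp/b.txt"    # already in join order

# Key runs from field 1 to end of line, leading blank included:
# the blank compares lower than "a", so the "b" line comes first.
whole=$(sort -k 1 "$tmp/a.txt")

# Key is field 1 only, leading blanks skipped: the order join expects.
field=$(sort -k 1b,1 "$tmp/a.txt")

# Feed the correctly sorted file to join (default join field, no options).
joined=$(sort -k 1b,1 "$tmp/a.txt" | join - "$tmp/b.txt")
printf '%s\n' "$joined"                # prints "a 1 x" then "b 2 y"
rm -rf "$tmp"
```

Here "sort -k 1" puts the " b 2" line ahead of "a 1", which is not the
order 'join' needs, while "sort -k 1b,1" puts "a 1" first.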

However, `-k 1b,1' isn't immediately obvious, and the documentation
should be improved here.  I installed the following patch to try to
improve things.  Thanks for reporting the problem.

2006-02-20  Paul Eggert  <address@hidden>

        * doc/coreutils.texi (join invocation): Mention `sort -k 1b,1'.
        * src/join.c (usage): Likewise.
        Documentation problem reported by Philip Kensche.

--- doc/coreutils.texi  20 Feb 2006 16:50:11 -0000      1.313
+++ doc/coreutils.texi  21 Feb 2006 02:50:39 -0000
@@ -4738,11 +4738,11 @@ lines that have identical join fields.  
 join [@var{option}]@dots{} @var{file1} @var{file2}
 @end example
 
-@vindex LC_COLLATE
 Either @var{file1} or @var{file2} (but not both) can be @samp{-},
 meaning standard input.  @var{file1} and @var{file2} should be
 sorted on the join fields.
 
+@vindex LC_COLLATE
 Normally, the sort order is that of the
 collating sequence specified by the @env{LC_COLLATE} locale.  Unless
 the @option{-t} option is given, the sort comparison ignores blanks at
@@ -4750,7 +4750,14 @@ the start of the join field, as in @code
 @option{--ignore-case} option is given, the sort comparison ignores
 the case of characters in the join field, as in @code{sort -f}.
 
-However, as a GNU extension, if the input has no unpairable lines the
+The @command{sort} and @command{join} commands should use consistent
+locales and options if the output of @command{sort} is fed to
+@command{join}.  You can use a command like @samp{sort -k 1b,1} to
+sort a file on its default join field, but if you select a non-default
+locale, join field, separator, or comparison options, then you should
+do so consistently between @command{join} and @command{sort}.
+
+As a GNU extension, if the input has no unpairable lines the
 sort order can be any order that considers two fields to be equal if and
 only if the sort comparison described above considers them to be equal.
 For example:
@@ -4841,6 +4848,8 @@ option---are subject to the specified @v
 @item -t @var{char}
 Use character @var{char} as the input and output field separator.
 Treat as significant each occurrence of @var{char} in the input file.
+Use @samp{sort -t @var{char}}, without the @option{-b} option of
+@command{sort}, to produce this ordering.
 
 @item -v @var{file-number}
 Print a line for each unpairable line in file @var{file-number}
--- src/join.c  18 Feb 2006 07:22:01 -0000      1.144
+++ src/join.c  21 Feb 2006 02:50:40 -0000
@@ -167,6 +167,7 @@ the remaining fields from FILE1, the rem
 separated by CHAR.\n\
 \n\
 Important: FILE1 and FILE2 must be sorted on the join fields.\n\
+E.g., use `sort -k 1b,1' if `join' has no options.\n\
 "), stdout);
       printf (_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
     }
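
As a rough illustration of the -t note in the patch, here is a sketch
that assumes the C locale and uses made-up, colon-separated data (the
file names are hypothetical).  With -t, leading blanks are part of the
join field, so the matching sort key is "-k 1,1" without the "b":

```shell
# Hypothetical colon-separated sample data; file names are illustrative.
export LC_ALL=C                        # same locale for sort and join
tmp=$(mktemp -d)
printf 'b:2\n a:1\n' > "$tmp/a.txt"    # " a" has a significant leading blank
printf ' a:x\nb:y\n' > "$tmp/b.txt"    # already sorted on field 1 with -t :

# Same separator for both commands, and no -b: the blank stays in the key.
sorted_t=$(sort -t : -k 1,1 "$tmp/a.txt")
joined_t=$(sort -t : -k 1,1 "$tmp/a.txt" | join -t : - "$tmp/b.txt")
printf '%s\n' "$joined_t"              # prints " a:1:x" then "b:2:y"
rm -rf "$tmp"
```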



