Re: sort on multicolumn files
From: Paul Eggert
Subject: Re: sort on multicolumn files
Date: Mon, 20 Feb 2006 18:58:20 -0800
User-agent: Gnus/5.1007 (Gnus v5.10.7) Emacs/21.4 (gnu/linux)
P Kensche <address@hidden> writes:
> is sorted by "sort -k 1 a"
In general, that's not correct, since it sorts on a key that runs from
field 1 through the end of the line, whereas 'join' compares only
field 1. You need to use "sort -k 1b,1" instead. So, as far as I can
tell, you haven't found a bug.
However, `-k 1b,1' isn't immediately obvious, and the documentation
should be improved here. I installed the following patch to try to
improve things. Thanks for reporting the problem.
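As a rough illustration of the advice above, here is a minimal sketch of the default case. The file names and sample data are made up for the example, and LC_ALL=C is just one way to keep the locale consistent between the two commands.

```shell
# Hypothetical sample data; names and contents are illustrative only.
tmp=$(mktemp -d)
printf 'b 2\na 1\nc 3\n' > "$tmp/a"
printf 'c z\na x\nb y\n' > "$tmp/b"

# Sort each file on field 1 only, ignoring leading blanks, so the
# order matches what 'join' expects when it is given no options.
LC_ALL=C sort -k 1b,1 "$tmp/a" > "$tmp/a.sorted"
LC_ALL=C sort -k 1b,1 "$tmp/b" > "$tmp/b.sorted"

# Run join under the same locale that was used for sorting.
joined=$(LC_ALL=C join "$tmp/a.sorted" "$tmp/b.sorted")
printf '%s\n' "$joined"

rm -rf "$tmp"
```

With this input the three lines pair up on a, b and c.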
2006-02-20 Paul Eggert <address@hidden>
* doc/coreutils.texi (join invocation): Mention `sort -k 1b,1'.
* src/join.c (usage): Likewise.
Documentation problem reported by Philip Kensche.
--- doc/coreutils.texi 20 Feb 2006 16:50:11 -0000 1.313
+++ doc/coreutils.texi 21 Feb 2006 02:50:39 -0000
@@ -4738,11 +4738,11 @@ lines that have identical join fields.
join [@var{option}]@dots{} @var{file1} @var{file2}
@end example
-@vindex LC_COLLATE
Either @var{file1} or @var{file2} (but not both) can be @samp{-},
meaning standard input. @var{file1} and @var{file2} should be
sorted on the join fields.
+@vindex LC_COLLATE
Normally, the sort order is that of the
collating sequence specified by the @env{LC_COLLATE} locale. Unless
the @option{-t} option is given, the sort comparison ignores blanks at
@@ -4750,7 +4750,14 @@ the start of the join field, as in @code
@option{--ignore-case} option is given, the sort comparison ignores
the case of characters in the join field, as in @code{sort -f}.
-However, as a GNU extension, if the input has no unpairable lines the
+The @command{sort} and @command{join} commands should use consistent
+locales and options if the output of @command{sort} is fed to
+@command{join}.  You can use a command like @samp{sort -k 1b,1} to
+sort a file on its default join field, but if you select a non-default
+locale, join field, separator, or comparison options, then you should
+do so consistently between @command{join} and @command{sort}.
+
+As a GNU extension, if the input has no unpairable lines the
sort order can be any order that considers two fields to be equal if and
only if the sort comparison described above considers them to be equal.
For example:
@@ -4841,6 +4848,8 @@ option---are subject to the specified @v
@item -t @var{char}
Use character @var{char} as the input and output field separator.
Treat as significant each occurrence of @var{char} in the input file.
+Use @samp{sort -t @var{char}}, without the @option{-b} option of
+@command{sort}, to produce this ordering.
@item -v @var{file-number}
Print a line for each unpairable line in file @var{file-number}
--- src/join.c 18 Feb 2006 07:22:01 -0000 1.144
+++ src/join.c 21 Feb 2006 02:50:40 -0000
@@ -167,6 +167,7 @@ the remaining fields from FILE1, the rem
separated by CHAR.\n\
\n\
Important: FILE1 and FILE2 must be sorted on the join fields.\n\
+E.g., use `sort -k 1b,1' if `join' has no options.\n\
"), stdout);
printf (_("\nReport bugs to <%s>.\n"), PACKAGE_BUGREPORT);
}
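
For the -t case mentioned in the patch above, the same idea might look like the sketch below. The colon separator, file names, and data are assumptions chosen for illustration. The sort key is restricted to field 1 with -k 1,1, and -b is left off so that blanks stay significant, as they are for 'join -t'.

```shell
# Hypothetical colon-separated sample data; illustrative only.
tmp=$(mktemp -d)
printf 'bob:20\nalice:10\n' > "$tmp/ids"
printf 'bob:/bin/sh\nalice:/bin/bash\n' > "$tmp/shells"

# Same separator for sort and join; no -b, so blanks are not skipped.
LC_ALL=C sort -t : -k 1,1 "$tmp/ids" > "$tmp/ids.sorted"
LC_ALL=C sort -t : -k 1,1 "$tmp/shells" > "$tmp/shells.sorted"

joined_t=$(LC_ALL=C join -t : "$tmp/ids.sorted" "$tmp/shells.sorted")
printf '%s\n' "$joined_t"

rm -rf "$tmp"
```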