Re: May strcoll return 0 if strcmp returns non-0
From: Stephane Chazelas
Subject: Re: May strcoll return 0 if strcmp returns non-0
Date: Tue, 31 Mar 2015 16:38:19 +0100
User-agent: Mutt/1.5.21 (2010-09-15)
2015-03-31 06:36:54 -0600, Eric Blake:
> FYI: This thread on the Austin Group mailing list claims that coreutils
> has a bug in at least uniq (although Stephane has not yet filed formal
> bug reports against the standard, so we may instead be able to get the
> standard relaxed to allow our behavior of collating rather than
> comparing strings).
[...]
Well, the problem is that, as per POSIX and as clarified by Geoff in
that thread, uniq should report unique lines, not just the first of a
sequence of lines that sort the same.
However, that would mean that uniq can no longer be used on the
output of sort in those locales that have collating elements
that sort the same (all UTF-8 locales with glibc).
For instance, in an en_US.UTF-8 locale (the default in the US for
most modern GNU systems):
$ printf '%b\n' '\u2461' '\u2460' '\u2461' | sort | recode ..dump
UCS2 Mne Description
2461 2-o circled digit two
000A LF line feed (lf)
2460 1-o circled digit one
000A LF line feed (lf)
2461 2-o circled digit two
000A LF line feed (lf)
That's because the sorting order of those characters is not
defined, so they all sort the same (and in that case their order
is not modified, as GNU sort implements a stable sort).
A POSIX uniq is required to leave that output untouched, while a
POSIX sort -u is required to output only one of those (either
U+2460 or U+2461).
GNU uniq behaviour is a bit more consistent in that sort|uniq
behaves like sort -u.
$ printf '%b\n' '\u2461' '\u2460' '\u2461' | sort | uniq | recode ..dump
UCS2 Mne Description
2461 2-o circled digit two
000A LF line feed (lf)
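A minimal sketch of the other behaviour, assuming only a POSIX shell
and uniq: running uniq in the C locale makes it compare lines byte by
byte, so it only drops lines that are really identical, which is what
POSIX asks of uniq as described above. strict_uniq and sample are
hypothetical helper names, not existing tools. Octal escapes stand in
for U+2460 (\342\221\240) and U+2461 (\342\221\241) so the input does
not depend on the shell's printf understanding \u.

```shell
# strict_uniq: hypothetical wrapper. Drops a line only when it is
# byte-for-byte identical to the previous one, whatever the locale.
strict_uniq() {
    LC_ALL=C uniq "$@"
}

# sample: hypothetical helper emitting U+2461, U+2460, U+2461 as UTF-8.
sample() {
    printf '\342\221\241\n\342\221\240\n\342\221\241\n'
}

# Locale-aware sort followed by byte-wise uniq.
sample | sort | strict_uniq
```

How many lines come out depends on what sort does with the two
characters in the current locale. If it leaves them interleaved as in
the dump above, strict_uniq has no adjacent identical lines to drop.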
Now, those would not be problems if all locales provided strict
total orders, where no two collating elements sort the same.
That's really what I have an issue with, as that breaks most
people's assumptions.
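One rough way to see whether the current locale falls short of a
strict total order for a given input is to compare how many lines
survive sort -u with and without the locale's collation. This is only
a sketch: count_collation_ties is a hypothetical helper name, and it
assumes mktemp is available.

```shell
# count_collation_ties: hypothetical helper. Reads lines on stdin and
# prints how many byte-wise distinct lines disappear when duplicates
# are removed with the current locale's collation instead.
count_collation_ties() {
    tmp=$(mktemp) || return 1
    cat > "$tmp"
    bytes=$(LC_ALL=C sort -u < "$tmp" | wc -l)   # distinct by bytes
    coll=$(sort -u < "$tmp" | wc -l)             # distinct by collation
    rm -f "$tmp"
    echo $((bytes - coll))
}

# U+2460 and U+2461 as UTF-8 octal escapes.
printf '\342\221\240\n\342\221\241\n' | count_collation_ties
```

A non-zero result means at least two different lines sorted the same.
In the C locale the result is always 0, since collation there is by
byte value.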
On the other hand, GNU awk is not conformant because its
"==" operator checks for "equality" while POSIX requires it to
check for "sorting the same".
POSIX requires U+2460 == U+2461 (in awk) to return true in
locales where those two characters sort the same.
I'm rather glad awk is not conformant here even if that means
that none of U+2460 < U+2461, U+2461 > U+2460 or U+2460 ==
U+2461 is true (note that GNU expr says yes to U+2460 = U+2461
(as required by POSIX)).
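The three awk comparisons can be tried with something like the
following. It is a sketch only: awk_compare is a hypothetical helper
name, and it uses whichever awk comes first in PATH under the current
locale. Octal escapes spell out the UTF-8 bytes of the two characters.

```shell
# awk_compare: hypothetical helper. Prints the results of ==, < and >
# (1 for true, 0 for false) for U+2460 against U+2461.
awk_compare() {
    awk 'BEGIN {
        a = "\342\221\240"   # U+2460 CIRCLED DIGIT ONE, UTF-8 bytes
        b = "\342\221\241"   # U+2461 CIRCLED DIGIT TWO, UTF-8 bytes
        printf "%d %d %d\n", (a == b), (a < b), (a > b)
    }'
}

awk_compare
```

With the GNU awk behaviour described above, one would expect all three
results to be 0 in a locale where the two characters sort the same,
whereas a conforming awk would report the first one as 1. In the C
locale the comparison is by byte value.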
Note that GNU comm and GNU join are potentially non-conformant
here as well (the discussion is not over on the Austin group
ML).
I'm not sure the GNU tools should be modified here (rather, POSIX
should be relaxed to allow the GNU behaviour), but I'd be in favour
of the locales shipped with the GNU libc being modified so that all
collating elements have a different order.
--
Stephane