emacs-bug-tracker

bug#38627: closed (uniq -c gets wrong count with non-ascii strings)


From: GNU bug Tracking System
Subject: bug#38627: closed (uniq -c gets wrong count with non-ascii strings)
Date: Sun, 23 Feb 2020 19:44:01 +0000

Your message dated Sun, 23 Feb 2020 19:43:27 +0000
with message-id <address@hidden>
and subject line Re: bug#38627: uniq -c gets wrong count with non-ascii strings
has caused the debbugs.gnu.org bug report #38627,
regarding uniq -c gets wrong count with non-ascii strings
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden.)


-- 
38627: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=38627
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: uniq -c gets wrong count with non-ascii strings
Date: Sun, 15 Dec 2019 14:40:14 -0500

With the following input:

$ cat x
"ⁿᵘˡˡ"
"ܥܝܪܐܩ"

Running "uniq -c" reports two copies of the same line!

$ uniq -c x
      2 "ⁿᵘˡˡ"

I've attached a copy of the test file, and here's the octal dump:

$ od -b x
0000000 042 342 201 277 341 265 230 313 241 313 241 042 012 042 334 245
0000020 334 235 334 252 334 220 334 251 042 012
0000032


I'm getting this on:

Linux tools-sgebastion-08 4.9.0-8-amd64 #1 SMP Debian 4.9.130-2 (2018-10-27) x86_64 GNU/Linux
uniq (GNU coreutils) 8.26

My macOS 10.13.6 box gets it right:

$ uniq -c x
   1 "ⁿᵘˡˡ"
   1 "ܥܝܪܐܩ"


Attachment: x
Description: Binary data


--- End Message ---
--- Begin Message ---
Subject: Re: bug#38627: uniq -c gets wrong count with non-ascii strings
Date: Sun, 23 Feb 2020 19:43:27 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:73.0) Gecko/20100101 Thunderbird/73.0

On 17/12/2019 17:25, Roy Smith wrote:
I stopped short of actually building uniq.c from source (bootstrap,
prerequisites, ...), but from reading the code, the call chain appears to be:

different()
xmemcoll()
memcoll()
strcoll()

so I tried a little test at the strcoll() level:

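#include <stdio.h>
#include <string.h>

int
main (void)
{
   /* UTF-8 bytes of the two strings from the od dump, without the quotes.  */
   unsigned char null[] = {
     0342, 0201, 0277, 0341, 0265, 0230, 0313, 0241, 0313, 0241, 0
   };
   unsigned char iraq[] = {
     0334, 0245, 0334, 0235, 0334, 0252, 0334, 0220, 0334, 0251, 0
   };

   printf("%s\n", (const char *) null);
   printf("%s\n", (const char *) iraq);

   /* There is no setlocale() call here, so strcoll() runs in the "C" locale.  */
   int m = strcoll((const char *) null, (const char *) iraq);
   printf("m = %d\n", m);
   return 0;
}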

That correctly says the strings are different:

$ LANG=en_US.UTF-8 ./a.out
ⁿᵘˡˡ
ܥܝܪܐܩ
m = 6

On Dec 16, 2019, at 7:46 PM, Roy Smith <address@hidden> wrote:

Yup, this does depend on the locale.  In my original example, I had 
LANG=en_US.UTF-8.  Setting it to C.UTF-8 gets me the right result:

$ LANG=C.UTF-8 uniq -c x
       1 "ⁿᵘˡˡ"
       1 "ܥܝܪܐܩ"


But that doesn't fully explain what's going on.  I find it difficult to
believe that there's any collation sequence in the world in which those two
strings should compare equal.  I've been playing around with the ICU string
compare demo
<http://demo.icu-project.org/icu-bin/locexp?_=en_US&d_=en&x=col> and can't
reproduce this there.  Possibly I just haven't hit upon the right combination
of options, but it seems far-fetched that any such combination would
legitimately make those two strings compare equal.

I think you ran your test on a newer glibc.
Testing on the older glibc-2.22, I see strcoll() returning 0 for the
above strings, while it returns the expected difference on glibc-2.30
at least.

There are a few things to reason about when removing strcoll() from uniq, namely:
  buggy strcoll() implementations
  inconsistent Unicode normalization
  mismatched locale settings and data
  handling of characters ignored in collation order

tl;dr: strcoll() should be removed for all of these reasons.
The attached patch, which I'll push later, adds a test for each of
the 4 cases above.

Marking this as done.

thanks,
Pádraig

Attachment: uniq-no-strcoll.patch
Description: Text Data


--- End Message ---
