bug-coreutils


From: Randall Lewis
Subject: bug#7878: "sort" bug--inconsistent single-column sorting influenced by other columns?
Date: Thu, 20 Jan 2011 23:29:42 -0800

Hi Bob--

Wow! So, a couple comments about how I seem to have figured out every wrong way 
to use "sort" when also using "join."

Who would've thought that 

sort -k1 test1.txt

would default to sort on the entire line? (I normally would've thought that 
[,POS2] means "optional if you want to have it keep going beyond the first 
field.")

Also, who would've thought that the default "sort" would be incompatible with 
"join" and that you would need to write the command like this every time you 
wanted to use "join"?

LC_ALL=C sort test1.txt

Or that you would need a special type of "pre-sort" on the column (which I was 
executing wrong)?

sort -k1,1 -t "|" test1.txt

Regardless, here is "locale" (for the record, I'm pretty new to the 
utilities--and love them. I'm not a computer scientist, but rather an economist 
trying to fit in at Yahoo! with the engineers and computer scientists). I'm 
sure there's a good reason why there are two, and it's pretty clear that I'm 
novice enough that I'll have to learn that later.

bash-3.2$ locale
LANG=en_US.UTF-8
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

Thanks, Bob, for sharing two separate ways that I could get the answer the way 
I need it--two ways I could not have come up with on my own.

Thanks!

--Randall

P.S. So, the reason why sorting on the column didn't work for me was because it 
was plucking out the delimiter and then doing a string sort? Then it was string 
sorting, putting numbers before letters (as you might expect it to)? 

bash-3.2$ sort test1.txt
323|1
36|2
406|3
40|7 <-- Changing 4 to 7 changed the sort order.
587|5

bash-3.2$ sort test1.txt
323|1
36|2
40|4
406|3
587|5





-----Original Message-----
From: Bob Proulx [mailto:address@hidden 
Sent: Thursday, January 20, 2011 10:02 PM
To: Randall Lewis
Cc: address@hidden
Subject: Re: bug#7878: "sort" bug--inconsistent single-column sorting 
influenced by other columns?

Randall Lewis wrote:
> "sort" does inconsistent sorting.

You are sure about that?  :-)

> I'm pretty sure it has NOTHING to do with the following warning,
> although I could be totally wrong.
> 
> " *** WARNING ***
> The locale specified by the environment affects sort order.
> Set LC_ALL=C to get the traditional sort order that uses
> native byte values. "

You read this, know that sort will base the sorting upon the locale
setting, but didn't tell us what locale you were using to sort?  Shame
on you.  Because you *know* I am going to ask you about it! :-)

What locale are you using?  C?  en_US.UTF-8?  Some other?  The locale
command will print this information.  Here is an example from my system.

  $ locale
  LANG=en_US.UTF-8
  LC_CTYPE="en_US.UTF-8"
  LC_NUMERIC="en_US.UTF-8"
  LC_TIME="en_US.UTF-8"
  LC_COLLATE=C
  LC_MONETARY="en_US.UTF-8"
  LC_MESSAGES="en_US.UTF-8"
  LC_PAPER="en_US.UTF-8"
  LC_NAME="en_US.UTF-8"
  LC_ADDRESS="en_US.UTF-8"
  LC_TELEPHONE="en_US.UTF-8"
  LC_MEASUREMENT="en_US.UTF-8"
  LC_IDENTIFICATION="en_US.UTF-8"
  LC_ALL=

> sort test1.txt
> 323|1
> 36|2
> 40|4
> 406|3
> 587|5

> sort test7.txt
> 323|B1
> 36|C2
> 406|B3
> 40|B4
> 587|C5

Looks okay to me for the en_US.UTF-8 locale.  But it will of course be
different in the C locale.

  $ LC_ALL=en_US.UTF-8 sort test1.txt 
  323|1
  36|2
  40|4
  406|3
  587|5

  $ LC_ALL=C sort test1.txt 
  323|1
  36|2
  406|3
  40|4
  587|5

What ordering did you expect there?  I assume you are expecting to see
these sorted as in the C locale?

> The rows are in a different order depending on the dataset--and it
> is NOT a numeric sort. I'm not even sure it is ANY type of sort.

It is a character sort.  A string sort.  It is comparing the line of
characters from start to finish.  But it uses the system's collation
tables based upon the locale.  In the en_US.UTF-8 locale punctuation
is ignored and case is folded.  I don't like it but the powers that be
have decreed it.
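
As a rough illustration of the byte-by-byte comparison used in the C
locale, here is a small sketch. It regenerates the test1.txt data from
this thread inline rather than reading the file; the variable names
are only for illustration. The result under en_US.UTF-8 depends on the
system's collation tables, so only the C locale case is shown.

```shell
# Same five lines as test1.txt in this thread, generated inline.
data=$(printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5')

# In the C locale, whole lines are compared byte by byte.
# "406|3" and "40|4" first differ at the third byte:
# '6' (0x36) against '|' (0x7C), so "406|3" sorts first.
c_sorted=$(printf '%s\n' "$data" | LC_ALL=C sort)

printf '%s\n' "$c_sorted"
```

The delimiter is not removed before comparing. Without -t and -k the
whole line, including the "|", is the sort key.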

Please see the FAQ:

  
  http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

The standards documentation:

  http://www.opengroup.org/onlinepubs/009695399/utilities/sort.html

Variables that control localization:

  
  http://www.opengroup.org/onlinepubs/009695399/basedefs/xbd_chap08.html#tag_08_02

> sort -k1 -t "|" test1.txt

Hint: If you ever think you need to use -k POS1 then you almost always
should be using -k POS1,POS2 to specify where you want the sort to
stop comparing.  Otherwise it compares all of the way to the end of
the line.
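
A minimal sketch of the difference between -k1 and -k1,1, assuming a
sort that supports -s (GNU sort does). The two input lines are made up
for the illustration and share the same first field. The -s option
turns off the last-resort whole-line comparison, so only the key
decides the order.

```shell
# Hypothetical input: both lines have "a" as the first field.
input=$(printf '%s\n' 'a|2' 'a|1')

# -k1: the key runs from field 1 to the end of the line,
# so the second field still takes part in the comparison.
open_key=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t '|' -k1)

# -k1,1: the key is field 1 only. The lines tie on "a" and,
# because of -s, keep their input order.
closed_key=$(printf '%s\n' "$input" | LC_ALL=C sort -s -t '|' -k1,1)

printf 'with -k1:\n%s\n' "$open_key"
printf 'with -k1,1:\n%s\n' "$closed_key"
```

With -k1 the lines come out as a|1 then a|2. With -k1,1 they stay as
a|2 then a|1.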

> But why did it sort inconsistently in the first place based on the
> other contents of the file rather than just focusing on the first
> column--even when I told it to?

You never told it not to continue comparing all of the way to the end
of the line.  For example this way:

  $ sort -t'|' -k1,1n -k2,2n test1.txt 
  36|2
  40|4
  323|1
  406|3
  587|5

That won't help you with join since that expects a non-numeric sort
ordering.
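
For join, one approach is to sort both inputs on the join field with
the same locale and the same delimiter that join itself will use. The
following is a sketch only. The left file repeats the test1.txt data
from this thread, while the right file and all file names are
hypothetical.

```shell
tmp=$(mktemp -d)

# Left input: the test1.txt data from this thread.
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$tmp/left.txt"
# Right input: hypothetical data, deliberately unsorted.
printf '%s\n' '587|z' '36|x' '40|y' > "$tmp/right.txt"

# Sort both files on field 1 only, as strings, in the C locale.
LC_ALL=C sort -t '|' -k1,1 "$tmp/left.txt"  > "$tmp/left.sorted"
LC_ALL=C sort -t '|' -k1,1 "$tmp/right.txt" > "$tmp/right.sorted"

# Join on field 1 (the default) with the same locale and delimiter.
joined=$(LC_ALL=C join -t '|' "$tmp/left.sorted" "$tmp/right.sorted")

printf '%s\n' "$joined"
rm -rf "$tmp"
```

This prints 36|2|x, 40|4|y and 587|5|z, one per line. The keys 323 and
406 have no partner in the right file and are dropped.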

> Inconsistent sorting when combined with 'join' provides incorrect
> matches and duplication of records. This is a mess.

Yes.  Recent versions of join detect and warn about this.  Recent
versions of sort have a --debug option that can help to identify
problem cases.
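
Separately from those options, sort -c can check whether a file is
already in the order join expects, without producing any output. This
is a sketch using the test1.txt data from this thread, written to a
temporary file. The variable names are only for illustration.

```shell
tmp=$(mktemp -d)
# The test1.txt lines in the order the en_US.UTF-8 sort produced them.
printf '%s\n' '323|1' '36|2' '40|4' '406|3' '587|5' > "$tmp/test1.txt"

# Check the order of field 1 alone. Exit status 0 means sorted.
if LC_ALL=C sort -c -t '|' -k1,1 "$tmp/test1.txt" 2>/dev/null; then
  key_sorted=yes
else
  key_sorted=no
fi

# Check the order of whole lines. Here "40|4" before "406|3"
# is out of order, because '|' is a larger byte than '6'.
if LC_ALL=C sort -c "$tmp/test1.txt" 2>/dev/null; then
  line_sorted=yes
else
  line_sorted=no
fi

printf 'sorted on field 1: %s\n' "$key_sorted"
printf 'sorted as whole lines: %s\n' "$line_sorted"
rm -rf "$tmp"
```

The same file passes the field 1 check and fails the whole-line check,
which is the difference between -k1,1 and a plain sort in the C locale.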

Bob




