bug-coreutils

Re: possible bug in sort


From: John Novatnack
Subject: Re: possible bug in sort
Date: Sun, 10 Dec 2006 16:37:46 -0500

Cheers Bob.  Thanks for your insightful email.  My supposed "bug" has been
resolved :-).

On 12/9/06, Bob Proulx <address@hidden> wrote:

John Novatnack wrote:
> I ran across strange behavior of the Unix sort command.

Thanks for the bug report.  But as you know GNU is Not Unix.  :-)
What version of GNU sort are you using?

  sort --version

What is your locale setting?

  locale

> But now see what happens when I add a trailing zero.
>
> $ sort -n
> 0.1 2
> 0.1 3
> 0.1 1
> 0.10 2
>
> 0.10 2
> 0.1 1
> 0.1 2
> 0.1 3

I cannot recreate your problem on my Debian system using either the
stock sort or the latest CVS sort.  However, my eye spots some problems
with your usage.  The documentation says this:

     Numeric sort uses what might be considered an unconventional
     method to compare strings representing floating point numbers.
     Rather than first converting each string to the C `double' type
     and then comparing those values, `sort' aligns the decimal-point
     characters in the two strings and compares the strings a character
     at a time.  One benefit of using this approach is its speed.  In
     practice this is much more efficient than performing the two
     corresponding string-to-double (or even string-to-integer)
     conversions and then comparing doubles.  In addition, there is no
     corresponding loss of precision.  Converting each string to
     `double' before comparison would limit precision to about 16
     digits on most systems.
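A small sketch of what that passage implies, assuming GNU sort and the C locale (the variable names are only for illustration).  The first command sorts two values that differ in the 20th decimal place, which is beyond what a `double' could tell apart; `-s' turns off the last-resort whole-line comparison so that the resulting order can only come from the numeric comparison itself.  The second shows that `0.1' and `0.10' compare as equal, so `-u' keeps just one of them.

```shell
# Two values that differ only in the 20th decimal place.
# -s disables the last-resort whole-line comparison, so the order
# printed here is decided by the numeric comparison alone.
precise=$(printf '1.00000000000000000002\n1.00000000000000000001\n' | LC_ALL=C sort -s -n)
printf '%s\n' "$precise"

# 0.1 and 0.10 are numerically equal, so -u collapses them to one line.
unique_count=$(printf '0.10\n0.1\n' | LC_ALL=C sort -n -u | wc -l)
echo "$unique_count"
```

The first command should print the value ending in 1 before the value ending in 2, and the second should report a single line.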

Also:

     A pair of lines is compared as follows: if any key fields have been
     specified, `sort' compares each pair of fields, in the order specified
     on the command line, according to the associated ordering options,
     until a difference is found or no fields are left.  Unless otherwise
     specified, all comparisons use the character collating sequence
     specified by the `LC_COLLATE' locale.  (1)

And importantly:

     For the large majority of applications, treating keys spanning
     more than one field as numeric will not do what you expect.

Therefore you should specify field options.  Since you apparently want
to sort on the first field and then on the second field, this is
really what you want.

Try this:

  printf "0.1 2\n0.1 3\n0.1 1\n0.10 2\n" | sort -k 1,1n -k 2,2n
  0.1 1
  0.1 2
  0.10 2
  0.1 3
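Note that `0.1 2' and `0.10 2' tie on both keys, so where they land relative to each other is decided by a last-resort comparison of the whole line.  A hedged sketch of that, assuming GNU sort (the variable names are only for illustration): with the two tied lines fed in the opposite order, the C locale puts `0.1 2' first because a space sorts before `0', while `-s' (stable) skips the last-resort comparison and keeps the input order.

```shell
# Both lines tie on key 1 (0.10 == 0.1) and on key 2 (2 == 2).
tied='0.10 2\n0.1 2\n'

# Default: the tie is broken by a byte-wise comparison of the whole line.
last_resort=$(printf "$tied" | LC_ALL=C sort -k 1,1n -k 2,2n)
printf '%s\n' "$last_resort"

# -s: no last-resort comparison, so tied lines keep their input order.
stable=$(printf "$tied" | LC_ALL=C sort -s -k 1,1n -k 2,2n)
printf '%s\n' "$stable"
```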

Don't miss the note about sort respecting your current locale
settings.

     (1) If you use a non-POSIX locale (e.g., by setting `LC_ALL' to
     `en_US'), then `sort' may produce output that is sorted differently
     than you're accustomed to.  In that case, set the `LC_ALL' environment
     variable to `C'.  Note that setting only `LC_COLLATE' has two problems.
     First, it is ineffective if `LC_ALL' is also set.  Second, it has
     undefined behavior if `LC_CTYPE' (or `LANG', if `LC_CTYPE' is unset) is
     set to an incompatible value.  For example, you get undefined behavior
     if `LC_CTYPE' is `ja_JP.PCK' but `LC_COLLATE' is `en_US.UTF-8'.
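To see the effect on the input from the original report, here is a minimal sketch assuming GNU sort (the variable name is only for illustration).  With plain `-n' every line starts with a number equal to 0.1, so the order falls to the last-resort whole-line comparison, and that comparison follows the locale.  Forcing the C locale makes it a plain byte comparison.  That the reported output came from a non-POSIX locale is my guess, not something confirmed in this thread.

```shell
# Original input, numeric sort on the whole line, C locale forced.
# All leading numbers are equal, so the byte-wise last-resort
# comparison decides the order (space sorts before '0').
original=$(printf '0.1 2\n0.1 3\n0.1 1\n0.10 2\n' | LC_ALL=C sort -n)
printf '%s\n' "$original"
```

In the C locale this should list `0.1 1', `0.1 2', `0.1 3' and then `0.10 2'.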

Bob


