bug#8067: sort fails to sort completely, due to "similar" keys.

From: Eric Blake
Subject: bug#8067: sort fails to sort completely, due to "similar" keys.
Date: Thu, 17 Feb 2011 14:30:55 -0700

On 02/17/2011 01:46 PM, Bob Harris wrote:
> Howdy,
> (note: I know I should give you version information with this, but (1) I
> am not sure that this message will be read by anyone, and (2) I think
> the problem probably transcends versions.  If I get a response and the
> actual version is important, I will take the time to find it.)

Thanks for the report, and you are correct that your issue transcends
versions.  However, if you use coreutils 8.6 or newer (the latest is
8.10), then the new --debug option would have helped you.

> I have a file of genomic short sequence info in which it so happens that
> two of my sort key values are similar.  The two keys are
>     HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
>     HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1
> As you can see, these are identical if one removes the colons.

That is exactly what sort does in the en_US.UTF-8 locale, where the
collation rules largely ignore punctuation such as colons when
comparing.

> I have tried several different options but none seem to work.  -d seems
> to be the default, and it has the behavior indicated above.  -n fails
> completely.  -g also fails.  Reading the man page, I don't see any other
> options to control the comparison function.

Then you missed this part (in the sort man page, which is in turn
generated from 'sort --help'):

*** WARNING ***
The locale specified by the environment affects sort order.
Set LC_ALL=C to get the traditional sort order that uses
native byte values.

> I understand *why* -d considers these two keys equal.  What I don't
> understand is why there is no option that says "order them
> lexicographically".

That option is your set of locale-specific environment variables.  It
is not an explicit option because of historical accident (that's the
way POSIX specified it).  Maybe GNU sort should add a
--collate-locale=... option as an extension that overrides LC_ALL, but
that seems a bit like bloat, and doesn't buy much over the
standardized means of choosing the collation sequence.

> Is there a hidden sort option that will do what I need?

Yep - try 'LC_ALL=C sort ...' to see the difference.
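
A minimal sketch of that, using the two keys from your report (the
variable names `keys` and `sorted` are only illustrative, and the keys
are piped in here rather than read from your real file):

```shell
# The two "similar" keys from the report, one per line.
keys='HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1'

# LC_ALL=C applies to this one sort invocation only and forces
# plain byte-value comparison, so the colons are no longer ignored.
sorted=$(printf '%s\n' "$keys" | LC_ALL=C sort)

printf '%s\n' "$sorted"
```

In the C locale the keys first differ right after ":5:2", where one has
'1' (byte 0x31) and the other has ':' (byte 0x3A), so the "5:21" key
should come out ahead of the "5:2:" key.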

> I'm pretty sure I'm not the first person to run into this problem.

You're not.  It's a FAQ:


Eric Blake   address@hidden    +1-801-349-2682
Libvirt virtualization library http://libvirt.org
