emacs-bug-tracker
[Emacs-bug-tracker] bug#8067: closed (sort fails to sort completely, du


From: GNU bug Tracking System
Subject: [Emacs-bug-tracker] bug#8067: closed (sort fails to sort completely, due to "similar" keys.)
Date: Thu, 17 Feb 2011 21:43:02 +0000

Your message dated Thu, 17 Feb 2011 22:51:47 +0100
with message-id <address@hidden>
and subject line Re: bug#8067: sort fails to sort completely, due to "similar" 
keys.
has caused the GNU bug report #8067,
regarding sort fails to sort completely, due to "similar" keys.
to be marked as done.

(If you believe you have received this mail in error, please contact
address@hidden)


-- 
8067: http://debbugs.gnu.org/cgi/bugreport.cgi?bug=8067
GNU Bug Tracking System
Contact address@hidden with problems
--- Begin Message ---
Subject: sort fails to sort completely, due to "similar" keys.
Date: Thu, 17 Feb 2011 15:46:16 -0500

Howdy,

(note: I know I should give you version information with this, but (1) I am not sure that this message will be read by anyone, and (2) I think the problem probably transcends versions. If I get a response and the actual version is important, I will take the time to find it.)

I have a file of genomic short sequence info in which it so happens that two of my sort key values are similar. The two keys are
        HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
        HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1
As you can see, these are identical if one removes the colons.

Unfortunately, I have a file with something on the order of 4 million lines, and there are roughly a dozen lines with each of these keys. I am using sort with the intent of collecting the lines for each key together. (I don't really care about ordering; I just need to group lines with the same key together to facilitate downstream processing.) The unfortunate part is that sort considers the two keys equal, and so it fails to create the grouping I need.

I have tried several different options but none seem to work. -d seems to be the default, and it has the behavior indicated above. -n fails completely. -g also fails. Reading the man page, I don't see any other options to control the comparison function. I have also tried massaging my file prior to piping into sort, replacing colons with other characters (e.g. underscore or tilde) but with no success.

I understand *why* -d considers these two keys equal. What I don't understand is why there is no option that says "order them lexicographically".

Is there a hidden sort option that will do what I need?

About the only way I can think to force sort to actually sort on such a key is to pre-process the file and replace the keys with a hash code (rendered with nothing but A-Z). But this introduces additional issues, such as maintaining a table so I can convert the keys back after sorting, and making sure my hash is unique, etc. etc.

I'm pretty sure I'm not the first person to run into this problem.

Thanks for any help or advice.
Bob H



--- End Message ---
--- Begin Message ---
Subject: Re: bug#8067: sort fails to sort completely, due to "similar" keys.
Date: Thu, 17 Feb 2011 22:51:47 +0100

Eric Blake wrote:
...
>> I'm pretty sure I'm not the first person to run into this problem.
>
> You're not.  It's a FAQ:
>
> http://www.gnu.org/software/coreutils/faq/#Sort-does-not-sort-in-normal-order_0021

Thanks for replying, Eric.
I'm marking this ticket as closed.


--- End Message ---
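
For readers who land here with the same symptom: the FAQ entry linked in the reply attributes this behavior to locale-aware collation, and the usual remedy is to run sort in the C locale so that keys are compared byte by byte. The following is only a minimal sketch of that idea, assuming GNU sort and a POSIX shell. The two keys are taken from the report; the one-letter second column is a made-up payload used to make the grouping visible.

```shell
#!/bin/sh
# Four sample records, two per key, deliberately interleaved.
# The trailing letter on each line is a hypothetical payload, not data from the report.
input='HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1 a
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1 b
HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1 c
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1 d'

# LC_ALL=C makes sort compare raw bytes, so ':' and '#' are significant.
# -k1,1 restricts the sort key to the first blank-separated field.
sorted=$(printf '%s\n' "$input" | LC_ALL=C sort -k1,1)

# Print the result; lines sharing a key should now be adjacent.
printf '%s\n' "$sorted"
```

In the C locale the colon is an ordinary byte, so the ":5:2:" and ":5:21:" keys no longer compare equal and the records for each key end up next to each other. Whether this matches the reporter's environment is an assumption, since the original message did not include version or locale information.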
