bug#8067: sort fails to sort completely, due to "similar" keys.
From: Bob Harris
Subject: bug#8067: sort fails to sort completely, due to "similar" keys.
Date: Thu, 17 Feb 2011 15:46:16 -0500
Howdy,
(Note: I know I should include version information with this, but (1)
I am not sure that this message will be read by anyone, and (2) I
think the problem probably transcends versions. If I get a response
and the actual version is important, I will take the time to find it.)
I have a file of genomic short-sequence information in which two of my
sort key values happen to be similar. The two keys are:
HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1
HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1
As you can see, these are identical if one removes the colons.
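The claim about the colons can be checked mechanically. The following is a minimal sketch that only uses the two keys quoted above; the helper name strip_colons is made up for illustration.

```python
# The two keys, copied verbatim from the message above.
key_a = "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1"
key_b = "HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1"


def strip_colons(key: str) -> str:
    """Return the key with every colon removed (hypothetical helper)."""
    return key.replace(":", "")


# With the colons removed, both keys collapse to the same string ...
same_without_colons = strip_colons(key_a) == strip_colons(key_b)
# ... even though the keys differ as written.
differ_as_written = key_a != key_b

print(same_without_colons, differ_as_written)
```

Any comparison that skips punctuation will therefore see these two keys as the same string.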
Unfortunately, I have a file with something on the order of 4 million
lines, and there are roughly a dozen lines with each of these keys. I
am using sort with the intent of collecting the lines for each key
together. (I don't really care about ordering; I just need to group
lines with the same key together to facilitate downstream
processing.) The unfortunate part is that sort treats the two keys
as equal, so it fails to create the grouping I need.
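Since only the grouping matters here and not the order, one possible workaround is to group the lines by exact key match without sorting at all. The sketch below is an illustration, not something taken from the message: it assumes the key is the first whitespace-delimited field of each line (the file layout is not stated above), that the whole file fits in memory, and the sample payloads read-1 to read-3 are invented.

```python
from collections import defaultdict


def first_field(line: str) -> str:
    """Assumed key extractor: the first whitespace-delimited field."""
    return line.split(None, 1)[0]


def group_by_key(lines, key_of=first_field):
    """Collect lines that share exactly the same key, skipping blank lines."""
    groups = defaultdict(list)
    for line in lines:
        if line.strip():
            groups[key_of(line)].append(line)
    return groups


# Invented sample lines that reuse the two keys from the message.
sample = [
    "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1\tread-1",
    "HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1\tread-2",
    "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1\tread-3",
]
groups = group_by_key(sample)
for key, members in groups.items():
    print(key, len(members))
```

Dictionary keys are compared for exact equality, so the colons are never ignored and the two keys end up in separate groups.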
I have tried several different options but none seem to work. -d
seems to be the default, and it has the behavior indicated above. -n
fails completely. -g also fails. Reading the man page, I don't see
any other options to control the comparison function. I have also
tried massaging my file prior to piping into sort, replacing colons
with other characters (e.g. underscore or tilde) but with no success.
I understand *why* -d considers these two keys equal. What I don't
understand is why there is no option that says "order them
lexicographically".
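For comparison, here is a minimal sketch of what a plain lexicographic ordering does with these keys. Python compares strings code point by code point and ignores the locale, so punctuation counts. GNU sort, by contrast, follows the locale's collation rules, and running it with LC_ALL=C in the environment is the commonly documented way to request byte-wise comparison. The field layout and the sample payloads below are assumptions for illustration.

```python
def sort_by_key(lines):
    """Sort lines by their first whitespace-delimited field (assumed key),
    using Python's locale-independent code point comparison."""
    return sorted(lines, key=lambda line: line.split(None, 1)[0])


# Invented sample lines that reuse the two keys from the message.
sample = [
    "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1\tread-1",
    "HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1\tread-2",
    "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1\tread-3",
]
ordered = sort_by_key(sample)
for line in ordered:
    print(line)
```

The keys first differ right after "ABXX:5:2", where one has ":" and the other has "1". Because "1" has a lower code point than ":", the key containing ":5:21:" sorts first, and the two lines with the other key end up adjacent.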
Is there a hidden sort option that will do what I need?
About the only way I can think to force sort to actually sort on such
a key is to pre-process the file and replace the keys with a hash code
(rendered with nothing but A-Z). But this introduces additional
issues, such as maintaining a table so I can convert the keys back
after sorting, and making sure my hash is unique, etc. etc.
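One way to sidestep both the lookup table and the uniqueness concern is to swap the hash for a reversible encoding. This is a different technique from the hash described above, sketched here under that stated substitution: each byte of the key becomes two letters from A to P, so the result contains nothing but uppercase letters and can be decoded afterwards. The names ALPHABET, encode_key and decode_key are made up for illustration.

```python
ALPHABET = "ABCDEFGHIJKLMNOP"  # one letter per 4-bit value (0..15)


def encode_key(key: str) -> str:
    """Encode each byte of the key as two letters: high nibble, then low."""
    out = []
    for byte in key.encode("utf-8"):
        out.append(ALPHABET[byte >> 4])
        out.append(ALPHABET[byte & 0x0F])
    return "".join(out)


def decode_key(encoded: str) -> str:
    """Invert encode_key by pairing letters back into bytes."""
    values = [ALPHABET.index(ch) for ch in encoded]
    data = bytes((hi << 4) | lo for hi, lo in zip(values[0::2], values[1::2]))
    return data.decode("utf-8")


key_a = "HWI-ST407_110127_0082_A80L25ABXX:5:2:11746:46371#0/1"
key_b = "HWI-ST407_110127_0082_A80L25ABXX:5:21:17464:6371#0/1"
enc_a, enc_b = encode_key(key_a), encode_key(key_b)
print(enc_a != enc_b, decode_key(enc_a) == key_a)
```

The trade-off is that every encoded key is twice as long as the original, which may matter for a file of this size.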
I'm pretty sure I'm not the first person to run into this problem.
Thanks for any help or advice.
Bob H