[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: Request for feature, sort
From: |
David Mathog |
Subject: |
Re: Request for feature, sort |
Date: |
Mon, 05 Nov 2018 09:34:48 -0800 |
User-agent: |
Roundcube Webmail/0.9.5 |
On 04-Nov-2018 00:45, Pádraig Brady wrote:
I'm not sure how frequent this partially sorted data pattern is,
though given the complexity improvements it seems worthwhile.
It is extremely common in Biology applications. For instance, search a
set of DNA sequences (with pre-ordered names like contig_1 -> contig_N,
or frequently just 1 -> N) against a target database and the output will
(for most applications) have clusters of results for 1, then 2, ... then
N.
There are some programs like that which when run multithreaded would
emit those same blocks of results but with the block order shuffled over
a window corresponding to the number of threads. That is, they just
emit each block when it finishes without doing anything to keep the
blocks in order. Taking advantage of "block order", but not "absolute
order", in sort would require yet another key modifier. While this is a
very closely related problem I'm not sure it is common enough to warrant
the extra modifier. That said, I think the only modification needed
from an "absolute order" implementation to a "block order" one is that
the program not consider an out of sorted order key change to be an
error.
So 'p' would be a modifier applicable to each key.
When the data in the key changes we could do
the full key comparison to ensure the order is correct,
and if so output all current sorted data for the just presorted key.
'p' modified keys would need to come before any non 'p' keys.
Exactly.
Conceivably there might be situations where the input file was already
sorted on keys 1 and 3, but not on key 2. However, retaining the order
information for the 3rd key while sorting the 2nd is likely more effort
than it is worth. Certainly diminishing returns set in very quickly in
that arena - consider how difficult it would be to retain the order
information for "-k 1,1 -k 2,2 -k 3,3p" for an input file of a billion
lines. Anyway, requiring all "p" keys to precede all non "p" keys is
not an onerous restriction.
Regards,
David Mathog
address@hidden
Manager, Sequence Analysis Facility, Biology Division, Caltech