bug#13089: Wish: split every n'th into n pipes


From: Pádraig Brady
Subject: bug#13089: Wish: split every n'th into n pipes
Date: Thu, 06 Dec 2012 13:02:34 +0000
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:13.0) Gecko/20120615 Thunderbird/13.0.1

On 12/06/2012 12:20 PM, Ole Tange wrote:
> On Thu, Dec 6, 2012 at 12:41 PM, Pádraig Brady <address@hidden> wrote:
>> On 12/06/2012 11:25 AM, Pádraig Brady wrote:
>>> On 12/06/2012 12:06 AM, Ole Tange wrote:
>>>>
>>>> Do you have a similar reference:
>>>>
>>>> * if each record is k lines (e.g. 4 lines as is the case in FASTQ files)
>>>> * If each record has a record separator (e.g. > in FASTA files)
>>>
>>> I'd probably preprocess first to a single line:
>>>
>>> The following may not be robust or efficient.
>>> I suspect there may be tools already to efficiently
>>> parse fast[aq] to a single line:
>>>
>>>     fastalines(){ sed -n '/^>/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>>     fastqlines(){ sed -n '/^@/!{H;$!b};s/$/\x00/;x;1b;s/\n//g;p'; }
>>>
>>> Then use like:
>>>
>>>   fasta_source | fastalines |
>>>   split -n r/8 --filter="tr '\0' '\n' | process_fasta"
>
> Here you assume that the quality score never reaches '@'. You cannot
> do that, because it sometimes reaches @. The only thing you can be
> sure of is that every record is 4 lines.

Sure. I mentioned they might not be robust. These may be better:

# fold each FASTA record onto one line, storing its newlines as NULs
fastalines(){ sed '1!s/^>/\x00&/' | tr '\n\0' '\0\n'; }
# join each 4-line FASTQ record onto one line, with NULs between the lines
fastqlines(){ paste -d $'\1' - - - - | tr '\1' '\0'; }
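
Used like before, e.g. (untested; fastq_source and process_fastq are
placeholders for the real producer and per-chunk command):

  fastq_source | fastqlines |
  split -n r/8 --filter="tr '\0' '\n' | process_fastq"

Each of the 8 filters then sees ordinary 4-line records again,
since the tr restores the embedded newlines.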

> I was hoping for a general solution that would work no matter the
> content. Your solution breaks if the content contains \0 (NULs are not
> in FAST[AQ] files, but may be in other formats).

Fair point, but you can use the general technique
of transforming (encoding) NULs to something else
before processing, in the unlikely case they're
present in the input.
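
For example, a rough sketch of such a round trip (untested, and it
assumes GNU sed, which accepts \x00/\x01 escapes in s///; the
nulenc/nuldec names are just for illustration):

  # escape any 0x01 bytes as 0x01'A', then encode NULs as 0x01'B'
  nulenc(){ sed 's/\x01/\x01A/g; s/\x00/\x01B/g'; }
  # undo it: restore NULs first, then unescape 0x01 (order matters)
  nuldec(){ sed 's/\x01B/\x00/g; s/\x01A/\x01/g'; }

You'd run nulenc before the fast[aq]lines step, and nuldec in the
--filter after the tr, ahead of the real processing command.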

> Do you see support coming for n-line records in split?

Given the above options, probably not.
Maybe we could add support for --zero-terminated
to treat \0 as the delimiter rather than \n,
which might simplify the postprocessing required?
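Something like this, say (entirely hypothetical; neither the option
nor its exact semantics exist yet):

  # records keep their internal newlines, so the filter needs no tr;
  # the consumer may still have to discard the NUL delimiters
  fasta_source | sed '1!s/^>/\x00&/' |
  split --zero-terminated -n r/8 --filter='process_fasta'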

> Do you see support coming for records split on regexp in split?

Given the complexity, probably not.
Regexps would be better maintained within sed etc.,
which could do the annotation for later splitting.
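
That's just the fastalines recipe above with the FASTA-specific
regexp swapped out (untested sketch; RECORD_START, some_source and
process_chunk are placeholders):

  # mark each record start (bar the first) with a NUL, then fold
  # every record onto one line, storing its newlines as NULs
  reclines(){ sed '1!s/^RECORD_START/\x00&/' | tr '\n\0' '\0\n'; }

  some_source | reclines |
  split -n r/8 --filter="tr '\0' '\n' | process_chunk"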

Note also the `csplit` util, but I don't see us updating
that to support a fixed number of outputs like `split` does either.
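For comparison, csplit can already split on a regexp today, just
into one output file per section rather than N round-robin pipes
(input_file is a placeholder):

  # writes xx00, xx01, ... with one '>'-headed section per file;
  # -s quietens the size output, -z elides an empty leading chunk
  csplit -sz input_file '/^>/' '{*}'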

cheers,
Pádraig.
