Re: suggestion for new option: --block-break

From: Achim Gratz
Subject: Re: suggestion for new option: --block-break
Date: Sat, 04 May 2019 08:24:40 +0200
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/26.2 (gnu/linux)

Ole Tange writes:
>> > parallel --colsep ';' -j 40 --cat --block 10K --block-breaks '3 
>> > $_=substr($_,-2,2)'
>> I'm not sure what that "3" is doing there - some character transliteration 
>> problem in our email?
> 3 is column 3. So $_ will contain the value in column 3. If no number
> given, then $_ is the full line.
> This will make it slightly harder distinguishing between a named
> column or some perl code. But I think it is OK to assume:
> * --block-breaks value contains only [a-z0-9_] and --header : is set
> => Named column
> * perl code otherwise
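
The rule quoted above can be sketched in a few lines of Python.  This is
only an illustration of the proposed heuristic, not code from parallel;
the function name classify_block_break and the header_set flag are
hypothetical.  Note that under this rule a bare number such as "3" also
counts as a named column when --header : is set.

```python
import re

def classify_block_break(value, header_set):
    """Guess whether a --block-breaks value names a column or is perl code.

    Hypothetical helper: mirrors the rule from the quoted mail, i.e. a value
    made up only of [a-z0-9_] while --header : is set is a named column,
    anything else is treated as perl code.
    """
    if header_set and re.fullmatch(r"[a-z0-9_]+", value):
        return "column"
    return "perl"
```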

I think it would be an interesting extension of parallel indeed.  If I
understand the OP's requirements correctly, the rows sharing a value in
the column he wants to break on form a contiguous section.  I'm not
familiar with the data formats of genomics, but I believe that some of
them might even have fixed line lengths.  That would allow a binary
search to find the break points before going into the blocking
algorithm, which would be a net win if the preprocessing only has to
read a small fraction of the total blocks.
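
A minimal sketch of that binary search in Python, under two assumptions
that the data would have to satisfy: every record (line including the
newline) is exactly reclen bytes long, and each key value occurs in one
contiguous run only.  The names read_key, run_end and keyfunc are made
up for the illustration.

```python
import io

def read_key(f, index, reclen, keyfunc):
    """Seek to record `index` and return its key (assumes fixed-length records)."""
    f.seek(index * reclen)
    return keyfunc(f.read(reclen))

def run_end(f, start, nrec, reclen, keyfunc):
    """Return the index of the first record after `start` with a different key.

    Relies on each key forming a single contiguous run, so "same key as
    record `start`" is true up to the break point and false afterwards.
    Returns nrec if the run extends to the end of the file.
    """
    key = read_key(f, start, reclen, keyfunc)
    lo, hi = start, nrec  # record lo has the key; record hi (if any) does not
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if read_key(f, mid, reclen, keyfunc) == key:
            lo = mid
        else:
            hi = mid
    return hi

# Usage: three records with key A followed by five with key B, 6 bytes each.
data = b"A;001\n" * 3 + b"B;002\n" * 5
first_col = lambda rec: rec.split(b";")[0]
print(run_end(io.BytesIO(data), 0, len(data) // 6, 6, first_col))  # prints 3
```

Each break point costs about log2(n) record reads instead of a scan of
the whole run, which is where the win would come from when runs are
long.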

If so, this really would be a preprocessing step to run before entering
parallel, and the extension to parallel would be an option to hand it a
list of blocks (which parallel may split further).
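
To make the hand-off concrete, here is a rough Python sketch of what
"further split" could mean.  It assumes the preprocessing step delivers
blocks as (byte offset, byte length) pairs over fixed-length records;
that representation and the name split_blocks are assumptions for the
illustration, not an existing parallel interface.

```python
def split_blocks(blocks, reclen, max_bytes):
    """Split (offset, length) blocks into pieces of at most max_bytes each.

    Pieces always end on a record boundary.  If max_bytes is smaller than
    one record, each piece holds exactly one record.
    """
    step = max(reclen, (max_bytes // reclen) * reclen)
    pieces = []
    for offset, length in blocks:
        end = offset + length
        for pos in range(offset, end, step):
            pieces.append((pos, min(step, end - pos)))
    return pieces

# Usage: two groups of 3 and 5 six-byte records, split at 12 bytes.
print(split_blocks([(0, 18), (18, 30)], reclen=6, max_bytes=12))
# prints [(0, 12), (12, 6), (18, 12), (30, 12), (42, 6)]
```

No piece crosses a group boundary, so every job still sees rows from a
single group.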

> Yeah, I really do not like the name --block-breaks. I like --group-by
> a little better, but not 100% happy with that either.

Or --scatter / --split(-*)?

