
Re: File divide to feed parallel


From: David
Subject: Re: File divide to feed parallel
Date: Thu, 27 Mar 2014 11:49:58 -0400 (EDT)

Ole,

Well, I suspect that again, 'dd skip=XXX file' would read sequentially to find record delimiters.  The speed and the parallelism come from seeking to the offset, then looking for the next delimiter and starting just after it.  I am not sure there is currently a general UNIX tool that does that.  If read()/fgets() calls are backed under the covers by mmap() (often true these days), then only N pages are initially touched for N parallel readers -- no extra reading, page faults, etc.
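
A minimal sketch of that worker-side read, assuming newline-delimited records; the file name and the $start/$end byte offsets are hypothetical arguments handed out by whatever divides the work, not anything GNU Parallel provides today:

    #!/usr/bin/perl
    # Sketch only: one worker reads its own byte range of a file,
    # honoring record (newline) boundaries.
    use strict;
    use warnings;

    my ($file, $start, $end) = @ARGV;
    open(my $fh, '<', $file) or die "open $file: $!";

    if ($start > 0) {
        # A record straddling $start belongs to the previous chunk:
        # back up one byte and throw away through the next delimiter.
        seek($fh, $start - 1, 0) or die "seek: $!";
        my $junk = <$fh>;
    }
    # Process every record that starts before $end; the last one may
    # run past $end ("read through the record containing the end offset").
    while (tell($fh) < $end) {
        my $rec = <$fh>;
        last unless defined $rec;
        print $rec;            # replace with the real per-record work
    }
    close $fh;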

It would be nice to have an alternative to pipe input that is more powerful and parallel-friendly than 'cat * |'.  Parallel could include and integrate such an input reader.  There might still be pipes involved under the covers, unless you want to replace the read() syscall for the child programs to hide the initial seek, the delimiter search, and the early EOF, which does not sound simple, portable, or reliable!  Including concatenation and fixed-length-record support would also make the correct choice easy for the user: one just points parallel at a pile of files to be read N ways in parallel, and parallel takes care of the concatenation and division for even loading.

Parallel has become a sort of shell extension, and this could add a valuable facet to its capabilities.  Also, parallel processing often produces multiple files in one stage that need to be fed into a subsequent stage.  Having this ensures that any variation in file sizes in stage n does not imbalance stage n+1.

To support such multi-stage processing without intermediate files, I suppose parallel might be made capable of reading records from N named pipes in a non-blocking, record-integrity-preserving fashion, feeding the next stage with minimum latency and no intermediate files, file names, or file space.  It'd be somewhat like 'sort -m <(sort a) <(sort b) <(sort c) ...', which by virtue of the nature of sort has low pipe-blocking overhead.  It's the opposite of round robin: reading pipes for full records in rotation, non-blocking (perhaps using poll()/select() -- I have found concurrent (lwp) threads and blocking reads to be higher overhead).  The pipe names might be managed under the covers by parallel, like the ksh/bash <(...) construct.  (Well, hopefully better, as bash leaves a trail of named pipes in /var/tmp on systems without /dev/fd/# pipes.  How do you know, generally, that a pipe is never going to be put into use again?)
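
A rough sketch of such a pipe-merging reader, assuming newline-delimited records and taking the pipe names on the command line (illustrative only; nothing here is existing GNU Parallel behavior):

    #!/usr/bin/perl
    # Sketch only: merge newline-delimited records from N (named) pipes
    # without blocking on any one slow writer and without ever emitting
    # a partial record.  Pipe names come from @ARGV.
    use strict;
    use warnings;
    use IO::Handle;
    use IO::Select;

    my %buf;
    my $sel = IO::Select->new;
    for my $pipe (@ARGV) {
        open(my $fh, '<', $pipe) or die "open $pipe: $!";  # waits for a writer
        $fh->blocking(0);                  # then switch to non-blocking reads
        $sel->add($fh);
        $buf{ fileno $fh } = '';
    }

    while ($sel->count) {
        # select() under the covers: only touch pipes that have data.
        for my $fh ($sel->can_read) {
            my $id = fileno $fh;
            my $n  = sysread($fh, my $chunk, 65536);
            next unless defined $n;        # EAGAIN etc.: try again later
            if ($n == 0) {                 # writer closed this pipe
                print $buf{$id} if length $buf{$id};  # flush trailing partial record
                $sel->remove($fh);
                close $fh;
                next;
            }
            $buf{$id} .= $chunk;
            # Pass on only complete records; keep any partial tail buffered.
            while ($buf{$id} =~ s/\A(.*?\n)//s) {
                print $1;                  # replace with real next-stage feeding
            }
        }
    }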

I will look into mbuffer -- rewriting things must be rife among UNIX developers!  I rewrote comm at one point.  I also have a rewrite of join, m1join, that can do many-to-one joins without seeks (pipe friendly); it has no Cartesian capability, which is not pipe friendly, but many problems are one-to-one or many-to-one rather than many-to-many.

Best,

David

-----Original Message-----
From: Ole Tange <ole@tange.dk>
To: David <dgpickett@aol.com>
Cc: parallel <parallel@gnu.org>
Sent: Thu, Mar 27, 2014 11:05 am
Subject: Re: File divide to feed parallel

On Thu, Mar 27, 2014 at 2:32 PM, David <dgpickett@aol.com> wrote:
> Ole,
>
> Yes, the idea is to level the parallel loads without a single point
> bottleneck of a serial reader.  In the world of big data, you want the
> parallel processes to use their logical id to seek to the desired position
> in the desired starting file, find the first new record, and read through
> the record containing the desired end offset.  Once the file division master
> thread builds a listing of file names and sizes and sets the chunk size,
> the parallel threads/processes can be created and begin reading in parallel.

I understand the idea. It does require the input to be a file, and not
a pipe, and currently GNU Parallel does not support that.

If GNU Parallel were to support it, I imagine I would have a process
that would figure out where to chop the blocks, and then pass that to
the main program, which would then start a 'dd skip=XXX if=file | your_program'.
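
For illustration, a minimal sketch of such a chop-point finder (not actual GNU Parallel code; the file name, block size, and record delimiter are assumed to come from the command line, with the delimiter defaulting to newline):

    #!/usr/bin/perl
    # Sketch only: compute byte offsets at which a file can be chopped
    # so that every chunk starts right after a record delimiter.
    use strict;
    use warnings;

    my ($file, $blocksize, $recend) = @ARGV;
    $recend = "\n" unless defined $recend && length $recend;
    my $size = -s $file or die "cannot size $file";
    open(my $fh, '<', $file) or die "open $file: $!";
    $/ = $recend;                          # readline() now stops at the delimiter

    my @chop = (0);
    while ($chop[-1] + $blocksize < $size) {
        my $pos = $chop[-1] + $blocksize;
        seek($fh, $pos, 0) or die "seek: $!";
        defined(my $rest = <$fh>) or last; # no delimiter before EOF
        push @chop, $pos + length $rest;   # boundary is just after the delimiter
    }
    push @chop, $size if $chop[-1] < $size;
    print join(" ", @chop), "\n";

Adjacent pairs of offsets would then give the skip and length for the dd pipeline described above.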

The file could possibly be given as -a argument to --pipe:

    parallel -a bigfile --pipe --block 1g --recend '\n\n' yourprogram   # Not implemented

If that was implemented, what should this do (multiple -a):

    parallel -a file1 -a file2 --pipe --block 1g --recend '\n\n' yourprogram   # Not implemented

> Reading sequentially and sending the records down pipes in an array in
> rotation is an alternative,

That is what --round-robin does now.

> but prone to several problems:  1) One slow pipe
> can block the reading of input.  It might be possible to skip slow pipes
> with some sort of per pipe buffering and non-blocking i/o.  I wrote a
> buffering pipe fitting that can help soften this, but that adds overhead
> with an extra pipe and process.

I can highly recommend mbuffer: extremely small overhead.

>  2) Sometimes each parallel processing is
> not N times slower than reading a file and writing a pipe.  3) The read is
> not subject to any parallelism to speed it.

Yep. All true.

> Reading file names and assigning them to parallel threads in size descending
> order in zigzag rotation (1 to N to 1 to N . . . ) for size leveling has
> parallel reading, but despite size leveling, often the largest files
> dominate the run time.  If there are not N files, there will not be any N
> way parallelism.
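
A minimal sketch of the zigzag assignment described above (illustrative only; N and the file names are assumed to come from the command line):

    #!/usr/bin/perl
    # Sketch only: deal files out to N buckets in size-descending,
    # zigzag (1..N, N..1, 1..N, ...) order to level total bytes per bucket.
    use strict;
    use warnings;

    my ($n, @files) = @ARGV;
    my @sorted = sort { -s $b <=> -s $a } @files;   # largest first
    my @bucket;
    for my $i (0 .. $#sorted) {
        my ($round, $pos) = (int($i / $n), $i % $n);
        my $b = $round % 2 ? $n - 1 - $pos : $pos;  # reverse direction each round
        push @{ $bucket[$b] }, $sorted[$i];
    }
    for my $b (0 .. $n - 1) {
        print join(" ", @{ $bucket[$b] // [] }), "\n";   # one worker's files per line
    }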

I am wondering if that really is a job for GNU Parallel? I often use
GNU Parallel for tasks where file size does not matter at all (e.g.
renaming a file).

Would it not make more sense if you sorted the input by file size?

    ls files | sort --by-size | parallel 'your program < {}'

    find . -type f | perl -e 'print map {$_,"\n"} sort { chomp($a,$b); -s $a <=> -s $b } <>' | parallel -k ls -l

> It might be nice to have an option to have chunk sizes increased to modulo
> 8192

--block 1M = --block 1048576, so try this:

    cat bigfile | parallel --pipe --block 1M --recend '' wc

> or the like so pages are less split, but really, if there is a
> delimiter, chunk edge pages are always split.

Yep.

> An option for undelimited, fixed length records could provide the record
> size, so chunks could always be in modulo-record-size bytes.

Elaborate why '--recend "" --block 8k --pipe' does not solve that.

> Does parallel ever worry about unicode, euc and such that might need to work
> in n-byte or variable wide characters?  I guess if you knew it was a utf-8
> file, you could find the character boundaries, but not all systems have such
> nice middle of file sync indicators.

GNU Parallel passes that worry on to Perl. So nothing in GNU Parallel
specifically deals with multibyte charsets.


/Ole
