
Re: File divide to feed parallel


From: David
Subject: Re: File divide to feed parallel
Date: Thu, 27 Mar 2014 09:32:48 -0400 (EDT)

Ole,

Yes, the idea is to level the parallel loads without the single-point bottleneck of a serial reader.  In the world of big data, you want each parallel process to use its logical ID to seek to the desired position in the desired starting file, find the start of the first new record, and read through the record containing the desired end offset.  Once the file-division master thread builds a list of file names and sizes and sets the chunk size, the parallel threads/processes can be created and begin reading in parallel.
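
As a rough sketch of that per-worker logic (Python here purely for illustration, not how GNU parallel works; the file name, worker id and worker count are made up, and records are assumed to be linefeed delimited):

  import os

  def read_chunk(path, worker_id, n_workers):
      """Worker worker_id of n_workers reads its byte range of path:
      it skips the partial record at its start offset (the previous
      worker owns it) and reads through the record containing its
      end offset."""
      size = os.path.getsize(path)
      chunk = (size + n_workers - 1) // n_workers   # ceiling division
      start = worker_id * chunk
      end = min(start + chunk, size)
      with open(path, "rb") as f:
          f.seek(start)
          if worker_id > 0:
              f.readline()           # discard the partial first record
          while f.tell() <= end:     # keep reading through the record
              record = f.readline()  # that contains the end offset
              if not record:
                  break              # end of file
              yield record

  # e.g. worker 2 of 8 counting its records in a hypothetical "big.log":
  print(sum(1 for _ in read_chunk("big.log", 2, 8)))

Every record is handled exactly once: a record goes to the worker whose range contains its first byte, except that a record starting exactly on a chunk boundary is taken by the worker on the left (hence the <=) and skipped by the worker on the right.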

Reading sequentially and sending the records down an array of pipes in rotation is an alternative, but it is prone to several problems:

1) One slow pipe can block the reading of input.  It might be possible to skip slow pipes with some per-pipe buffering and non-blocking I/O (sketched below).  I wrote a buffering pipe fitting that can help soften this, but it adds the overhead of an extra pipe and process.

2) Sometimes processing a record in a worker is not N times slower than reading it and writing it to a pipe, so a single reader cannot keep N workers busy.

3) The read itself gets no parallelism to speed it up.
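
For what it is worth, here is a rough sketch of that mitigation (Python for illustration only; this is not what GNU parallel does, and the wc -l commands are placeholders): a single sequential reader rotates records into per-pipe buffers and flushes whichever pipes can currently accept data, so a slow consumer only grows its buffer instead of stalling the reader:

  import os, selectors, subprocess, sys

  CMDS = [["wc", "-l"]] * 4                 # placeholder worker commands
  procs = [subprocess.Popen(c, stdin=subprocess.PIPE) for c in CMDS]
  bufs = [bytearray() for _ in procs]       # per-pipe buffer (unbounded!)
  sel = selectors.DefaultSelector()
  for i, p in enumerate(procs):
      os.set_blocking(p.stdin.fileno(), False)
      sel.register(p.stdin, selectors.EVENT_WRITE, i)

  slot = 0
  for record in sys.stdin.buffer:           # the single sequential reader
      bufs[slot] += record                  # queue the record for its slot
      slot = (slot + 1) % len(procs)        # rotate to the next pipe
      for key, _ in sel.select(timeout=0):  # flush pipes that have room now
          i = key.data
          if bufs[i]:
              try:
                  n = os.write(key.fileobj.fileno(), bufs[i])
                  del bufs[i][:n]
              except BlockingIOError:
                  pass                      # pipe filled up again; try later

  for i, p in enumerate(procs):             # blocking flush of what is left
      os.set_blocking(p.stdin.fileno(), True)
      p.stdin.write(bytes(bufs[i]))
      p.stdin.close()
  for p in procs:
      p.wait()

The cost, of course, is that a persistently slow pipe makes its buffer grow without bound, which is roughly what the extra buffering process does too.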

Reading file names and assigning whole files to parallel threads in size-descending order in zigzag rotation (1 to N, then N to 1, . . .) for size leveling does give parallel reading, but despite the leveling, the largest files often dominate the run time.  And if there are not at least N files, there is no N-way parallelism at all.
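
A toy version of that assignment (Python, with illustrative file names and sizes), showing both the leveling and why one large file still dominates:

  def zigzag_assign(files_with_sizes, n_slots):
      """Assign (name, size) pairs to n_slots lists, largest first,
      sweeping slot order 1..N then N..1 to keep the totals level."""
      slots = [[] for _ in range(n_slots)]
      by_size = sorted(files_with_sizes, key=lambda fs: fs[1], reverse=True)
      for i, (name, _size) in enumerate(by_size):
          sweep, pos = divmod(i, n_slots)
          slot = pos if sweep % 2 == 0 else n_slots - 1 - pos
          slots[slot].append(name)
      return slots

  files = [("a", 900), ("b", 700), ("c", 650), ("d", 300), ("e", 120), ("f", 50)]
  print(zigzag_assign(files, 3))
  # [['a', 'f'], ['b', 'e'], ['c', 'd']] -- totals 950, 820, 950

No matter how the leveling goes, the run time is at least the time to read and process "a", and with fewer than three files one slot would sit idle.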

It might be nice to have an option to round chunk sizes up to a multiple of 8192 or the like so fewer pages are split, but really, if there is a delimiter, the pages at chunk edges are always split.

An option for undelimited, fixed-length records could take the record size, so chunks could always be a whole multiple of the record size.
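
Both roundings are just a ceiling to the nearest multiple; for example (Python, made-up numbers):

  def round_up(n, unit):
      return ((n + unit - 1) // unit) * unit

  print(round_up(10_000_000, 8192))  # 10002432 bytes: aligned to 8 KiB pages
  print(round_up(10_000_000, 120))   # 10000080 bytes: whole 120-byte records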

Does parallel ever worry about Unicode, EUC, and the like, which may need to work in n-byte or variable-width characters?  I guess if you knew it was a UTF-8 file you could find the character boundaries, but not all encodings have such nice mid-file sync indicators.
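
The UTF-8 case is easy because continuation bytes are self-marking (0b10xxxxxx); a sketch of finding the next character boundary after an arbitrary offset (Python, made-up sample text):

  def utf8_boundary(buf, offset):
      """Return the first index >= offset in buf that starts a character."""
      while offset < len(buf) and (buf[offset] & 0xC0) == 0x80:
          offset += 1       # 0b10xxxxxx continues a multi-byte character
      return offset

  data = "naïve café".encode("utf-8")
  i = utf8_boundary(data, 3)           # offset 3 lands inside the 2-byte 'ï'
  print(i, data[i:].decode("utf-8"))   # 4 ve café

Legacy multi-byte encodings like Shift JIS or EUC do not give that guarantee, so an arbitrary byte offset cannot always be resynchronized without more context.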

Best,

David

-----Original Message-----
From: Ole Tange <ole@tange.dk>
To: David <dgpickett@aol.com>
Cc: parallel <parallel@gnu.org>
Sent: Thu, Mar 27, 2014 4:51 am
Subject: Re: File divide to feed parallel

On Wed, Mar 26, 2014 at 9:32 PM, David <dgpickett@aol.com> wrote:
> ETL programs like Ab Initio know how to tell parallel processes to split up
> big files and process each part separately, even when the files are linefeed
> delimited (they all agree to search up (or down) for the dividing linefeed
> closest to N bytes down file).  Does anyone know of a utility that can split
> a file this way (without reading it sequentially)?  Is this in gnu parallel?

GNU Parallel will do that except it will read it sequentially.

> It'd be nice to be able to take a list of mixed size files and divide them
> by size into N chunks of approximately equal lines, estimated using byte
> sizes and with an algorithm for searching for the record delimiter
> (linefeed) such that no records are lost.  Sort of a mixed input leveller
> for parallel loads.  If it is part of parallel, then parallel can launch
> processing for each chunk and to combine the chunks.

That is what --pipe does (except it reads sequentially):

  cat files* | parallel --pipe --block 10m wc

/Ole
