Re: File divide to feed parallel
From: Ole Tange
Subject: Re: File divide to feed parallel
Date: Thu, 27 Mar 2014 16:04:51 +0100
On Thu, Mar 27, 2014 at 2:32 PM, David <dgpickett@aol.com> wrote:
> Ole,
>
> Yes, the idea is to level the parallel loads without a single point
> bottleneck of a serial reader. In the world of big data, you want the
> parallel processes to use their logical id to seek to the desired position
> in the desired starting file, find the first new record, and read through
> the record containing the desired end offset. Once the file division master
> thread builds a listing of file names, sizes, and sets the chunk size, then
> the parallel threads/processes can be created and begin reading in parallel.
I understand the idea. It does require the input to be a file, and not
a pipe, and currently GNU Parallel does not support that.
If GNU Parallel were to support it, I imagine I would have a process
that would figure out where to chop the blocks, and then pass that to
the main program which would then start a 'dd skip=XXX if=file |
your_program'
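A minimal sketch of the "figure out where to chop the blocks" step, written in Python purely for illustration. It is not how GNU Parallel does or would do it. The names find_chunks and next_record_end and the 64 KiB probe size are hypothetical, and the sketch assumes the input is a seekable regular file with a non-empty record separator:

```python
import os


def next_record_end(f, target, size, recend, probe=65536):
    """Return the smallest offset >= target that sits just past a record
    separator, or `size` if no separator follows."""
    # Start a little before target so a separator ending exactly at
    # target (or straddling it) is still seen.
    pos = max(0, target - len(recend))
    f.seek(pos)
    base, buf = pos, b""  # base = absolute file offset of buf[0]
    while True:
        data = f.read(probe)
        if not data:
            return size
        buf += data
        i = buf.find(recend)
        if i != -1:
            return base + i + len(recend)
        # Keep only the tail that could be the start of a split separator.
        drop = max(0, len(buf) - (len(recend) - 1))
        base += drop
        buf = buf[drop:]


def find_chunks(path, block_size, recend=b"\n"):
    """Return a list of (start, end) byte offsets covering the file.
    Every chunk except possibly the last ends right after a separator."""
    if block_size <= 0 or not recend:
        raise ValueError("block_size must be > 0 and recend non-empty")
    size = os.path.getsize(path)
    chunks, start = [], 0
    with open(path, "rb") as f:
        while start < size:
            target = min(start + block_size, size)
            end = next_record_end(f, target, size, recend)
            chunks.append((start, end))
            start = end
    return chunks
```

Only a few bytes around each block boundary are read, so the chopping itself stays cheap. Each (start, end) pair could then be handed to a worker that seeks to start and reads end - start bytes, which is the role 'dd skip=XXX if=file' plays above.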
The file could possibly be given as -a argument to --pipe:
parallel -a bigfile --pipe --block 1g --recend '\n\n' yourprogram  # Not implemented
If that was implemented, what should this do (multiple -a):
parallel -a file1 -a file2 --pipe --block 1g --recend '\n\n' yourprogram  # Not implemented
> Reading sequentially and sending the records down pipes in an array in
> rotation is an alternative,
That is what --round-robin does now.
> but prone to several problems: 1) One slow pipe
> can block the reading of input. It might be possible to skip slow pipes
> with some sort of per pipe buffering and non-blocking i/o. I wrote a
> buffering pipe fitting that can help soften this, but that adds overhead
> with an extra pipe and process.
I can highly recommend mbuffer: extremely small overhead.
> 2) Sometimes each parallel processing is
> not N times slower than reading a file and writing a pipe. 3) The read is
> not subject to any parallelism to speed it.
Yep. All true.
> Reading file names and assigning them to parallel threads in size descending
> order in zigzag rotation (1 to N to 1 to N . . . ) for size leveling has
> parallel reading, but despite size leveling, often the largest files
> dominate the run time. If there are not N files, there will not be any N
> way parallelism.
I am wondering if that really is a job for GNU Parallel? I often use
GNU Parallel for tasks, where file size does not matter at all (e.g.
rename a file).
Would it not make more sense if you sorted the input by file size?
ls -S files | parallel 'your program < {}'   # ls -S lists largest first
find . -type f | perl -e 'print map {$_,"\n"} sort { chomp($a,$b);
-s $a <=> -s $b } <>' | parallel -k ls -l
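The same ordering, plus the zigzag assignment described in the quoted text, can be sketched in Python. This is an illustration only: files_by_size and zigzag_assign are hypothetical names, and reading "1 to N to 1 to N" as 1..N followed by N..1 is an assumption about what was meant:

```python
import os


def files_by_size(root, largest_first=False):
    """Regular files under root, sorted by size (ties broken by path)."""
    paths = []
    for dirpath, _dirs, names in os.walk(root):
        for name in names:
            p = os.path.join(dirpath, name)
            # Skip symlinks and non-regular files, like find -type f.
            if os.path.isfile(p) and not os.path.islink(p):
                paths.append(p)
    return sorted(paths, key=lambda p: (os.path.getsize(p), p),
                  reverse=largest_first)


def zigzag_assign(items, n):
    """Deal items to n lanes in the order 0..n-1, n-1..0, 0..n-1, ..."""
    lanes = [[] for _ in range(n)]
    for i, item in enumerate(items):
        rnd, pos = divmod(i, n)
        lane = pos if rnd % 2 == 0 else n - 1 - pos
        lanes[lane].append(item)
    return lanes
```

Feeding zigzag_assign with files_by_size(root, largest_first=True) levels the per-lane totals somewhat, but as noted in the quoted text a single very large file still dominates its lane.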
> It might be nice to have an option to have chunk sizes increased to modulo
> 8192
--block 1M = --block 1048576, so try this:
cat bigfile | parallel --pipe --block 1M --recend '' wc
> or the like so pages are less split, but really, if there is a
> delimiter, chunk edge pages are always split.
Yep.
> An option for undelimited, fixed length records could provide the record
> size, so chunks could always be in modulo-record-size bytes.
Please elaborate on why '--recend "" --block 8k --pipe' does not solve that.
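For reference, the arithmetic behind "chunk sizes in modulo-N bytes" is just rounding a wanted block size up to the next multiple of the page or record size. A small Python sketch (round_up is a hypothetical helper, not a GNU Parallel option):

```python
def round_up(n, multiple):
    """Smallest multiple of `multiple` that is >= n."""
    if multiple <= 0:
        raise ValueError("multiple must be positive")
    return -(-n // multiple) * multiple  # ceiling division, then scale


# 1M is already a whole number of 8192-byte pages.
assert 1048576 % 8192 == 0
# A 1000000-byte block rounded up to whole 8192-byte pages.
assert round_up(1000000, 8192) == 1007616
# An 8k block rounded up to whole 100-byte fixed-length records.
assert round_up(8192, 100) == 8200
```

The resulting number could be given as a plain byte count to --block, in the same way 1048576 is above.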
> Does parallel ever worry about unicode, euc and such that might need to work
> in n-byte or variable wide characters? I guess if you knew it was a utf-8
> file, you could find the character boundaries, but not all systems have such
> nice middle of file sync indicators.
GNU Parallel passes that worry on to Perl. So nothing in GNU Parallel
specifically deals with multibyte charsets.
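For the UTF-8 case mentioned in the quoted text, finding a character boundary is simple because continuation bytes always have the bit pattern 10xxxxxx. A Python sketch of that idea only (next_utf8_boundary is a hypothetical name, the input is assumed to be valid UTF-8, and nothing like this exists in GNU Parallel):

```python
def next_utf8_boundary(buf, pos):
    """Smallest index >= pos in the bytes object buf that does not point
    into the middle of a multibyte UTF-8 character."""
    # Continuation bytes look like 10xxxxxx; skip forward past them.
    while pos < len(buf) and (buf[pos] & 0xC0) == 0x80:
        pos += 1
    return pos
```

Splitting buf at the returned index leaves both halves decodable on their own. Encodings without such self-synchronizing markers would need to be scanned from a known-good starting point instead.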
/Ole