Re: Limiting memory used by parallel?

From: Ole Tange
Subject: Re: Limiting memory used by parallel?
Date: Sun, 28 Jan 2018 02:45:42 +0100

On Thu, Jan 25, 2018 at 4:33 PM, hubert depesz lubaczewski
<address@hidden> wrote:
> Hi,
> I'm writing a tool that will make a tarball, and then the tarball is
> passed to parallel, which splits it into 5GB blocks, and each block is
> sent to separate pipe.
> Call looks like:
> tar cf - /some/directory | parallel -j 5 --pipe --block 5G --recend '' 
> ./ "{#}"

--pipe keeps one block per job slot in memory, so the above should use
around 25 GB of RAM (5 jobs * 5 GB blocks).

You can see the reason for this design by imagining jobs that read
very slowly: You will want all 5 of these to be running, but you would
have to read (and buffer) at least 4*5 GB to start the 5th process,
and the code is cleaner if you simply read the full block for every
job.
--pipepart does not buffer blocks in memory, so that is one way to
avoid this. --pipepart is also extremely fast: It delivers around
1 GB/s per CPU core, so it will most likely be limited by your disk
speed:

  tar cf - /some/directory > bigfile.tar
  parallel --pipepart -a bigfile.tar --block 5G --recend '' ./ {#}
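--pipepart can skip buffering because it works on a seekable file:
block boundaries are computed from the file size alone, and each
worker seeks to its own offset instead of reading through the data. A
minimal sketch of that idea, with a made-up file name and tiny sizes
for illustration:

```shell
# Block boundaries from file size alone, no data read (illustrative sizes).
printf 'x%.0s' $(seq 1 100) > demo.bin   # 100-byte demo file
size=$(( $(wc -c < demo.bin) ))
block=30
blocks=$(( (size + block - 1) / block )) # ceiling division
echo "file: $size bytes, $blocks blocks of <= $block bytes"
# prints: file: 100 bytes, 4 blocks of <= 30 bytes
rm -f demo.bin
```

This is also why --pipepart needs a real file (-a) rather than stdin:
you cannot seek in a pipe.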

But I imagine you do not have space to keep an uncompressed copy of
the tarfile, and you really want to handle the parts _while_ tar is
running.

You can also use --cat:

  tar cf - /some/directory | parallel -j 5 --pipe --block 5G --cat \
    --recend '' 'cat {} | ./ {#}'

This way each block is saved to a temporary file before the job
starts. In my limited testing this makes GNU Parallel keep only 1-2
blocks in memory.
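What --cat does for one job can be imitated by hand, which also shows
where the disk cost goes (the file name below comes from mktemp;
parallel's real temp files live under $TMPDIR):

```shell
# Hand-rolled version of one --cat job: write the block to a temp
# file, hand the job a file name instead of a pipe, then clean up.
tmp=$(mktemp)
printf 'block data\n' > "$tmp"       # stand-in for one 5G block
bytes=$(( $(wc -c < "$tmp") ))       # the job reads the block from disk
echo "job read $bytes bytes from a temp file"
rm -f "$tmp"
```

With real 5G blocks this trades RAM for disk: the temp filesystem
needs room for a few blocks at a time, so point TMPDIR at a disk with
enough free space.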


> depesz   11036 12.0  0.8 5291388 510088 pts/4  S+   15:11   0:00  |   \_ perl 
> /usr/bin/parallel -j 5 --no-notice --pipe --block 5G --recend  
> /home/depesz/ {#}

PS: Please consider running --bibtex once.
