Re: Catch OOM kills


From: John
Subject: Re: Catch OOM kills
Date: Mon, 7 May 2018 15:38:13 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.7.0

Thanks so much Ole and Douglas. Appreciated.

John

On 05/07/2018 08:56 AM, Ole Tange wrote:
On Fri, May 4, 2018 at 7:23 PM, Douglas Pagani
<redstonefreedom@gmail.com> wrote:
On Fri, May 4, 2018 at 8:42 AM, John <johnanders@posteo.de> wrote:
How can I detect when a program I have called with parallel gets killed by the 
kernel for running out of memory?
:
I would like an option that returns all the jobs that could not finish. Is this 
possible?
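You can use parallel --joblog ~/my.log to record several pieces of information about each job. One of 
those pieces is "Exitval", which tells you not only that your job completed 
unsuccessfully, but with what exit code. For example, instead of having to check dmesg for an 
"Out of memory: Kill process ..." message, you can treat 137 as a strong hint that Linux's OOM 
killer sent your process a SIGKILL (128 + 9). Note that the OOM killer uses SIGKILL, not SIGTERM 
(which would give 143), and that depending on how the job was run, the kill may instead show up 
as 9 in the "Signal" column. Any other SIGKILL looks the same in the log, so dmesg remains the 
definitive check.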
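
To illustrate the 128 + signal arithmetic, here is a minimal sketch, assuming a POSIX-style shell such as bash or dash. It does not trigger the OOM killer; the child shell simply sends itself the same signal (SIGKILL) so that the resulting exit status can be inspected.

```shell
# Run a child shell that kills itself with SIGKILL, the signal the OOM killer uses.
sh -c 'kill -KILL $$'
status=$?
echo "exit status: $status"

# In bash and dash, a status above 128 encodes the signal number as status - 128.
if [ "$status" -gt 128 ]; then
    signal=$((status - 128))
    echo "killed by signal $signal ($(kill -l "$signal"))"
fi
```

The same status is produced by any SIGKILL, for example a manual kill -9, so it is a hint rather than proof of an out-of-memory kill.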

I usually run an ad hoc script to pick up the "stragglers" after a larger run, 
by parsing that file for any non-zero Exitval and re-invoking the full command line 
associated with it. Of course, if the failure was due to something deterministic, 
you'll just get non-zeros again and again until you fix the problem with the 
data/args of those specific invocations.
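
A minimal sketch of such a straggler script, not the actual script referred to above. It assumes the tab-separated joblog layout documented in the GNU parallel man page (Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command). The sample log and the ./process commands are made up for illustration.

```shell
# Build a made-up sample joblog (hypothetical jobs; real logs come from --joblog).
joblog=$(mktemp)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' > "$joblog"
printf '1\t:\t1525700000.000\t1.204\t0\t0\t0\t0\t./process a.dat\n' >> "$joblog"
printf '2\t:\t1525700000.100\t9.871\t0\t0\t137\t0\t./process b.dat\n' >> "$joblog"
printf '3\t:\t1525700000.200\t0.532\t0\t0\t1\t0\t./process c.dat\n' >> "$joblog"

# Skip the header, then print the command (column 9) of every job whose
# Exitval (column 7) or Signal (column 8) is non-zero.
failed=$(awk -F '\t' 'NR > 1 && ($7 != 0 || $8 != 0) { print $9 }' "$joblog")
printf '%s\n' "$failed"

rm -f "$joblog"
```

The printed command lines could then be fed back to a shell, or to parallel with -j1, to rerun only the failed jobs.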
I reckon that is a good approach. If memory usage varies a lot between your
jobs, first run many of them in parallel:

   parallel --joblog my.log -j100% [...]

When that is done, rerun all failed jobs, but only one at a time so that
each gets as much memory as possible:

   parallel --retry-failed --joblog my.log -j1

or:

   parallel --resume-failed --joblog my.log -j1 [...]

This last part is basically Douglas' ad hoc script.

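The difference between --retry-failed and --resume-failed is described
in the man page. In short, --retry-failed ignores the command and
arguments on the command line and reruns the failed commands recorded in
the joblog, while --resume-failed takes the command and arguments from
the command line, reruns the failed jobs, and then continues with any
jobs not yet run.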


/Ole




