
Re: Catch OOM kills

From: Ole Tange
Subject: Re: Catch OOM kills
Date: Mon, 7 May 2018 08:56:57 +0200

On Fri, May 4, 2018 at 7:23 PM, Douglas Pagani
<address@hidden> wrote:
> On Fri, May 4, 2018 at 8:42 AM, John <address@hidden> wrote:
>> How can I tell if a program I have called with parallel gets killed by
>> the kernel for running out of memory?
>> I would like to have an option that returns all the jobs that were not
>> able to finish. Is this possible?

> You can use parallel --joblog ~/my.log to output several pieces of
> information about each job. One of those pieces is "Exitval", which tells you
> not only that your job completed unsuccessfully, but with which exit code. For
> example, instead of having to check dmesg for an "Out of memory: Kill process
> ..." message, you can safely assume an exit value of 137 means Linux's OOM
> killer sent your process a SIGKILL (128 + 9).
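The 128 + signal convention is easy to verify from a shell. For example, a process killed by SIGKILL (signal 9, which the OOM killer uses) exits with status 137:

```shell
# A process killed with SIGKILL reports exit status 128 + 9 = 137;
# the same arithmetic gives 143 for a SIGTERM (128 + 15).
sh -c 'kill -KILL $$'   # the child shell kills itself with signal 9
echo $?                 # prints 137
```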
> I usually run an ad hoc script to pick up the "stragglers" after a larger
> run, by parsing that file for any non-zero Exitvals and re-invoking the
> full command line associated with each. Of course, if the exit code was due
> to something deterministic, you'll just get non-zero exits again and again
> until you fix the problem with the data/args of those specific invocations.
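Such a straggler-pickup script can be sketched in a few lines. This is only an illustration: it assumes GNU parallel's default TAB-separated joblog layout (header Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command) and that no command contains a TAB; the joblog contents and the bigjob command below are made up.

```shell
# Fabricated example joblog, standing in for one written by
# parallel --joblog my.log (TAB-separated, 9 columns).
printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
  Seq Host Starttime JobRuntime Send Receive Exitval Signal Command \
  1 : 0 0.1 0 0 0 0 'echo ok' \
  2 : 0 0.1 0 0 137 9 'bigjob --input a' > my.log

# Exitval is field 7 and Command is field 9: print every command
# that failed, ready to be piped back into parallel -j1 (or sh).
awk -F'\t' 'NR > 1 && $7 != 0 { print $9 }' my.log
```

Piping that output back into parallel -j1 reruns the failures serially; --retry-failed does essentially this for you.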

I reckon that is a good approach. If your jobs have widely varying
memory usage, then first run many of them in parallel:

  parallel --joblog my.log -j100% [...]

When that finishes, run all the failed jobs again, but only one job at
a time, so each gets as much memory as possible:

  parallel --retry-failed --joblog my.log -j1

or:

  parallel --resume-failed --joblog my.log -j1 [...]

This last part is basically Douglas' ad hoc script.

The difference between --retry-failed and --resume-failed is described
in the man page: in short, --retry-failed reads the commands to rerun
from the joblog itself, so you need not repeat the command line,
whereas --resume-failed takes the command line you give it and reruns
the jobs the joblog records as failed.

