Re: Catch OOM kills
From: Ole Tange
Subject: Re: Catch OOM kills
Date: Mon, 7 May 2018 08:56:57 +0200
On Fri, May 4, 2018 at 7:23 PM, Douglas Pagani
<redstonefreedom@gmail.com> wrote:
> On Fri, May 4, 2018 at 8:42 AM, John <johnanders@posteo.de> wrote:
>>
>> How can I catch it if the program I have called with parallel gets
>> killed by the kernel due to lack of memory?
:
>> I would like to have an option that returns all the jobs that could
>> not be finished. Is this possible?
> You can use parallel --joblog ~/my.log to output several pieces of
> information about each job. One of those pieces is "ExitVal", which tells you
> not only that a job failed, but with what exit code. For
> example, instead of having to check dmesg for an "Out of memory: Kill process
> ..." message, you can safely assume 137 is from Linux's OOM killer having
> sent your process a SIGKILL (128 + 9).
>
> I usually run an ad hoc script to pick up the "stragglers" after a larger
> run, by parsing that file for any non-zero ExitVals and re-invoking the
> full command line associated with each. Of course, if the exit code was due to
> something deterministic, you'll just get non-zeros again and again, unless you
> first fix the problem with the data/args of those specific invocations.
I would reckon that is a good approach. If your jobs have widely
varying memory usage, then first run a lot of them in parallel:
parallel --joblog my.log -j100% [...]
When that is done, run all the failed jobs again, but only a single job
at a time, to give each job the most memory available:
parallel --retry-failed --joblog my.log -j1
or:
parallel --resume-failed --joblog my.log -j1 [...]
This last part is basically Douglas' ad hoc script.
The difference between --retry-failed and --resume-failed is described
in the man page.
/Ole