Re: Catch OOM kills
From: Ole Tange
Subject: Re: Catch OOM kills
Date: Mon, 7 May 2018 08:56:57 +0200
On Fri, May 4, 2018 at 7:23 PM, Douglas Pagani
<redstonefreedom@gmail.com> wrote:
> On Fri, May 4, 2018 at 8:42 AM, John <johnanders@posteo.de> wrote:
>>
>> How can I catch it if the program I have called with parallel gets
>> killed by the kernel due to lack of memory?
:
>> I would like to have an option that returns all the jobs that could
>> not be finished. Is this possible?
> You can use parallel --joblog ~/my.log to output several pieces of
> information about each job. One of those pieces is "ExitVal", which tells you
> not only that a job failed, but with what exit code. For
> example, instead of having to check dmesg for an "Out of memory: Kill process
> ..." message, you can safely assume 137 is from Linux's OOM killer having
> sent your process a SIGKILL (128 + 9).
>
> I usually run an ad hoc script to pick up the "stragglers" after a larger
> run, by parsing that file for any non-zero ExitVals and re-invoking the
> full command line associated with each. Of course, if the exit code was due to
> something deterministic, you'll just get non-zeros again and again, unless you
> first fix the problem with the data/args of those specific invocations.
I would reckon that is a good approach. If your jobs have widely
varying memory usage, then first run a lot of them in parallel:
parallel --joblog my.log -j100% [...]
When that is done, run all the failed jobs again, but only a single job
at a time, to give each job the most memory available:
parallel --retry-failed --joblog my.log -j1
or:
parallel --resume-failed --joblog my.log -j1 [...]
This last part is basically Douglas' ad hoc script.
The difference between --retry-failed and --resume-failed is described
in the man page.
/Ole