
Re: Catch OOM kills

From: Douglas Pagani
Subject: Re: Catch OOM kills
Date: Fri, 4 May 2018 13:23:20 -0400

On Fri, May 4, 2018 at 8:42 AM, John <address@hidden> wrote:
Dear all

How can I catch when a program I have called with parallel gets killed by the kernel because it ran out of memory?
I know about the option --memfree, but I am not sure it satisfies all my needs. For example, what happens if one of my jobs was repeatedly put back into the queue and is now the last one? Even though it now has all the memory available to itself, it could still get killed.
I would like an option that returns all the jobs that could not be finished. Is this possible?


You can use parallel --joblog ~/my.log to record several pieces of information about each job. One of those is "Exitval", which tells you not only that a job finished unsuccessfully, but with what exit code. For example, instead of having to check dmesg for an "Out of memory: Kill process ..." message, you can reasonably assume an exit value of 137 means Linux's OOM killer sent your process a SIGKILL (128 + 9).
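The "128 + signal number" convention is easy to verify for yourself without involving the OOM killer at all; this small sketch just kills a child shell with SIGKILL and inspects the status the parent sees:

```shell
# A child process killed by SIGKILL (signal 9) is reported to its
# parent with exit status 128 + 9 = 137 -- the same value you would
# see in the Exitval column of a joblog after an OOM kill.
sh -c 'kill -KILL $$'
status=$?
echo "exit status: $status"
```

On a POSIX shell this prints "exit status: 137"; a job terminated with SIGTERM would show 143 (128 + 15) by the same arithmetic.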

I usually run an ad hoc script to pick up the "stragglers" after a larger run: I parse that file for any non-zero Exitval entries and re-invoke the full command line associated with each. Of course, if the exit code was due to something deterministic, you'll just get non-zero exits again and again until you fix the underlying problem with the data/args of those specific invocations.
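That straggler pass can be sketched roughly as follows. The joblog itself, the column names, and the sample commands here are fabricated for illustration (a real joblog has tab-separated fields Seq, Host, Starttime, JobRuntime, Send, Receive, Exitval, Signal, Command); the point is just the awk filter on the Exitval column:

```shell
# Fabricate a tiny joblog instead of running parallel, so the sketch
# is self-contained. Fields are tab-separated; Exitval is field 7 and
# the command line is field 9.
joblog=$(mktemp)
printf 'Seq\tHost\tStarttime\tJobRuntime\tSend\tReceive\tExitval\tSignal\tCommand\n' >"$joblog"
printf '1\t:\t0\t0.1\t0\t0\t0\t0\techo ok\n'        >>"$joblog"
printf '2\t:\t0\t0.1\t0\t0\t137\t9\techo retried\n' >>"$joblog"

# Skip the header row, keep rows whose Exitval is non-zero, and print
# the Command field -- these are the invocations to re-run.
awk -F'\t' 'NR > 1 && $7 != 0 { print $9 }' "$joblog" > retry.txt

cat retry.txt
rm -f "$joblog"
```

The re-run itself could then be something like `parallel --joblog retry.log :::: retry.txt`. Recent versions of GNU parallel can also do this bookkeeping for you with --retry-failed (re-run the failed jobs recorded in an existing joblog), which may make the hand-rolled awk pass unnecessary.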
