Re: slow to take action

From: Nick Upson
Subject: Re: slow to take action
Date: Sat, 18 Jun 2011 10:33:17 +0100

On 17 June 2011 23:16, Jan-Henrik Haukeland <address@hidden> wrote:

On Jun 17, 2011, at 8:07 PM, Nick Upson wrote:

> Hi,
> I have a monit configuration where it is monitoring 25 hosts (ping
> test) and several local processes.
> Doing anything with monit except a summary takes a long time. It seems
> that the tests are each done sequentially:
> a) this means that there is the possibility of one set of tests not
> being complete when the next is due to start as the number of hosts increases
> b) restarting a local process takes too long
> Is there any way I can adjust the configuration to improve the situation?

a) Monit runs all tests serially in a single thread. This means that the list of tests is run from start to finish. If some tests take a long time to complete, it just means that Monit will take longer to run through the list. What is important is that each and every test is run, and Monit will do that. What is (usually) less important is whether a test runs a bit later, depending on how long the previous tests take.
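To illustrate the serial model described above, here is a minimal Python sketch. It is not Monit's code: `run_cycle`, the test names and the durations (in whole seconds) are all made up for illustration. It only shows that, with one thread, each test starts when the previous one finishes, so one slow test pushes every later test back.

```python
def run_cycle(tests):
    """Simulate one serial pass over (name, duration_seconds) pairs.

    Returns the start offset of each test and the total cycle time.
    """
    clock = 0
    schedule = []
    for name, duration in tests:
        schedule.append((name, clock))  # this test starts once the previous one is done
        clock += duration               # nothing else runs while it is in progress
    return schedule, clock


# Hypothetical workload: one unreachable host followed by two quick checks.
tests = [("host-a", 15), ("host-b", 1), ("local-process", 1)]
schedule, total = run_cycle(tests)
for name, start in schedule:
    print(f"{name}: starts at +{start}s")
print(f"cycle took {total}s")
```

Every test still runs, as stated above; only the start times move.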

b) Monit forks a new process, and this operation takes just milliseconds, but Monit will wait, if I remember correctly, up to one poll cycle to see if the process comes up. If your program is slow to start (from Monit's point of view, that means slow to create the pidfile), this will delay all the tests since, as mentioned, testing is single-threaded. So yes, this model could be improved [1]. But there are a few things you can do now; for instance, make sure that the program writes its pidfile as soon as possible. If you cannot modify the program, create a wrapper script that writes the pidfile first and then does an exec of the program.
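The wrapper idea could be sketched roughly as follows, here in Python rather than shell. The script name `wrapper.py`, the argument order and the helper names are assumptions made for this example; nothing here comes from Monit itself. The point is that exec replaces the wrapper process without changing its PID, so the pidfile written beforehand remains valid for the real program.

```python
#!/usr/bin/env python3
import os
import sys


def write_pidfile(path):
    """Write this process's PID to `path` before the slow program starts."""
    with open(path, "w") as f:
        f.write(f"{os.getpid()}\n")


def run(pidfile, command):
    """Write the pidfile, then replace this process with `command`."""
    write_pidfile(pidfile)
    # execvp does not return on success; the program inherits our PID.
    os.execvp(command[0], command)


if __name__ == "__main__":
    if len(sys.argv) >= 3:
        run(sys.argv[1], sys.argv[2:])
    else:
        print("usage: wrapper.py PIDFILE PROGRAM [ARGS...]", file=sys.stderr)
```

The start program in the Monit configuration would then point at the wrapper instead of the program itself, with the same pidfile path that the process check uses. This only works if the program does not fork again and change its PID after starting.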

You may also adjust the connection timeout in the configuration, but if it is set too low you risk false-positive alerts, which is probably worse.

1. We are about to release a new version of Monit which implements a new 'check program' test, meant for checking the exit status of a script or program. This implementation uses another model which does not delay other tests, and we may also use it when checking processes.

Best regards
Jan-Henrik Haukeland
☏: +47 97141255

Sorry, I wasn't clear. The reason that restarting a local process takes a long time is that monit is stepping through all the remote host checks. The status says that the operation is pending, so it has received the instruction. Rather than just recording the request and doing it later, could monit take the action immediately and then return to checking the remote hosts?

The check is:

 if failed icmp type echo count 3 with timeout 5 seconds for 3 cycles then

So I suppose it's taking 3 x 5 = 15 seconds to decide that a host has failed and move on to the next one.
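Extending that estimate to the whole host list gives a rough upper bound. The sketch below assumes, as supposed above, that each of the 3 echo requests waits the full 5 second timeout, and additionally that all 25 hosts are unreachable in the same cycle; the function name is made up for this example.

```python
def worst_case_seconds(hosts, count, timeout_s):
    """Upper bound if every echo request to every host waits for its full timeout."""
    return hosts * count * timeout_s


per_host = worst_case_seconds(1, 3, 5)     # one unreachable host
all_hosts = worst_case_seconds(25, 3, 5)   # all 25 hosts unreachable at once
cycle_s = 120                              # the 2 minute poll cycle

print(per_host, all_hosts, all_hosts > cycle_s)  # prints "15 375 True"
```

Under those assumptions a single pass over the hosts could take longer than the poll cycle itself, and a pending restart would wait behind it.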

When monit is on a 2 minute cycle, does it start the next cycle 2 minutes after finishing the last one, or does it run at 2 minute intervals, with the risk of overlapping the previous cycle?
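The two possibilities in that question can be written out as a small sketch. This does not claim which one monit uses, since the thread does not say; the function names and the cycle durations are hypothetical, with 375 seconds standing in for a cycle where every host check times out.

```python
def starts_fixed_delay(durations, interval):
    """Each cycle starts `interval` seconds after the previous one finishes."""
    starts, t = [], 0
    for d in durations:
        starts.append(t)
        t += d + interval
    return starts


def starts_fixed_rate(durations, interval):
    """A cycle is due every `interval` seconds, however long the last one took."""
    return [i * interval for i in range(len(durations))]


def overlaps(starts, durations):
    """For each cycle but the last: is it still running when the next is due?"""
    return [starts[i] + durations[i] > starts[i + 1]
            for i in range(len(starts) - 1)]


durations = [30, 375, 30]   # hypothetical cycle lengths in seconds
interval = 120              # 2 minute cycle

delay = starts_fixed_delay(durations, interval)
rate = starts_fixed_rate(durations, interval)
print(delay, overlaps(delay, durations))
print(rate, overlaps(rate, durations))
```

With a fixed delay the cycles can never overlap, but they drift later after a slow pass; with a fixed rate the second cycle here would still be running when the third is due.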

Nick Upson (01799 533252)
