Re: [monit] monit 5.0 beta4 bug - sends same message every cycle


From: Martin Pala
Subject: Re: [monit] monit 5.0 beta4 bug - sends same message every cycle
Date: Wed, 19 Nov 2008 21:52:10 +0100

Hi,

this bug should be fixed in cvs (I found it yesterday). The problem is not new in beta4; it has been present in Monit since the event queue was implemented.

When the event queue is enabled and the delivery of an event fails, Monit stores the event in the queue and retries the delivery in the next cycle. Each event has a flag which says whether just the mail delivery, just the mmonit event delivery, or both (mail and mmonit) failed. The flag is stored in the queue along with the event. Note that only the delivery which failed is retried.

Now the problem with mail flooding: if both mail (for example a temporary mail outage) and mmonit (for example mmonit is stopped or rejects the message) failed, the event in the Monit queue is marked to be retried for both mail and mmonit delivery. If one of these handlers succeeds on a retry, Monit updates the flag so that the given handler isn't retried in the next cycle, since the message was delivered. If both handlers succeeded, the event file was removed from the queue. However, if only one handler succeeded, the updated flag was not written back to the queue file. So even though the delivery to that handler succeeded, Monit tried it again in the next cycle, because it read the flag from the file and "forgot" that the retry had been successful in the previous cycle.
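
To make the flag handling above easier to follow, here is a minimal sketch in Python. It is only a model of the behaviour described in this mail, not Monit's actual code (Monit is written in C). The names MAIL, MMONIT, QueuedEvent, retry_cycle and simulate are made up for the illustration, and the dataclass field stands in for the flag kept in the queue file.

```python
from dataclasses import dataclass

# Hypothetical flag bits: a set bit means "this handler still needs a retry".
MAIL = 0x1
MMONIT = 0x2


@dataclass
class QueuedEvent:
    """Stands in for one event file in the queue directory."""
    flag_on_disk: int


def retry_cycle(event, handlers, persist_partial):
    """Run one retry cycle for a queued event.

    handlers maps a flag bit to (name, deliver), where deliver() returns
    True on success. persist_partial=False models the bug: a partial
    success is never written back to the queue file.
    Returns (names of handlers invoked, whether the event left the queue).
    """
    pending = event.flag_on_disk  # the flag is re-read from the file each cycle
    invoked = []
    for bit, (name, deliver) in handlers.items():
        if pending & bit:
            invoked.append(name)
            if deliver():
                pending &= ~bit
    if pending == 0:
        return invoked, True  # both delivered: the event file is removed
    if persist_partial:
        event.flag_on_disk = pending  # the fix: remember what already succeeded
    return invoked, False


def simulate(persist_partial, cycles=3):
    """Count mails sent while mail works again but mmonit stays down."""
    event = QueuedEvent(flag_on_disk=MAIL | MMONIT)
    handlers = {
        MAIL: ("mail", lambda: True),       # mail server has recovered
        MMONIT: ("mmonit", lambda: False),  # mmonit is still unreachable
    }
    mails_sent = 0
    for _ in range(cycles):
        invoked, removed = retry_cycle(event, handlers, persist_partial)
        mails_sent += invoked.count("mail")
        if removed:
            break
    return mails_sent


if __name__ == "__main__":
    print(simulate(persist_partial=False))  # prints 3: one mail per cycle
    print(simulate(persist_partial=True))   # prints 1: mail is not retried again
```

With the partial result not persisted, the mail goes out once per cycle for as long as mmonit stays down, which is the flooding seen in the report. With the flag written back, only the mmonit delivery is retried.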


Solution:
#######

The problem is fixed now in cvs, you can get the code here:
http://savannah.nongnu.org/cvs/?group=monit

We should release the next monit beta soon (most probably within two weeks).


Possible workarounds:
##################

1.) once both the mmonit and mail servers are online again, the delivery will succeed and the event will be removed from the queue

2.) or, if you want, you can just remove the event from Monit's queue (delete the file /var/monit/1227064336_devel.kisise). Each queued event is stored in a standalone file and can be removed at any time; Monit doesn't keep the queued events in memory.
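
If there are many files in the queue, a small script can help to pick the right one for workaround 2. The following Python sketch is just a convenience, not part of Monit. The helper names parse_event_name and find_events are hypothetical, and the "<unix timestamp>_<service name>" file naming is inferred from the directory listing in the report below, so treat it as an assumption.

```python
import os
from datetime import datetime, timezone


def parse_event_name(filename):
    """Split '<epoch>_<service>' into (UTC datetime, service name).

    Assumption: the part before the first underscore is a Unix timestamp.
    This is inferred from the file names in the report, not from Monit docs.
    """
    stamp, _, service = filename.partition("_")
    return datetime.fromtimestamp(int(stamp), tz=timezone.utc), service


def find_events(queue_dir, needle):
    """Return the sorted names of queue files whose raw bytes contain needle.

    The files are binary, so they are read as bytes, much like running
    strings or grep over them.
    """
    matches = []
    for name in sorted(os.listdir(queue_dir)):
        path = os.path.join(queue_dir, name)
        if not os.path.isfile(path):
            continue
        with open(path, "rb") as fh:
            if needle in fh.read():
                matches.append(name)
    return matches
```

For example, find_events("/var/monit", b"loadavg(1min) of 3.1") should list the files that carry the repeated alert, and those are the candidates for os.remove(). Under the naming assumption above, parse_event_name("1227064336_devel.kisise") gives 2008-11-19 03:12:16 UTC, which is 05:12:16 at +0200 and fits the time of the first alert in the report.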




The changelog excerpt:

* If both event handlers (M/Monit and mail alerts) temporarily failed at
  once and the event queue is enabled, the event is stored in the queue and
  delivery is retried next cycle. However, the delivery was retried every
  cycle for both handlers if just one of them recovered. Monit thus could
  deliver the same message multiple times until both handlers recovered.
  The problem is fixed now and only one copy of the event is sent even if
  only one handler recovered.


Martin



On Nov 19, 2008, at 12:31 PM, Aleksander Kamenik wrote:

Hi,

This bug has now occurred for the second time; the first time was on 13th Nov, also with beta4.


monit detects a high load at 05:12 (expected):

"Monit alert devel.kisise at Wed, 19 Nov 2008 05:12:11 +0200 on devel

 loadavg(1min) of 3.1 matches resource limit [loadavg(1min)>3.0]"

This load lasts only for a minute, but instead of the resource succeeded message I get the same message in the next cycle (55s). And the next cycle, and the next one, etc.

I got almost 400 messages, all exactly the same, before I arrived at work at noon and shut down monit.

monit unmonitor all did not stop the messages from being sent. monit summary showed that no services were monitored, but the messages still kept coming.

Shutting down monit stopped the messages, but as soon as I started monit up again, even with the services unmonitored, it started spamming me with the same message again. I tried to monitor and unmonitor again, but this did not help.

So this buggy state survives restarts.

I shut down monit again and here's my little investigation, note the bunch of *.devel.kisise files:

devel:/var/monit # pwd
/var/monit
devel:/var/monit # ll
total 136
-rw------- 1 root root 154 Nov 18 05:13 1226977982_devel.kisise
-rw------- 1 root root 152 Nov 18 05:39 1226979565_devel.kisise
-rw------- 1 root root 196 Nov 19 05:12 1227064336_devel.kisise
-rw------- 1 root root 154 Nov 19 05:12 1227064376_devel.kisise
-rw------- 1 root root 152 Nov 19 05:38 1227065904_devel.kisise
-rw------- 1 root root 156 Nov 19 10:31 1227083466_apache2_bin
-rw------- 1 root root 157 Nov 19 10:31 1227083466_apache2_init
-rw------- 1 root root 154 Nov 19 10:31 1227083466_bootfs
-rw------- 1 root root 152 Nov 19 10:31 1227083466_cron
-rw------- 1 root root 154 Nov 19 10:31 1227083466_devel.kisise
-rw------- 1 root root 160 Nov 19 10:31 1227083466_mysqld_bin
-rw------- 1 root root 161 Nov 19 10:31 1227083466_mysqld_init
-rw------- 1 root root 164 Nov 19 10:31 1227083466_mysqldsafe_bin
-rw------- 1 root root 157 Nov 19 10:31 1227083466_ntpd_bin
-rw------- 1 root root 158 Nov 19 10:31 1227083466_ntpd_init
-rw------- 1 root root 157 Nov 19 10:31 1227083466_postfix_bin
-rw------- 1 root root 158 Nov 19 10:31 1227083466_postfix_init
-rw------- 1 root root 154 Nov 19 10:31 1227083466_rootfs
-rw------- 1 root root 157 Nov 19 10:31 1227083466_samba_init
-rw------- 1 root root 154 Nov 19 10:31 1227083466_sshd_bin
-rw------- 1 root root 155 Nov 19 10:31 1227083466_sshd_init
-rw------- 1 root root 152 Nov 19 10:31 1227083469_apache2
-rw------- 1 root root 155 Nov 19 10:31 1227083469_mysql
-rw------- 1 root root 153 Nov 19 10:31 1227083469_ntpd
-rw------- 1 root root 153 Nov 19 10:31 1227083469_postfix
-rw------- 1 root root 161 Nov 19 10:31 1227083469_samba_smbd_bin
-rw------- 1 root root 150 Nov 19 10:31 1227083469_smb
-rw------- 1 root root 150 Nov 19 10:31 1227083469_sshd
-rw------- 1 root root 146 Nov 19 10:37 1227083853_devel.kisise
-rw------- 1 root root 146 Nov 19 11:22 1227086528_devel.kisise
-rw------- 1 root root 146 Nov 19 13:16 1227093413_devel.kisise
-rw------- 1 root root 152 Nov 19 13:17 1227093437_devel.kisise
-rw------- 1 root root 154 Nov 19 13:19 1227093593_devel.kisise
-rw------- 1 root root 146 Nov 19 13:21 1227093677_devel.kisise
devel:/var/monit # grep 3.1 *
Binary file 1227064336_devel.kisise matches
devel:/var/monit # strings 1227064336_devel.kisise
devel.kisise
loadavg(1min) of 3.1 matches resource limit [loadavg(1min)>3.0]
devel:/var/monit #

The only fortunate thing about this is that devel.kisise is the only box which sends only emails, but no sms. :)

This error obviously does not occur every night; tonight was the second time. Last time a proper restart of monit killed the bug, but this time it did not.

The box is running SLES10SP2 x86. This is monit 5.0 beta4, I'd say this bug was introduced in one of the last betas.

If you need any more info, ask.

Regards,

--

Aleksander Kamenik
System Administrator
Krediidiinfo AS
an Experian Company
Phone: +372 665 9649
Email: address@hidden

http://www.krediidiinfo.ee/
http://www.experiangroup.com/






