Re: [monit] monit 5.0 beta4 bug - sends same message every cycle
From: Martin Pala
Subject: Re: [monit] monit 5.0 beta4 bug - sends same message every cycle
Date: Wed, 19 Nov 2008 21:52:10 +0100
Hi,
this bug should be fixed in cvs (I found it yesterday). The
problem is not new in beta4 - it has been present in Monit since the
event queue was implemented.
When the event queue is enabled and the delivery of an event fails,
Monit stores the event in the queue and retries the delivery next
cycle. There is a flag for the event which says whether just the mail
delivery, just the mmonit event delivery, or both (mail and mmonit)
failed. The flag is stored in the queue along with the event. Note
that only the delivery which failed is retried.
Now the problem with mail flooding: if both mail (for example a
temporary mail outage) and mmonit (for example mmonit is stopped or
rejects the message) failed, then the event in Monit's queue is
marked to be retried for both mail and mmonit delivery. If one of these
handlers succeeds, Monit updates the flag so that the given handler
is not retried next cycle, since the message was delivered. If
both handlers succeed, the event file is removed from the queue.
However, if only one handler succeeded, the updated flag was not
written back to the queue file. Even though the delivery to one
handler had succeeded, Monit tried again next cycle because it read
the flag from the file and "forgot" that the retry had been successful
in the previous cycle.
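
The flag handling described above can be modelled in a few lines. This
is only a minimal Python sketch, not Monit's actual C code: the names
HANDLER_MAIL, HANDLER_MMONIT, process_queue_entry and simulate are
hypothetical, and a dict stands in for the queue files on disk. It
shows why not writing the updated flag back makes Monit retry a
handler that has already delivered the message.

```python
# Illustrative model only; Monit is written in C and these names are made up.
HANDLER_MAIL = 0x1    # hypothetical bit: mail delivery still pending
HANDLER_MMONIT = 0x2  # hypothetical bit: mmonit delivery still pending


def process_queue_entry(disk, name, deliver, persist_partial):
    """Retry one queued event and return the list of handlers attempted.

    disk            -- dict standing in for the queue directory (name -> flag)
    deliver         -- callable(handler_bit) -> True if delivery succeeded
    persist_partial -- False models the old behaviour, True models the fix
    """
    pending = disk[name]            # the flag is re-read from the "file"
    attempted = []
    for handler in (HANDLER_MAIL, HANDLER_MMONIT):
        if pending & handler:
            attempted.append(handler)
            if deliver(handler):
                pending &= ~handler  # no further retry needed for this one
    if pending == 0:
        del disk[name]              # both delivered: remove the queue file
    elif persist_partial:
        disk[name] = pending        # the fix: write the updated flag back
    return attempted


def simulate(persist_partial, cycles=3):
    """Mail has recovered, mmonit stays down; count the mails sent."""
    name = "1227064336_devel.kisise"
    disk = {name: HANDLER_MAIL | HANDLER_MMONIT}
    mails_sent = 0
    for _ in range(cycles):
        if name not in disk:
            break
        attempted = process_queue_entry(
            disk, name, lambda handler: handler == HANDLER_MAIL, persist_partial
        )
        mails_sent += attempted.count(HANDLER_MAIL)
    return mails_sent
```

With mail recovered and mmonit still down, simulate(False) sends the
mail again on every cycle, while simulate(True) sends it only once.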
Solution:
#######
The problem is fixed now in cvs, you can get the code here:
http://savannah.nongnu.org/cvs/?group=monit
We should release the next Monit beta soon (most probably within two weeks).
Possible workarounds:
##################
1.) once both the mmonit and mail servers are online again, the
delivery will succeed and the event will be removed from the queue
2.) or, if you want, you can just remove the event from Monit's queue
(delete the file /var/monit/1227064336_devel.kisise ... each queued
event is stored in a standalone file and can be removed at any time -
Monit doesn't keep the queued events in memory)
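
For the second workaround it can help to see what is queued before
deleting anything. The helper below is a sketch under assumptions, not
part of Monit: it assumes the queue file names follow the
<unix timestamp>_<service> pattern visible in the directory listing
quoted below, and it pulls printable text out of the binary files much
as strings(1) does. The function names are hypothetical, and nothing
here relies on the actual queue file format.

```python
import os
import re
from datetime import datetime, timezone


def parse_queue_name(filename):
    """Split '<timestamp>_<service>' into (UTC datetime, service), else None."""
    match = re.fullmatch(r"(\d+)_(.+)", filename)
    if match is None:
        return None
    when = datetime.fromtimestamp(int(match.group(1)), tz=timezone.utc)
    return when, match.group(2)


def printable_strings(data, min_len=4):
    """Return runs of at least min_len printable ASCII characters."""
    pattern = rb"[\x20-\x7e]{%d,}" % min_len
    return [run.decode("ascii") for run in re.findall(pattern, data)]


def list_queue(directory, service=None):
    """Yield (path, datetime, service, strings) for each queued event file.

    If service is given, only files whose name ends in that service are
    returned. Files that do not match the assumed naming are skipped.
    """
    for name in sorted(os.listdir(directory)):
        parsed = parse_queue_name(name)
        if parsed is None:
            continue
        when, svc = parsed
        if service is not None and svc != service:
            continue
        path = os.path.join(directory, name)
        with open(path, "rb") as handle:
            yield path, when, svc, printable_strings(handle.read())
```

Iterating over list_queue("/var/monit", service="devel.kisise") and
printing each tuple shows which file holds the repeating alert; passing
that path to os.remove() then mirrors the manual deletion.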
The changelog excerpt:
* If both event handlers (M/Monit and mail alerts) temporarily failed at
  once and the event queue is enabled, the event is stored in the queue
  and delivery retried next cycle. However, the delivery was retried
  every cycle for both handlers if just one of them was recovered. Monit
  thus can deliver the same message multiple times until both handlers
  recovered. The problem is fixed now and only one copy of the event is
  sent even if only one handler recovered.
Martin
On Nov 19, 2008, at 12:31 PM, Aleksander Kamenik wrote:
Hi,
This bug has now occurred for the second time; the first time was on
13th Nov, also on beta4.
monit detects a high load at 05:12 (expected):
"Monit alert devel.kisise at Wed, 19 Nov 2008 05:12:11 +0200 on devel
loadavg(1min) of 3.1 matches resource limit [loadavg(1min)>3.0]"
This load lasts only a minute, but instead of the
resource succeeded message I get the same message the next cycle
(55s). And the next cycle and the next one etc.
I got almost 400 messages, all exactly the same, before I arrived at
work at noon and shut down monit.
monit unmonitor all did not stop the messages from being sent. monit
summary showed that no services were monitored, but the messages
still kept coming.
Shutting down monit stopped the messages, but as soon as I started
monit up again, even with the services unmonitored, it started
spamming me with the same message again. I tried to monitor and
unmonitor again, but this did not help.
So this buggy state survives restarts.
I shut down monit again and here's my little investigation, note the
bunch of *.devel.kisise files:
devel:/var/monit # pwd
/var/monit
devel:/var/monit # ll
total 136
-rw------- 1 root root 154 Nov 18 05:13 1226977982_devel.kisise
-rw------- 1 root root 152 Nov 18 05:39 1226979565_devel.kisise
-rw------- 1 root root 196 Nov 19 05:12 1227064336_devel.kisise
-rw------- 1 root root 154 Nov 19 05:12 1227064376_devel.kisise
-rw------- 1 root root 152 Nov 19 05:38 1227065904_devel.kisise
-rw------- 1 root root 156 Nov 19 10:31 1227083466_apache2_bin
-rw------- 1 root root 157 Nov 19 10:31 1227083466_apache2_init
-rw------- 1 root root 154 Nov 19 10:31 1227083466_bootfs
-rw------- 1 root root 152 Nov 19 10:31 1227083466_cron
-rw------- 1 root root 154 Nov 19 10:31 1227083466_devel.kisise
-rw------- 1 root root 160 Nov 19 10:31 1227083466_mysqld_bin
-rw------- 1 root root 161 Nov 19 10:31 1227083466_mysqld_init
-rw------- 1 root root 164 Nov 19 10:31 1227083466_mysqldsafe_bin
-rw------- 1 root root 157 Nov 19 10:31 1227083466_ntpd_bin
-rw------- 1 root root 158 Nov 19 10:31 1227083466_ntpd_init
-rw------- 1 root root 157 Nov 19 10:31 1227083466_postfix_bin
-rw------- 1 root root 158 Nov 19 10:31 1227083466_postfix_init
-rw------- 1 root root 154 Nov 19 10:31 1227083466_rootfs
-rw------- 1 root root 157 Nov 19 10:31 1227083466_samba_init
-rw------- 1 root root 154 Nov 19 10:31 1227083466_sshd_bin
-rw------- 1 root root 155 Nov 19 10:31 1227083466_sshd_init
-rw------- 1 root root 152 Nov 19 10:31 1227083469_apache2
-rw------- 1 root root 155 Nov 19 10:31 1227083469_mysql
-rw------- 1 root root 153 Nov 19 10:31 1227083469_ntpd
-rw------- 1 root root 153 Nov 19 10:31 1227083469_postfix
-rw------- 1 root root 161 Nov 19 10:31 1227083469_samba_smbd_bin
-rw------- 1 root root 150 Nov 19 10:31 1227083469_smb
-rw------- 1 root root 150 Nov 19 10:31 1227083469_sshd
-rw------- 1 root root 146 Nov 19 10:37 1227083853_devel.kisise
-rw------- 1 root root 146 Nov 19 11:22 1227086528_devel.kisise
-rw------- 1 root root 146 Nov 19 13:16 1227093413_devel.kisise
-rw------- 1 root root 152 Nov 19 13:17 1227093437_devel.kisise
-rw------- 1 root root 154 Nov 19 13:19 1227093593_devel.kisise
-rw------- 1 root root 146 Nov 19 13:21 1227093677_devel.kisise
devel:/var/monit # grep 3.1 *
Binary file 1227064336_devel.kisise matches
devel:/var/monit # strings 1227064336_devel.kisise
devel.kisise
loadavg(1min) of 3.1 matches resource limit [loadavg(1min)>3.0]
devel:/var/monit #
The only fortunate thing about this is that devel.kisise is the
only box which sends only emails, but no sms. :)
This error obviously does not occur every night; tonight was only the
second time. Last time a proper restart of monit cleared the bug, but
not this time.
The box is running SLES10SP2 x86. This is monit 5.0 beta4, I'd say
this bug was introduced in one of the last betas.
If you need any more info, ask.
Regards,
--
Aleksander Kamenik
System Administrator
Krediidiinfo AS
an Experian Company
Phone: +372 665 9649
Email: address@hidden
http://www.krediidiinfo.ee/
http://www.experiangroup.com/