monit-general

Re: Monit dependency problem (bug?)


From: drich
Subject: Re: Monit dependency problem (bug?)
Date: Tue, 13 Dec 2011 13:37:49 -0800
User-agent: Roundcube Webmail/0.6

I think I know what is happening, but I'm not sure how to fix it (yet, I hope).

Running "monit stop ospfd" is causing monit to wake up and start processing, which then triggers the "if does not exist" in the apache block. Is there a way to only have monit execute that block one time or only execute it on state change? I'm assuming that "if recovered" will only happen when the application first recovers and not every time it is up (is that a valid assumption? it isn't in the docs that I can find), so is there an equivalent for "if does not exist"?

Is there any choice other than creating a semaphore file and doing something like:

  if does not exist
    then exec "/bin/bash -c 'if [ ! -f /tmp/monit.apachedown ]; then touch /tmp/monit.apachedown; /usr/bin/monit stop ospfd; fi'"
    else if recovered then exec "/bin/bash -c 'rm /tmp/monit.apachedown && /usr/bin/monit monitor ospfd'"

My big concern with that is getting into a state where apache is up and the file still exists, so ospfd will not go down if apache fails.

The good thing about the above is that I can add the dependency statements back to my ospfd config and that it does bring ospfd down when apache fails.

The downside is that it never runs the restart. Can you see anything wrong with the following block that would prevent it from trying to restart apache? If I explicitly run "monit restart apache" it will restart, delete the semaphore and restart ospfd; but it will never do it by itself. Does the "does not exist" check succeeding prevent the "if failed" check from running? I don't ever see a timeout in the logs.

check process apache with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program  = "/etc/init.d/httpd stop"
  if does not exist
    then exec "/bin/bash -c 'if [ ! -f /tmp/monit.apachedown ]; then touch /tmp/monit.apachedown; /usr/bin/monit stop ospfd; fi'"
    else if recovered then exec "/bin/bash -c 'rm /tmp/monit.apachedown && /usr/bin/monit monitor ospfd'"
  if failed host localhost port 80 protocol http
     and request "/" then restart
  if children > 50 then restart
  if 2 restarts within 2 cycles then timeout
  group server
  depends on tomcat

And the log from an httpd failure says:

Dec 13 13:18:22 tecate monit[13602]: 'apache' process is not running
Dec 13 13:18:22 tecate monit[13602]: 'apache' exec: /bin/bash
Dec 13 13:18:22 tecate monit[13602]: 'ospfd' stop on user request
Dec 13 13:18:22 tecate monit[13602]: monit daemon at 13602 awakened
Dec 13 13:18:22 tecate monit[13602]: Awakened by User defined signal 1
Dec 13 13:18:22 tecate monit[13602]: 'ospfd' stop: /etc/init.d/ospfd
Dec 13 13:18:22 tecate monit[13602]: 'ospfd' stop action done
Dec 13 13:18:22 tecate monit[13602]: 'apache' process is not running
Dec 13 13:18:22 tecate monit[13602]: 'apache' exec: /bin/bash
Dec 13 13:18:22 tecate monit[13602]: 'ospfd' unmonitor on user request
Dec 13 13:18:22 tecate monit[13602]: monit daemon at 13602 awakened
Dec 13 13:18:22 tecate monit[13602]: Awakened by User defined signal 1
Dec 13 13:18:22 tecate monit[13602]: 'ospfd' unmonitor action done
Dec 13 13:18:22 tecate monit[13602]: 'apache' process is not running
Dec 13 13:18:22 tecate monit[13602]: 'apache' exec: /bin/bash
Dec 13 13:19:22 tecate monit[13602]: 'apache' process is not running
Dec 13 13:19:22 tecate monit[13602]: 'apache' exec: /bin/bash
Dec 13 13:20:22 tecate monit[13602]: 'apache' process is not running
Dec 13 13:20:22 tecate monit[13602]: 'apache' exec: /bin/bash
... which repeats until I run monit restart apache ...
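
The semaphore handling described earlier in this message could also live in a small helper script instead of two inline bash one-liners. What follows is only a sketch under stated assumptions: the function names, the "down"/"up" arguments and the use of a directory rather than a file as the marker are hypothetical and not part of Monit. A directory is used because mkdir either creates it or fails in a single step, so two overlapping invocations cannot both decide they were first. The "up" path clears the marker unconditionally, so a missing marker does not block re-enabling ospfd.

```shell
#!/bin/bash
# Hypothetical helper; paths and names are assumptions, not Monit features.
SEMAPHORE="${SEMAPHORE:-/tmp/monit.apachedown}"
MONIT="${MONIT:-/usr/bin/monit}"

# Intended for "if does not exist": stop ospfd only on the first failed cycle.
apache_down() {
    # mkdir is atomic: it succeeds for exactly one caller until the marker is removed.
    if mkdir "$SEMAPHORE" 2>/dev/null; then
        "$MONIT" stop ospfd
    fi
}

# Intended for "else if recovered": clear the marker, then re-enable ospfd.
apache_up() {
    # A missing marker is not treated as an error here.
    rmdir "$SEMAPHORE" 2>/dev/null || true
    "$MONIT" monitor ospfd
}

# Dispatch on the first argument; do nothing when called without one.
case "${1:-}" in
    down) apache_down ;;
    up)   apache_up ;;
esac
```

If the script were installed as, say, /usr/local/bin/apache-ospfd.sh (again a made-up path), the two exec lines would call it with "down" and "up" respectively. This does not answer why the restart never runs; it only narrows the cases in which a stale marker can keep ospfd from being stopped or re-monitored.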



On 08.12.2011 09:11, drich wrote:

Eric,

That's where I started - the problem with that is that it will start ospfd every time apache fails to restart. I end up with entries in the log like:

Dec  6 08:47:39 tecate monit[9988]: 'apache' process is not running
Dec  6 08:47:39 tecate monit[9988]: 'apache' trying to restart
Dec  6 08:47:39 tecate monit[9988]: 'ospfd' stop: /etc/init.d/ospfd
Dec  6 08:47:39 tecate monit[9988]: 'apache' start: /etc/init.d/httpd
Dec  6 08:47:40 tecate monit[9988]: 'ospfd' unmonitor on user request
Dec  6 08:47:40 tecate monit[9988]: monit daemon at 9988 awakened
Dec  6 08:48:09 tecate monit[9988]: 'apache' failed to start
Dec  6 08:48:09 tecate monit[9988]: 'ospfd' start: /etc/init.d/ospfd
Dec  6 08:48:09 tecate monit[9988]: 'ospfd' unmonitor action done
Dec  6 08:48:09 tecate monit[9988]: Awakened by User defined signal 1

The biggest problem is that when this happens it leaves ospfd running even if apache isn't. Martin commented that dependencies are "soft": they define the start/stop order but don't wait for the parent to recover before starting the dependent service.

I'm going to take a look at the code today; the problem I'm seeing right now looks like a race condition. My guess is that when I call "monit stop ospfd" it hasn't yet marked apache as not existing, so the "if does not exist" block is being executed again and again and again.

Here is the config I am working with now:

check process apache with pidfile /var/run/httpd.pid
  start program = "/etc/init.d/httpd start"
  stop program  = "/etc/init.d/httpd stop"
  if does not exist
    then exec "/usr/bin/monit stop ospfd"
    else if recovered then exec "/usr/bin/monit monitor ospfd"
  if failed host localhost port 80 protocol http
     and request "/" then restart
  if children > 50 then restart
  if 2 restarts within 2 cycles then timeout
  group server

check process ospfd with pidfile /var/run/quagga/ospfd.pid
  start program = "/etc/init.d/ospfd start"
  stop program  = "/etc/init.d/ospfd stop"
  group network


On 08.12.2011 00:10, Eric Pailleau wrote:

Hello,
did you simply try this ?

---8<---
    [...]
    if children > 50 then restart
    if 2 restarts within 2 cycles then timeout
    group server
    depends on tomcat
check process ospfd with pidfile /var/run/quagga/ospfd.pid
    start program = "/etc/init.d/ospfd start"
    stop program  = "/etc/init.d/ospfd stop"
    depends on apache
    depends on fcserver
    depends on mysql
    depends on tomcat
    group network
---8<---

Taking out the depends doesn't make a difference; it still stays in that loop where it is spewing to the logs.

I'm off-site today, I'll look at this more tomorrow morning when I can pay attention to it rather than to the lecture I'm supposed to be listening to. :-)

On 07.12.2011 13:13, Martin Pala wrote:
Yes, Eric is correct. The "monit stop…" in the exec action cannot be combined in this case with the "depends on…"

 

--
Dan Rich
http://www.employees.org/~drich/
"Step up to red alert!" "Are you sure, sir?
It means changing the bulb in the sign..."

      - Red Dwarf (BBC)

 

