monit-general
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: problem with apache2 monitoring (I think)


From: Mike Jackson
Subject: Re: problem with apache2 monitoring (I think)
Date: Thu, 20 Apr 2006 16:08:08 -0700

I'm just guessing here, but I'm wondering if monit is triggering because your server has hit the MaxChildren in Apache and has queued its request, so monit thinks Apache is down because it's taking too long to respond. I've had similar problems, so here's how I currently have monit configured to monitor apache:

check process apache with pidfile /usr/local/apache/logs/httpd.pid
   start program = "/etc/rc.d/init.d/httpd start"
   stop program  = "/etc/rc.d/init.d/httpd stop"
if failed url http://www.example.com/ content == "example content" timeout 30 seconds 2 times within 3 cycles then restart if failed url https://www.example.com/ content == "example content" timeout 30 seconds 2 times within 3 cycles then restart

Between the longer timeout and 2-out-of-3-cycles requirements, my false positives have become nearly nonexistent.

Also, I emailed the list a few days ago about a problem restarting Apache. apachectl returns very quickly from stopping the server, sometimes before all the children have exited. I added a sleep statement to the stop() action in my init.d script to make sure that it had enough time to shut down before returning to Monit.


----- Original Message ----- From: "alex black" <address@hidden>
To: "This is the general mailing list for monit" <address@hidden>
Cc: "Aktiom Networks Support" <address@hidden>
Sent: Thursday, April 20, 2006 15:42
Subject: problem with apache2 monitoring (I think)


hi all,

I have a very, very _very_ strange problem with apache2. It appears to be going down randomly, and monit seems to be sort of screwing up the restart. I'll explain:

Monit has for a while been reporting that its connection to apache fails, tries to restart it, the restart fails, then tries to start it, then successfully connects. This process takes over 2 minutes in some cases (I have a 2min monit monitoring interval).

Analysis of apache logs and var/log/* wasn't turning up much of anything, so I wrote a little script that runs from cron every 15 seconds. It does a simple grep on curl mysite.com and if it can't connect will send a very useful, detailed dump of the state of the machine (monit status output for apache, process list, specific grep of the last 100 log entries from apache logs, etc).

Here's an excerpt from syslog - note that the "logger" entries with apache2 test PASSED|FAILED are my little custom script which is running every 15 seconds.

################ /var/log/syslog

Apr 20 16:05:46 myhost last message repeated 3 times
Apr 20 16:06:02 myhost logger: apache2 test PASSED
Apr 20 16:06:33 myhost last message repeated 2 times
Apr 20 16:06:47 myhost logger: apache2 test PASSED
Apr 20 16:07:02 myhost logger: apache2 test PASSED
Apr 20 16:07:46 myhost last message repeated 3 times
Apr 20 16:08:02 myhost logger: apache2 test PASSED
Apr 20 16:08:46 myhost last message repeated 3 times
Apr 20 16:09:01 myhost /USR/SBIN/CRON[17282]: (root) CMD ( [ -d /var/ lib/php4 ] && find /var/lib/php4/ -type f -cmin +$(/usr/lib/php4/ maxlifetime) -print0 | xargs -r -0 rm)

### remove old session files

Apr 20 16:09:03 myhost logger: apache2 test PASSED
Apr 20 16:09:17 myhost logger: apache2 test PASSED
Apr 20 16:09:34 myhost logger: apache2 test PASSED
Apr 20 16:09:42 myhost monit[24001]: HTTP: error receiving data -- Resource temporarily unavailable Apr 20 16:09:42 myhost monit[24001]: 'apache2' failed protocol test [HTTP] at INET[www.mywebsite.com:80].
Apr 20 16:09:42 myhost monit[24001]: 'apache2' trying to restart
Apr 20 16:09:42 myhost monit[24001]: 'apache2' stop: /etc/init.d/apache2
Apr 20 16:09:46 myhost monit[24001]: 'apache2' start: /etc/init.d/ apache2

## note here that apache is not running for a FULL 2 minutes - ouch

Apr 20 16:09:46 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:10:01 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:10:16 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:10:31 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:10:46 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:11:01 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:11:16 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:11:31 myhost logger: apache2 test FAILED - sending  notification
Apr 20 16:11:46 myhost logger: apache2 test FAILED - sending  notification

## samples of the notification output from my script are below.

Apr 20 16:11:46 myhost monit[24001]: 'apache2' process is not running
Apr 20 16:11:46 myhost monit[24001]: 'apache2' trying to restart
Apr 20 16:11:46 myhost monit[24001]: 'apache2' start: /etc/init.d/ apache2
Apr 20 16:11:47 myhost monit[14467]: 'apache2' failed to start

## I'm guessing monit just doesn't wait long enough to run the protocol test ## once the service fails, which is why it thinks that it failed to start.

Apr 20 16:11:48 myhost monit[32720]: 'apache2' started
Apr 20 16:12:02 myhost logger: apache2 test PASSED
Apr 20 16:12:47 myhost last message repeated 3 times
Apr 20 16:13:01 myhost logger: apache2 test PASSED
Apr 20 16:13:46 myhost last message repeated 3 times
Apr 20 16:13:46 myhost monit[24001]: 'apache2' process is running with pid 32724
Apr 20 16:13:50 myhost monit[24001]: 'apache2' connection passed
Apr 20 16:14:02 myhost logger: apache2 test PASSED
Apr 20 16:14:47 myhost last message repeated 3 times
Apr 20 16:15:02 myhost logger: apache2 test PASSED
Apr 20 16:15:46 myhost last message repeated 3 times
Apr 20 16:16:02 myhost logger: apache2 test PASSED
Apr 20 16:16:47 myhost last message repeated 3 times
Apr 20 16:17:01 myhost /USR/SBIN/CRON[24860]: (root) CMD ( run- parts --report /etc/cron.hourly)
Apr 20 16:17:02 myhost logger: apache2 test PASSED
Apr 20 16:17:46 myhost last message repeated 3 times
Apr 20 16:18:02 myhost logger: apache2 test PASSED
#etc...


################ last output from my notification script above

MONIT STATUS
------------------------------------------
The monit daemon 4.5 uptime: 20d 19h 7m

System 'lamp.turingstudio.com'
  load average                      [0.39] [0.23] [0.16]
  cpu                               21.3%us 9.5%sy 0.0%wa
  memory usage                      4069324 kB [68.4%]
  data collected                    Thu Apr 20 16:09:30 2006

Process 'apache2'
  status                            Connection failed
  monitoring status                 monitored
  pid                               -1
  parent pid                        -1
  uptime                            0m
  childrens                         0
  memory kilobytes                  0
  memory kilobytes total            0
  memory percent                    0.0%
  memory percent total              0.0%
  cpu percent                       0.0%
  cpu percent total                 0.0%
port response time -1.000s to www.mywebsite.com:80/ catalog/list.php [HTTP]
  data collected                    Thu Apr 20 16:09:46 2006

File 'apache2_init'
  status                            accessible
  monitoring status                 monitored
  permission                        700
  uid                               0
  gid                               0
  timestamp                         Wed Mar  8 17:55:52 2006
  size                              3583 B
  data collected                    Thu Apr 20 16:07:30 2006

File 'apache2_bin'
  status                            accessible
  monitoring status                 monitored
  permission                        755
  uid                               0
  gid                               0
  timestamp                         Sun Oct 30 15:11:32 2005
  size                              385484 B
  data collected                    Thu Apr 20 16:07:30 2006

PROCESS LIST
------------------------------------------
  PID TTY      STAT   TIME COMMAND
    1 ?        Ss     9:13 ini
13870 ?        Ss     1:37 /sbin/syslogd
20745 ?        S      0:00 /bin/sh /usr/bin/mysqld_safe
20781 ?        S      0:00  \_ /bin/sh /usr/bin/mysqld_safe
20782 ? S 0:13 \_ /usr/sbin/mysqld --basedir=/usr -- datadir=/var/lib/mysql --user=mysql --pid-file=/var/run/mysqld/ mysqld.pid --skip-locking --port=3306 --socket=/var/run/mysqld/ mysqld.sock 20784 ? S 6:55 | \_ /usr/sbin/mysqld --basedir=/ usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/run/mysqld/ mysqld.pid --skip-locking --port=3306 --socket=/var/run/mysqld/ mysqld.sock 20785 ? S 0:00 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20786 ? S 0:00 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20787 ? S 0:00 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20788 ? S 0:00 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20801 ? S 97:36 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20802 ? S 70:01 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20803 ? S 0:00 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20804 ? S 0:22 | \_ /usr/sbin/mysqld -- basedir=/usr --datadir=/var/lib/mysql --user=mysql --pid-file=/var/ run/mysqld/mysqld.pid --skip-locking --port=3306 --socket=/var/run/ mysqld/mysqld.sock 20783 ? S 0:00 \_ logger -p daemon.err -t mysqld_safe -i -t mysqld
29792 ?        Ss     0:39 /usr/sbin/sshd
13825 ?        Ss     0:00  \_ sshd: turing [priv]
14435 ?        S      0:00      \_ sshd: address@hidden/0
14625 pts/0    Ss     0:00          \_ -bash
14917 pts/0    S+     0:00              \_ bash
31705 ?        Ss     1:13 /usr/lib/postfix/master
31708 ?        S      0:38  \_ qmgr -l -t fifo -u -c
 3397 ?        S      0:00  \_ pickup -l -t fifo -u -c
 2496 ?        S      0:00  \_ cleanup -z -t unix -u -c
 2624 ?        S      0:00  \_ trivial-rewrite -n rewrite -t unix -u -c
 2883 ?        S      0:00  \_ smtp -t unix -u -c
 3074 ?        S      0:00  \_ smtp -t unix -u -c
 6016 ?        S      0:00  \_ proxymap -t unix -u
24001 ?        Ss     5:34 /usr/sbin/monit -Ic /etc/monit/monitrc
24033 ?        S      0:50  \_ /usr/sbin/monit -Ic /etc/monit/monitrc
24066 ?        S      1:40      \_ /usr/sbin/monit -Ic /etc/monit/ monitrc
14467 ?        S      0:00      \_ /usr/sbin/monit -Ic /etc/monit/ monitrc
12992 ?        Ss     0:04 /usr/sbin/cron
18441 ?        S      0:00  \_ /USR/SBIN/CRON
18446 ? Ss 0:00 \_ /bin/sh -c sleep 45; /root/lamp- apache2-watcher.sh 31797 ? S 0:00 \_ /bin/sh -c sleep 45; /root/ lamp-apache2-watcher.sh
32064 ?        R      0:00              \_ ps axf


### note the absence of apache from the processlist!!

I'm so glad to be running rails and lighttpd now. I have never had a moment's trouble with a crashing rails app - they all have insanely detailed backtraces - and lighty simply never goes down - it also starts up in under a second.

Apache2 has turned into a big ugly pig. :(

_a



--
alex black, founder
the turing studio, inc.

510.666.0074
address@hidden
http://www.turingstudio.com

2600 10th street, suite 635
berkeley, ca 94710





--
To unsubscribe:
http://lists.nongnu.org/mailman/listinfo/monit-general







reply via email to

[Prev in Thread] Current Thread [Next in Thread]