Re: uptime weirdness
From: Martin Pala
Subject: Re: uptime weirdness
Date: Thu, 19 Aug 2010 13:20:10 +0200
The next monit version (5.2) supports monitoring without pidfiles: you specify
a process pattern, which is compared with the running processes. This allows
monit to watch processes that have no pidfile, and the check no longer depends
on the pidfile content.
Changelog excerpt:
--8<--
* Added support for monitoring processes without pidfile using pattern matching.
You can use POSIX regular expressions or full strings matching process name.
The process string corresponds to output of 'ps' utility. Some platforms
like Mac OS X require super-user privileges to get full process name.
The first match is used so this form of check is useful for unique
pattern matching - the pidfile should be used where possible as it defines
expected pid exactly (pattern matching won't be useful for Apache for
example).
Example usage (monitoring VMware virtual machine):
check process vmware-debian matching "/usr/lib/vmware/bin/vmware-vmx .*debian4-x86.vmx"
...
--8<--
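To illustrate the "first match is used" behaviour described in the excerpt,
here is a minimal Python sketch. It is not monit's implementation: the function
name find_first_match and the process table are hypothetical, and Python's re
module stands in for POSIX regular expressions (the example pattern behaves the
same way in both).

```python
import re

def find_first_match(pattern, process_table):
    """Return the pid of the first process whose command line matches.

    process_table is an iterable of (pid, command_line) pairs, for example
    parsed from 'ps' output. Returns None when nothing matches.
    """
    regex = re.compile(pattern)
    for pid, cmdline in process_table:
        if regex.search(cmdline):
            return pid  # first match wins; later matches are never examined
    return None

# Hypothetical process table, for illustration only.
processes = [
    (812, "/usr/sbin/sshd -D"),
    (1440, "/usr/lib/vmware/bin/vmware-vmx -x /vm/debian4-x86.vmx"),
    (1502, "/usr/lib/vmware/bin/vmware-vmx -x /vm/debian5-x86.vmx"),
]

unique = find_first_match(r"/usr/lib/vmware/bin/vmware-vmx .*debian4-x86.vmx", processes)
broad = find_first_match(r"vmware-vmx", processes)
```

With the broad pattern both virtual machines qualify, but only the first one
found is returned. That is why the pattern should be unique, and why a pidfile
remains the better choice for multi-process services such as Apache.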
On Aug 19, 2010, at 2:56 AM, Gareth Pye wrote:
> Sorry for not replying earlier, your response had cleared up things for me.
>
> Until today when it struck me how much of a huge bug this is. If a system is
> power cycled (no normal shutdown procedure) so that the old pid files still
> exist and some other random process is running with that pid then the task
> that monit is meant to be monitoring will never be started.
>
> The case I had just a few minutes ago was that the pid file ended up pointing
> to monit itself.
>
> Obviously the simple hack is to remove all pid files before starting monit
> (or at least at some point in the boot procedure but before monit has started
> the processes seems most efficient). Wouldn't it make sense for monit to
> ensure that the pid files aren't older than the current system uptime?
> Obviously a process can't have been running longer than the host system.
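
The check proposed above can be sketched in a few lines of Python. This only
illustrates the idea and is not something monit is stated to do in this thread.
All function names are hypothetical, reading /proc/uptime is a Linux-specific
assumption, and the comparison assumes the system clock was not adjusted after
boot.

```python
import time

def read_linux_uptime(path="/proc/uptime"):
    """Return system uptime in seconds (assumes a Linux-style /proc/uptime)."""
    with open(path) as f:
        return float(f.read().split()[0])

def boot_time_from_uptime(uptime_seconds, now=None):
    """Estimate the boot timestamp from the current time and the uptime."""
    now = time.time() if now is None else now
    return now - uptime_seconds

def pidfile_is_stale(pidfile_mtime, boot_time):
    """A pidfile last written before boot cannot describe a running process."""
    return pidfile_mtime < boot_time

# Worked example with fixed numbers: the system booted one hour ago.
now = 1_000_000.0
boot = boot_time_from_uptime(3600, now=now)
old_pidfile = pidfile_is_stale(990_000.0, boot)   # written before boot
new_pidfile = pidfile_is_stale(999_000.0, boot)   # written after boot
```

In practice the modification time would come from os.path.getmtime() on the
pidfile. The check works in one direction only: a pidfile newer than the boot
time can still hold a wrong pid.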
>
> Gareth Pye
> Engineer
> GPSat Systems Australia
> address@hidden
> Ph: 03 9455 0041
> Fax: 03 9455 0042
>
>
> On 12/08/10 20:56, Martin Pala wrote:
>> I'm not sure what system uptime in your case is - the attached monit status
>> output contains following uptimes only:
>>
>> 1.) monit uptime: 49m => monit was started 49 minutes ago (system itself
>> may be running much longer - this uptime is updated whenever monit itself is
>> (re)started)
>> 2.) process 'BoomDataToMODBUS' uptime: 45m
>> 3.) process 'DataRouter' uptime: 21h9m
>>
>> => if the system was started less than 21h9m ago at the point when monit
>> status was taken, then the reported uptime of DataRouter process is wrong.
>> With monit-5.0.3 it could happen because it's based on the pidfile's
>> timestamp. The next monit release (5.2) fixes this problem. Monit-5.2
>> changelog excerpt:
>>
>> --8<--
>> * Show real process uptime - formerly the presented uptime was based on
>> create/modify timestamp of process' pidfile which provides invalid uptime
>> if the pidfile is replaced and process keeps running with original PID
>> (such as on apache reload).
>> Thanks to Nima Chavooshi for report.
>> --8<--
>>
>> Regards,
>> Martin
>>
>>
>>
>> On Aug 12, 2010, at 2:01 AM, Gareth Pye wrote:
>>
>>
>>> I've just noticed that the uptime for one of my processes as reported by
>>> monit is greater than the system time. Is this plausible?
>>>
>>> The Monit daemon 5.0.3 uptime: 49m
>>>
>>> Process 'BoomDataToMODBUS'
>>> status running
>>> monitoring status monitored
>>> pid 909
>>> parent pid 1
>>> uptime 45m
>>> children 0
>>> memory kilobytes 1880
>>> memory kilobytes total 1880
>>> memory percent 1.4%
>>> memory percent total 1.4%
>>> cpu percent 0.0%
>>> cpu percent total 0.0%
>>> data collected Wed Aug 11 16:40:27 2010
>>>
>>> Process 'DataRouter'
>>> status running
>>> monitoring status monitored
>>> pid 901
>>> parent pid 886
>>> uptime 21h 9m
>>> children 0
>>> memory kilobytes 3232
>>> memory kilobytes total 3232
>>> memory percent 2.5%
>>> memory percent total 2.5%
>>> cpu percent 0.0%
>>> cpu percent total 0.0%
>>> data collected Wed Aug 11 16:40:27 2010
>>>
>>> File 'user.config'
>>> status accessible
>>> monitoring status monitored
>>> permission 644
>>> uid 0
>>> gid 0
>>> timestamp Wed Aug 11 15:50:56 2010
>>> size 1295 B
>>> checksum deeaffe3f625e93f00aeead0a0a3abd5(MD5)
>>> data collected Wed Aug 11 16:40:27 2010
>>>
>>> Filesystem 'root'
>>> status accessible
>>> monitoring status monitored
>>> permission 755
>>> uid 0
>>> gid 0
>>> filesystem flags 0
>>> block size 4096 B
>>> blocks total 120169 [469.4 MB]
>>> blocks free for non superuser 14286 [55.8 MB] [11.9%]
>>> blocks free total 14286 [55.8 MB] [11.9%]
>>> inodes total 134976
>>> inodes free 116759 [86.5%]
>>> data collected Wed Aug 11 16:40:27 2010
>>>
>>> System 'Test-Base'
>>> status running
>>> monitoring status monitored
>>> load average [0.00] [0.00] [0.00]
>>> cpu 0.0%us 0.1%sy 0.0%wa
>>> memory usage 12892 kB [10.1%]
>>> data collected Wed Aug 11 16:40:27 2010
>>>
>>> --
>>> Gareth Pye
>>> Engineer
>>> GPSat Systems Australia
>>> address@hidden
>>> Ph: 03 9455 0041
>>> Fax: 03 9455 0042
>>>
>>>
>>> --
>>> To unsubscribe:
>>> http://lists.nongnu.org/mailman/listinfo/monit-general
>>>