From: Frank Steiner
Subject: [Freeipmi-devel] Re: Another FreeIPMI beta w/ BMC watchdog workaround for Sun machines
Date: Mon, 05 Jul 2010 08:52:57 +0200
User-agent: Thunderbird (X11/20100302)

Hi Al,

Albert Chu wrote

> Hey Dave, Frank,
> As discussed in the previous thread, there was a corner case in the
> bmc-watchdog workaround I previously did.  I then discovered another
> corner case w/ the workaround.
> There is a new beta here.

Sorry, I was away, but I'm going to test the new beta now. During my
absence the Sun X4100M2 machines produced two strange things:

1) bmc-watchdog: Get Watchdog Timer Error: No error message found for 
   command 25h, network function 06h, and completion code 80h.  Please 
   report to <address@hidden>

2) The really bad thing was that three of the X4100M2 machines were
   rebooted by the watchdog, I guess in reaction to a "bmc-watchdog -s -k"
   call. The timer runs for 15 minutes and I reset the watchdog from two
   independent instances every 3 minutes. On all three machines I found
   this in the logs:

   Jul  3 21:03:01 sunserver8 /usr/sbin/cron[11808]: (root) CMD 
   Jul  3 21:03:04 sunserver8 pm-profiler: Power Button pressed, executing /sbin/shutdown -h now
   Jul  3 21:03:04 sunserver8 shutdown[11853]: shutting down for system halt

   The bmc-reset script just does this:
   for name in `seq 1 15`
   do
     # -s -k means: reset if running. Could be that the timer was
     # stopped because the init script failed to set it up. We should
     # not start it then.
     output=`/usr/sbin/bmc-watchdog -s -k 2>&1`
     exitstatus=$?
     if [ "$exitstatus" != "0" ]
     then
       # failed: a mail is sent for this attempt (not shown), then retry
       sleep 3
     else
       exit 0
     fi
   done

   There were always 2-3 seconds between the cron entry and the shutdown,
   so I guess the ILOM of the Sun initiated the shutdown in response to
   the bmc-watchdog -s -k command. The timer cannot have run down, because
   I get an email for every failed attempt to reset the watchdog and
   should have gotten 3-4 of them in the 15 minutes the timer runs.

   Has anything like this been reported before?
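
   For anyone reproducing this setup, the retry loop above can be
   sketched as a small POSIX shell function. This is only an
   illustration, not the actual bmc-reset script: the name retry_cmd
   and the timestamped log line are assumptions added here, chosen so
   that each failed attempt can be matched against the syslog entries.

```shell
#!/bin/sh
# Hypothetical helper (not part of FreeIPMI): run a command up to $1 times,
# sleeping $2 seconds between failed attempts.
# Returns 0 on the first success, 1 if every attempt fails.
retry_cmd() {
    attempts=$1
    delay=$2
    shift 2
    i=1
    while [ "$i" -le "$attempts" ]; do
        # Capture stdout and stderr so the failure reason can be logged.
        if output=$("$@" 2>&1); then
            return 0
        fi
        # One timestamped line per failed attempt, written to stderr.
        echo "$(date '+%b %e %H:%M:%S') attempt $i/$attempts failed: $output" >&2
        i=$((i + 1))
        # Only sleep if another attempt follows.
        [ "$i" -le "$attempts" ] && sleep "$delay"
    done
    return 1
}

# Intended use from cron (not executed here):
#   retry_cmd 15 3 /usr/sbin/bmc-watchdog -s -k
```

   Passing the command as arguments keeps the loop testable without a
   BMC, for example with "retry_cmd 3 0 false".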

Btw, Sun first refused to develop a firmware update for the X4100M2 because
it is EOL, but due to our 5-year support warranty they are forced to do so ;-)
Now they are developing a patch for a newer machine, because they stated that
the error exists in many of the SunFire machines, and will then backport it to
the 4100.


Dipl.-Inform. Frank Steiner   Web:
Lehrstuhl f. Bioinformatik    Mail:
LMU, Amalienstr. 17           Phone: +49 89 2180-4049
80333 Muenchen, Germany       Fax:   +49 89 2180-99-4049
* You can only understand recursion once you have understood recursion. *
