Re: [Freeipmi-users] Decoding ram errors on supermicro


From: Tom Hetmer
Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
Date: Tue, 11 Dec 2018 13:33:12 +0100

Supermicro (after pointing me to web interface and SNMP...):
"Sorry, we do not have this Information at our support desk. you can request 
this via your sales channel, but it can be that you would need to sign an NDA 
for such information."


So we're on our own; I don't have a better contact, as we buy from a reseller.
Besides, they'd want an NDA for those three lines of code.


Best,
Tom Hetmer


CDN77 Operations
address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com


----- Original message -----
From: "Tom Hetmer" <address@hidden>
To: "Al Chu" <address@hidden>, address@hidden
Date: 12/11/18 12:09
Subject: Re[3]: [Freeipmi-users] Decoding ram errors on supermicro

Hey,

so that was fast - we've got an older X10SLM-F rented by a customer.


IPMI web says
201    2018/09/22 00:23:34    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
202    2018/09/29 09:31:25    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
203    2018/10/13 19:31:34    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
204    2018/10/20 01:49:38    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)


freeipmi:
ID   | Date        | Time     | Name             | Type              | State    | Event
7    | Jan-21-2016 | 15:26:16 | FANA             | Fan               | Critical | Lower Critical - going low ; Sensor Reading = 0.00 RPM ; Threshold = 600.00 RPM
8    | Jan-21-2016 | 15:26:16 | FANA             | Fan               | Critical | Lower Non-recoverable - going low ; Sensor Reading = 0.00 RPM ; Threshold = 400.00 RPM
9    | Jan-21-2016 | 15:26:25 | FANA             | Fan               | Critical | Lower Non-recoverable - going low ; Sensor Reading = 13300.00 RPM ; Threshold = 400.00 RPM
10   | Jan-21-2016 | 15:26:25 | FANA             | Fan               | Warning  | Lower Critical - going low ; Sensor Reading = 13300.00 RPM ; Threshold = 600.00 RPM
201  | Sep-22-2018 | 00:23:34 | Sensor #0        | Memory            | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
202  | Sep-29-2018 | 09:31:25 | Sensor #0        | Memory            | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
203  | Oct-13-2018 | 19:31:34 | Sensor #0        | Memory            | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
204  | Oct-20-2018 | 01:49:38 | Sensor #0        | Memory            | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
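
For monitoring scripts, the two OEM bytes can be pulled out of each event line before any decoding is attempted. A minimal Python sketch, assuming the event text has the "OEM Event Data2 code = XXh ; OEM Event Data3 code = YYh" shape seen in the listing above and that wrapped lines have been rejoined first. The name extract_oem_codes is made up for illustration and is not part of freeipmi.

```python
import re
from typing import Optional, Tuple

# Matches the OEM byte pair as it is printed in the SEL listing above.
_OEM_CODES = re.compile(
    r"OEM Event Data2 code = ([0-9A-Fa-f]{2})h\s*;\s*"
    r"OEM Event Data3 code = ([0-9A-Fa-f]{2})h"
)


def extract_oem_codes(event_line: str) -> Optional[Tuple[int, int]]:
    """Return (data2, data3) as integers, or None if the line has no OEM codes."""
    match = _OEM_CODES.search(event_line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


line = (
    "201  | Sep-22-2018 | 00:23:34 | Sensor #0 | Memory | Warning | "
    "Correctable memory error ; OEM Event Data2 code = 2Bh ; "
    "OEM Event Data3 code = 80h"
)
print(extract_oem_codes(line))  # (43, 128), i.e. (0x2B, 0x80)
```

Lines without the OEM codes (the fan events, for instance) simply return None, so they can be skipped.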


We'll ask the customer for downtime to replace it. The DIMM location above 
should be correct, as it's official data from Supermicro's own interface.
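
For comparison, the 'ver 2' formula from the ipmiutil comment quoted further down in this thread can be checked against these events. This is only a hedged Python sketch, not freeipmi or ipmiutil code: decode_dimm and its channels_per_cpu parameter are hypothetical names, and the default of 3 simply mirrors the "* 3" in that comment.

```python
def decode_dimm(data2: int, data3: int, channels_per_cpu: int = 3) -> str:
    """Decode OEM Event Data2/Data3 into a DIMM label.

    Follows the 'ver 2' comment quoted from ipmiutil:
      pair: (data2 >> 4) + 0x40 + (data3 & 0x3) * 3
      dimm: (data2 & 0xf) + 0x27
      cpu:  (data3 & 0x03) + 1
    channels_per_cpu is an assumption added here; the comment hardcodes 3.
    """
    cpu_index = data3 & 0x03
    channel = chr((data2 >> 4) + 0x40 + cpu_index * channels_per_cpu)
    slot = chr((data2 & 0x0F) + 0x27)
    return f"P{cpu_index + 1}_DIMM{channel}{slot}"


print(decode_dimm(0x2A, 0x80))  # P1_DIMMB1, the example in the ipmiutil comment
print(decode_dimm(0x2B, 0x80))  # P1_DIMMB2, events 201-204 above
print(decode_dimm(0x3A, 0x81))  # P2_DIMMF1 with the default multiplier
print(decode_dimm(0x3A, 0x81, channels_per_cpu=4))  # P2_DIMMG1
```

With the default, 2Bh/80h decodes to P1_DIMMB2, which agrees with the DIMMB2(CPU1) shown by the web interface. For the 3Ah/81h event from the X10DRH quoted below, the default gives P2_DIMMF1 while the web interface reported DIMMG1(CPU2); a multiplier of 4 gives G. That suggests the multiplier may depend on the board, but it is an inference from two data points, not something Supermicro has confirmed.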

Best,
Tom Hetmer


CDN77 Operations
address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com


----- Original message -----
From: "Tom Hetmer" <address@hidden>
To: address@hidden, "Al Chu" <address@hidden>
Date: 12/11/18 11:59
Subject: Re[2]: [Freeipmi-users] Decoding ram errors on supermicro

Hi,

it appears we have no ECC errors on the servers we directly own right now.
I can let you know when we get one though.


We rent out some machines to customers as well, so maybe there are some errors 
there; my colleague will check the report today.


I also created a ticket with Supermicro, just to see if they can confirm we're 
looking at the right code or add any official details.


Best,
Tom Hetmer


CDN77 Operations
address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com

----- Original message -----
> From: "Al Chu" <address@hidden>
> To: "Tom Hetmer" <address@hidden>, address@hidden
> Date: 12/11/18 02:28
> Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
>
> Hey Tom,
>
> Is there a specific motherboard (amongst the product IDs you mentioned
> below) you have with a DIMM error that we can test on?  To make sure I
> don't make a major mistake, I'd like to code to 1 motherboard first.
>
> Thanks,
> Al
>
>
> On Wed, 2018-12-05 at 10:48 -0800, Albert Chu wrote:
> > On Wed, 2018-12-05 at 03:38 +0100, Tom Hetmer wrote:
> > > Alright, added to github.
> > >
> > > Here's the output from bmc-info for that particular board.
> > > Product ID            : 2201
> > > [Mon Dec  3 12:08:13 2018] DMI: Supermicro X10DRH LN4/X10DRH-CLN4,
> > > BIOS 2.0 01/30/2016
> > >
> > >
> > > I guess you'll support it based on the product ID?
> >
> > Yes!  Thanks.  I'll put these in the ticket too.
> >
> > Al
> >
> > > So if there are any other (X10) boards with different product ID but
> > > the same SEL output I'll have to send it again, correct?
> > >
> > >
> > > I have all kinds of numbers on other machines,
> > > ie. 
> > > X10DRW-E => 2148
> > > X11SPi-TF => 2369
> > > X10SLL-F => 2049
> > > X10DRL-i => 2097
> > > X11DDW-NT => 2407
> > > X10SLH-F/X10SLM+-F/X10SLH-F/X10SLM+-F => 2051
> > >
> > >
> > > and so on. I think we have at least 1/4 of the boards they manufacture.
> > > X9s are under 2000, X11 seems to be 23xx. But that's maybe too much
> > > reverse engineering for you ;)
> > > I can try to ping them and ask about details, but I have no official
> > > contact with Supermicro.
> > >
> > >
> > > Best,
> > > Tom Hetmer
> > >
> > >
> > > CDN77 Operations
> > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > >
> > > ----- Original message -----
> > > > From: "Albert Chu" <address@hidden>
> > > > To: "Tom Hetmer" <address@hidden>, address@hidden
> > > > Date: 12/04/18 19:40
> > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > >
> > > > On Tue, 2018-12-04 at 11:39 +0100, Tom Hetmer wrote:
> > > > > Sure. It seems there's a similar ticket
> > > > > already: https://github.com/chu11/freeipmi-mirror/issues/19
> > > >
> > > > Ahh, if you could, update it with info from ipmitool / ipmiutil.  I
> > > > was reluctant to add support based on reverse engineering.  But if
> > > > other tools have "official" interpretations from Supermicro, I'm more
> > > > confident in the addition.
> > > >
> > > > > Yep, that's the code. ipmitool and a few others decode it too.
> > > > >
> > > > >
> > > > > We have a *lot* of Supermicros so I can help with testing if needed,
> > > > > but we don't get that many CRC errors :)
> > > >
> > > > The one thing I'll need is product ID numbers (you can get from
> > > > bmc-info) and the name of the product.  This goes into the
> > > > documentation and some of the code.
> > > >
> > > > Thanks,
> > > >
> > > > Al
> > > >
> > > > > So I guess we'd have to wait till one pops up. But I hope the 'ver 2'
> > > > > method from ipmiutil works fine.
> > > > > We used ipmitool in our monitoring before and it was accurate but
> > > > > slow, that's why I rewrote it all to use freeipmi.
> > > > >
> > > > >
> > > > > Thanks!
> > > > >
> > > > >
> > > > > Best,
> > > > > Tom Hetmer
> > > > >
> > > > >
> > > > > CDN77 Operations
> > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > >
> > > > > ----- Original message -----
> > > > > > From: "Albert Chu" <address@hidden>
> > > > > > To: "Tom Hetmer" <address@hidden>, freeipmi-users@gnu.org
> > > > > > Date: 12/03/18 21:06
> > > > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > > > >
> > > > > > Hi Tom,
> > > > > >
> > > > > > Thanks for the pointer to ipmiutil's code.  I assume you found this
> > > > > > comment:
> > > > > >
> > > > > > ---
> > > > > >       /* ver 2 method: 2A 80 = P1_DIMMB1 */
> > > > > >       /* SuperMicro says:
> > > > > >        *  pair: %c (data2 >> 4) + 0x40 + (data3 & 0x3) * 3, (='B')
> > > > > >        *  dimm: %c (data2 & 0xf) + 0x27,
> > > > > >        *  cpu:  %x (data3 & 0x03) + 1);
> > > > > >        */
> > > > > > ---
> > > > > >
> > > > > > I can definitely add it to my todo list.
> > > > > >
> > > > > > Would you mind writing up an issue on github here?
> > > > > >
> > > > > > https://github.com/chu11/freeipmi-mirror
> > > > > >
> > > > > > Al
> > > > > >
> > > > > > On Mon, 2018-12-03 at 17:55 +0100, Tom Hetmer wrote:
> > > > > > > Hi, 
> > > > > > >
> > > > > > > it'd be good if freeipmi supported decoding the supermicro ECC
> > > > > > > errors.
> > > > > > >
> > > > > > >
> > > > > > > Manufacturer: Supermicro
> > > > > > > Product Name: X10DRH LN4
> > > > > > > eg.
> > > > > > > freeipmi
> > > > > > > 1,Dec-01-2018,06:37:53,Sensor #0,Memory,Critical,Uncorrectable memory error ; OEM Event Data2 code = 3Ah ; OEM Event Data3 code = 81h
> > > > > > >
> > > > > > >
> > > > > > > web interface
> > > > > > > 1 | 12/01/2018 | 06:37:53 | Memory | Uncorrectable ECC (@DIMMG1(CPU2)) | Asserted
> > > > > > >
> > > > > > >
> > > > > > > something like this worked for me (stolen from ipmiutil)
> > > > > > >
> > > > > > >
> > > > > > > $cpu = ($data3 & 0x03) + 1;
> > > > > > >
> > > > > > >
> > > > > > > $NPAIRS = 26;
> > > > > > > $rgpairs = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
> > > > > > >
> > > > > > >
> > > > > > > $bdata = "0x".$data2.$data3;
> > > > > > > $bdata = hexdec($bdata);
> > > > > > > $pair = (($bdata & 0xF0) >> 4) - 1;
> > > > > > >
> > > > > > >
> > > > > > > if ($pair < 0) $pair = 0;
> > > > > > > if ($pair > $NPAIRS) $pair = $NPAIRS - 1;
> > > > > > >
> > > > > > >
> > > > > > > $pair = $rgpairs[$pair - 1];
> > > > > > >
> > > > > > >
> > > > > > > $dimm = $bdata & 0x0F;
> > > > > > >
> > > > > > >
> > > > > > > $dimm may be incorrect, as the original code subtracts 9, but on that
> > > > > > > board it was wrong, so I changed it to get the right result. We'll
> > > > > > > see if it keeps getting the right values.
> > > > > > >
> > > > > > > Best,
> > > > > > > Tom Hetmer
> > > > > > >
> > > > > > >
> > > > > > > CDN77 Operations
> > > > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Freeipmi-users mailing list
> > > > > > > address@hidden
> > > > > > > https://lists.gnu.org/mailman/listinfo/freeipmi-users
> > > > > >
> > > > > > -- 
> > > > > > Albert Chu
> > > > > > address@hidden
> > > > > > Computer Scientist
> > > > > > High Performance Systems Division
> > > > > > Lawrence Livermore National Laboratory