freeipmi-users

Re: [Freeipmi-users] Decoding ram errors on supermicro


From: Albert Chu
Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
Date: Wed, 12 Dec 2018 11:52:17 -0800

Hey Tom,

So are you under the impression all the motherboards in your product ID
list should support this DIMM interpretation?
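
For reference, the interpretation in question is the "ver 2" rule from the ipmiutil comment quoted further down in this thread. Below is a rough Python sketch of it, not FreeIPMI or ipmiutil code; the function name is made up for illustration. It reproduces the 2Bh/80h = DIMMB2(CPU1) result reported for the X10SLM-F. Applied literally to the 3Ah/81h event from the X10DRH, it gives pair 'F' rather than the 'G' the web interface showed, so the rule may not carry over unchanged to every board.

```python
def decode_supermicro_dimm(data2: int, data3: int) -> str:
    """Decode OEM Event Data2/Data3 per the rule in the ipmiutil comment.

    Hypothetical helper for illustration only.
    """
    cpu_index = data3 & 0x03
    # pair: (data2 >> 4) + 0x40 + (data3 & 0x3) * 3, printed as a character
    pair = chr((data2 >> 4) + 0x40 + cpu_index * 3)
    # dimm: (data2 & 0xf) + 0x27, printed as a character
    dimm = chr((data2 & 0x0F) + 0x27)
    # cpu: (data3 & 0x03) + 1
    cpu = cpu_index + 1
    return f"DIMM{pair}{dimm}(CPU{cpu})"


print(decode_supermicro_dimm(0x2B, 0x80))  # DIMMB2(CPU1)
```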

We at least have evidence of the X10DRH LN4 working based on your prior
e-mail, but I am a tad reluctant to add all motherboards, especially
non-X10 motherboards, since we do not have official information from
Supermicro.
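
To make the "one board at a time" idea concrete, here is a small Python sketch of gating the decoding on product ID. It is only an illustration under assumptions: the table and function names are hypothetical, this is not how FreeIPMI's C code is organized, and the table holds only the two product IDs that have test evidence in this thread.

```python
# Hypothetical allow-list of product IDs with test evidence in this thread.
# 2201 came from bmc-info on the X10DRH LN4/X10DRH-CLN4;
# 2065 (811h) came from the ipmi-sel debug output on the X10SLM-F.
TESTED_DIMM_DECODE_BOARDS = {
    2201: "X10DRH LN4/X10DRH-CLN4",
    2065: "X10SLM-F",
}


def dimm_decode_enabled(product_id: int) -> bool:
    """Return True only for boards whose decoding has been checked."""
    return product_id in TESTED_DIMM_DECODE_BOARDS


# Untested boards fall back to the raw "OEM Event Data2/Data3 code" output.
for pid in (2201, 0x811, 2051):
    print(pid, dimm_decode_enabled(pid))
```

Other product IDs from the list would be added only after a matching SEL event has been compared against the web interface.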

What are your thoughts?

Al

On Wed, 2018-12-12 at 12:04 +0100, Tom Hetmer wrote:
> Hi,
> 
> no luck.
> 
> 201  | Sep-22-2018 | 00:23:34 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> 202  | Sep-29-2018 | 09:31:25 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> 203  | Oct-13-2018 | 19:31:34 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> 204  | Oct-20-2018 | 01:49:38 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> 
> debug: http://termbin.com/3x02
> 10.110.32.36: [             811h] = product_id[16b]
> It seems that X10SLM-F (not X10SLM+-F) uses 2065 instead of 2051.
> You can check it in the full list:
> https://github.com/chu11/freeipmi-mirror/files/2651093/product_ids.txt
> 
> When patched with 2065:
> 201,Sep-22-2018,00:23:34,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> 202,Sep-29-2018,09:31:25,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> 203,Oct-13-2018,19:31:34,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> 204,Oct-20-2018,01:49:38,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> 
> Voila :)
> 
> 
> Best,
> Tom Hetmer
> 
> CDN77 Operations
> address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> 
> ----- Original message -----
> > From: "Albert Chu" <address@hidden>
> > To: "Tom Hetmer" <address@hidden>, address@hidden.org
> > Date: 12/12/18 02:18
> > Subject: Re: Re[4]: [Freeipmi-users] Decoding ram errors on supermicro
> >
> > Hey Tom,
> >
> > I have a branch on GitHub with (what I hope is) support for the
> > X10SLM+-F.  Could you give it a shot?  The branch is called
> > "supermicro_dimm".
> >
> > https://github.com/chu11/freeipmi-mirror/tree/supermicro_dimm
> >
> > ./autogen.sh
> > ./configure
> > make
> > ipmi-sel/ipmi-sel --interpret-oem-data
> > (add remote connection options as needed to ipmi-sel)
> >
> > If that doesn't work, could you run the following?
> >
> > ipmi-sel/ipmi-sel --debug --display=201
> >
> > (I picked 201 as one of the DIMM events in the output below.  It doesn't
> > have to be that one, just any specific DIMM SEL event.)
> >
> > Thanks,
> >
> > Al
> >
> > On Tue, 2018-12-11 at 13:33 +0100, Tom Hetmer wrote:
> > > Supermicro (after pointing me to the web interface and SNMP...):
> > > "Sorry, we do not have this Information at our support desk. you can
> > > request this via your sales channel, but it can be that you would
> > > need to sign an NDA for such information."
> > >
> > > So we're on our own; I don't have any better contact as we buy from a
> > > reseller.
> > > Besides, they'd want an NDA for those 3 lines of code.
> > >
> > > Best,
> > > Tom Hetmer
> > >
> > > CDN77 Operations
> > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > >
> > > ----- Original message -----
> > > From: "Tom Hetmer" <address@hidden>
> > > To: "Al Chu" <address@hidden>, address@hidden
> > > Date: 12/11/18 12:09
> > > Subject: Re[3]: [Freeipmi-users] Decoding ram errors on supermicro
> > >
> > > Hey,
> > >
> > > so that was fast - we've got an older X10SLM-F rented by a customer.
> > >
> > > IPMI web says
> > > 201    2018/09/22 00:23:34    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > > 202    2018/09/29 09:31:25    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > > 203    2018/10/13 19:31:34    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > > 204    2018/10/20 01:49:38    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > >
> > > freeipmi:
> > > ID   | Date        | Time     | Name             | Type             | State    | Event
> > > 7    | Jan-21-2016 | 15:26:16 | FANA             | Fan              | Critical | Lower Critical - going low ; Sensor Reading = 0.00 RPM ; Threshold = 600.00 RPM
> > > 8    | Jan-21-2016 | 15:26:16 | FANA             | Fan              | Critical | Lower Non-recoverable - going low ; Sensor Reading = 0.00 RPM ; Threshold = 400.00 RPM
> > > 9    | Jan-21-2016 | 15:26:25 | FANA             | Fan              | Critical | Lower Non-recoverable - going low ; Sensor Reading = 13300.00 RPM ; Threshold = 400.00 RPM
> > > 10   | Jan-21-2016 | 15:26:25 | FANA             | Fan              | Warning  | Lower Critical - going low ; Sensor Reading = 13300.00 RPM ; Threshold = 600.00 RPM
> > > 201  | Sep-22-2018 | 00:23:34 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > > 202  | Sep-29-2018 | 09:31:25 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > > 203  | Oct-13-2018 | 19:31:34 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > > 204  | Oct-20-2018 | 01:49:38 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > >
> > > We'll ask the customer for downtime to replace it; all should then be
> > > correct as it's official data from Supermicro's own interface.
> > >
> > > Best,
> > > Tom Hetmer
> > >
> > > CDN77 Operations
> > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > >
> > > ----- Original message -----
> > > From: "Tom Hetmer" <address@hidden>
> > > To: address@hidden, "Al Chu" <address@hidden>
> > > Date: 12/11/18 11:59
> > > Subject: Re[2]: [Freeipmi-users] Decoding ram errors on supermicro
> > >
> > > Hi,
> > >
> > > it appears we have no ECC errors on the servers we directly own right
> > > now.
> > > I can let you know when we get one though.
> > >
> > > We rent out some machines to customers as well, so maybe there are some
> > > errors there => my colleague will check the report today.
> > >
> > > I also created a ticket with Supermicro to see if they can confirm
> > > we're looking at the right code or add any official details.
> > >
> > > Best,
> > > Tom Hetmer
> > >
> > > CDN77 Operations
> > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > >
> > > ----- Original message -----
> > > > From: "Al Chu" <address@hidden>
> > > > To: "Tom Hetmer" <address@hidden>, freeipmi-users@gnu.org
> > > > Date: 12/11/18 02:28
> > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > >
> > > > Hey Tom,
> > > >
> > > > Is there a specific motherboard (amongst the product IDs you mentioned
> > > > below) with a DIMM error that we can test on?  To make sure I
> > > > don't make a major mistake, I'd like to code to one motherboard first.
> > > >
> > > > Thanks,
> > > > Al
> > > >
> > > >
> > > > On Wed, 2018-12-05 at 10:48 -0800, Albert Chu wrote:
> > > > > On Wed, 2018-12-05 at 03:38 +0100, Tom Hetmer wrote:
> > > > > > Alright, added to github.
> > > > > >
> > > > > > Here's the output from bmc-info for that particular board.
> > > > > > Product ID            : 2201
> > > > > > [Mon Dec  3 12:08:13 2018] DMI: Supermicro X10DRH LN4/X10DRH-CLN4, BIOS 2.0 01/30/2016
> > > > > >
> > > > > >
> > > > > > I guess you'll support it based on the product ID?
> > > > >
> > > > > Yes!  Thanks.  I'll put these in the ticket too.
> > > > >
> > > > > Al
> > > > >
> > > > > > So if there are any other (X10) boards with a different product ID
> > > > > > but the same SEL output, I'll have to send them again, correct?
> > > > > >
> > > > > >
> > > > > > I have all kinds of numbers on other machines,
> > > > > > e.g.
> > > > > > X10DRW-E => 2148
> > > > > > X11SPi-TF => 2369
> > > > > > X10SLL-F => 2049
> > > > > > X10DRL-i => 2097
> > > > > > X11DDW-NT => 2407
> > > > > > X10SLH-F/X10SLM+-F/X10SLH-F/X10SLM+-F => 2051
> > > > > >
> > > > > >
> > > > > > and so on. I think we have at least 1/4 of the boards they
> > > > > > manufacture.
> > > > > > X9s are under 2000, X11 seems to be 23xx. But that's maybe too much
> > > > > > reverse engineering for you ;)
> > > > > > I can try to ping them and ask about details, but I have no official
> > > > > > contact with Supermicro.
> > > > > >
> > > > > >
> > > > > > Best,
> > > > > > Tom Hetmer
> > > > > >
> > > > > >
> > > > > > CDN77 Operations
> > > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > > >
> > > > > > ----- Original message -----
> > > > > > > From: "Albert Chu" <address@hidden>
> > > > > > > To: "Tom Hetmer" <address@hidden>, freeipmi-users@gnu.org
> > > > > > > Date: 12/04/18 19:40
> > > > > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > > > > >
> > > > > > > On Tue, 2018-12-04 at 11:39 +0100, Tom Hetmer wrote:
> > > > > > > > Sure. It seems there's a similar ticket already:
> > > > > > > > https://github.com/chu11/freeipmi-mirror/issues/19
> > > > > > >
> > > > > > > Ahh, if you could, update it with info from ipmitool / ipmiutil.
> > > > > > > I was reluctant to add support based on reverse engineering.  But if
> > > > > > > other tools have "official" interpretations from Supermicro, I'm more
> > > > > > > confident in the addition.
> > > > > > >
> > > > > > > > Yep, that's the code. ipmitool and a few others decode it too.
> > > > > > > >
> > > > > > > >
> > > > > > > > We have a *lot* of Supermicros so I can help with testing if
> > > > > > > > needed - but we don't get that many CRC errors though :)
> > > > > > >
> > > > > > > The one thing I'll need is product ID numbers (you can get them from
> > > > > > > bmc-info) and the name of the product.  This goes into the
> > > > > > > documentation and some of the code.
> > > > > > >
> > > > > > > Thanks,
> > > > > > >
> > > > > > > Al
> > > > > > >
> > > > > > > > So I guess we'd have to wait till one pops up. But I hope the
> > > > > > > > 'ver 2' method from ipmiutil works fine.
> > > > > > > > We used ipmitool in our monitoring before and it was accurate but
> > > > > > > > slow; that's why I rewrote it all to use freeipmi.
> > > > > > > >
> > > > > > > >
> > > > > > > > Thanks!
> > > > > > > >
> > > > > > > >
> > > > > > > > Best,
> > > > > > > > Tom Hetmer
> > > > > > > >
> > > > > > > >
> > > > > > > > CDN77 Operations
> > > > > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > > > > >
> > > > > > > > ----- Original message -----
> > > > > > > > > From: "Albert Chu" <address@hidden>
> > > > > > > > > To: "Tom Hetmer" <address@hidden>, freeipmi-users@gnu.org
> > > > > > > > > Date: 12/03/18 21:06
> > > > > > > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > > > > > > >
> > > > > > > > > Hi Tom,
> > > > > > > > >
> > > > > > > > > Thanks for the pointer to ipmiutil's code.  I assume you found
> > > > > > > > > this comment:
> > > > > > > > >
> > > > > > > > > ---
> > > > > > > > >       /* ver 2 method: 2A 80 = P1_DIMMB1 */
> > > > > > > > >           /* SuperMicro says:
> > > > > > > > >            *  pair: %c (data2 >> 4) + 0x40 + (data3 & 0x3) * 3, (='B')
> > > > > > > > >            *  dimm: %c (data2 & 0xf) + 0x27,
> > > > > > > > >            *  cpu:  %x (data3 & 0x03) + 1);
> > > > > > > > >            */
> > > > > > > > > ---
> > > > > > > > >
> > > > > > > > > I can definitely add it to my todo list.
> > > > > > > > >
> > > > > > > > > Would you mind writing up an issue on github here?
> > > > > > > > >
> > > > > > > > > https://github.com/chu11/freeipmi-mirror
> > > > > > > > >
> > > > > > > > > Al
> > > > > > > > >
> > > > > > > > > On Mon, 2018-12-03 at 17:55 +0100, Tom Hetmer wrote:
> > > > > > > > > > Hi, 
> > > > > > > > > >
> > > > > > > > > > it'd be good if freeipmi supported decoding the Supermicro ECC
> > > > > > > > > > errors.
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > Manufacturer: Supermicro
> > > > > > > > > > Product Name: X10DRH LN4
> > > > > > > > > > e.g.
> > > > > > > > > > freeipmi
> > > > > > > > > > 1,Dec-01-2018,06:37:53,Sensor #0,Memory,Critical,Uncorrectable memory error ; OEM Event Data2 code = 3Ah ; OEM Event Data3 code = 81h
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > web interface
> > > > > > > > > > 1 | 12/01/2018 | 06:37:53 | Memory | Uncorrectable ECC (@DIMMG1(CPU2)) | Asserted
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > something like this worked for me (stolen from ipmiutil)
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > $cpu = ($data3 & 0x03) + 1;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > $NPAIRS = 26;
> > > > > > > > > > $rgpairs = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > $bdata = "0x".$data2.$data3;
> > > > > > > > > > $bdata = hexdec($bdata);
> > > > > > > > > > $pair = (($bdata & 0xF0) >> 4) - 1;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > if ($pair < 0) $pair = 0;
> > > > > > > > > > if ($pair > $NPAIRS) $pair = $NPAIRS - 1;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > $pair = $rgpairs[$pair - 1];
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > $dimm = $bdata & 0x0F;
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > $dimm may be incorrect, as the original code subtracts 9, but on
> > > > > > > > > > that board it was wrong, so I changed it to get the right result -
> > > > > > > > > > we'll see if it keeps getting the right values.
> > > > > > > > > >
> > > > > > > > > > Best,
> > > > > > > > > > Tom Hetmer
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > CDN77 Operations
> > > > > > > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > > > > > > >
> > > > > > > > > > _______________________________________________
> > > > > > > > > > Freeipmi-users mailing list
> > > > > > > > > > address@hidden
> > > > > > > > > > https://lists.gnu.org/mailman/listinfo/freeipmi-users
> > > > > > > > >
> > > > > > > > > -- 
> > > > > > > > > Albert Chu
> > > > > > > > > address@hidden
> > > > > > > > > Computer Scientist
> > > > > > > > > High Performance Systems Division
> > > > > > > > > Lawrence Livermore National Laboratory
-- 
Albert Chu
address@hidden
Computer Scientist
High Performance Systems Division
Lawrence Livermore National Laboratory