freeipmi-users

Re: [Freeipmi-users] Decoding ram errors on supermicro


From: Tom Hetmer
Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
Date: Thu, 13 Dec 2018 12:21:56 +0100

I'm not sure whether this or the 'version 1' method applies to X9s.
I've seen it work for all X10s at least, so I think I'd limit it to X10
boards.

Best,
Tom Hetmer


CDN77 Operations
address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com

----- Original message -----
> From: "Albert Chu" <address@hidden>
> To: "Tom Hetmer" <address@hidden>, address@hidden
> Date: 12/12/18 20:52
> Subject: Re: Re[6]: [Freeipmi-users] Decoding ram errors on supermicro
> 
> Hey Tom,
> 
> So are you under the impression all the motherboards in your product ID
> list should support this DIMM interpretation?
> 
> We at least have evidence of the X10DRH LN4 working, based on your prior
> e-mail, but I am a tad reluctant to add all motherboards, especially
> non-X10 motherboards, since we do not have official information from
> Supermicro.
> 
> What are your thoughts?
> 
> Al
> 
> On Wed, 2018-12-12 at 12:04 +0100, Tom Hetmer wrote:
> > Hi,
> > 
> > no luck.
> > 
> > 201  | Sep-22-2018 | 00:23:34 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > 202  | Sep-29-2018 | 09:31:25 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > 203  | Oct-13-2018 | 19:31:34 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > 204  | Oct-20-2018 | 01:49:38 | Sensor #0        | Memory           | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > 
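Undecoded events like the ones above still carry the two raw OEM bytes in their text. As a minimal sketch, assuming the ipmi-sel line format shown in this message, the bytes could be pulled out with a regular expression before any decoding is attempted. The helper name `extract_oem_codes` is hypothetical and not part of FreeIPMI.

```python
import re

# Matches the raw OEM bytes that ipmi-sel prints when it cannot interpret them,
# e.g. "OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h".
OEM_CODES = re.compile(
    r"OEM Event Data2 code = ([0-9A-Fa-f]{2})h ; "
    r"OEM Event Data3 code = ([0-9A-Fa-f]{2})h"
)


def extract_oem_codes(sel_line):
    """Return (data2, data3) as integers, or None if no raw OEM codes are present."""
    match = OEM_CODES.search(sel_line)
    if match is None:
        return None
    return int(match.group(1), 16), int(match.group(2), 16)


line = (
    "201  | Sep-22-2018 | 00:23:34 | Sensor #0        | Memory           "
    "| Correctable memory error ; OEM Event Data2 code = 2Bh ; "
    "OEM Event Data3 code = 80h"
)
print(extract_oem_codes(line))  # (43, 128), i.e. (0x2B, 0x80)
```

Returning None for lines without raw codes lets a monitoring script skip events that ipmi-sel has already interpreted.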
> > debug: http://termbin.com/3x02
> > 10.110.32.36: [             811h] = product_id[16b]
> > It seems that the X10SLM-F (not the X10SLM+-F) uses product ID 2065 (811h) instead of 2051.
> > You can check it in the full list:
> > https://github.com/chu11/freeipmi-mirror/files/2651093/product_ids.txt
> > 
> > When patched with 2065:
> > 201,Sep-22-2018,00:23:34,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> > 202,Sep-29-2018,09:31:25,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> > 203,Oct-13-2018,19:31:34,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> > 204,Oct-20-2018,01:49:38,Sensor #0,Memory,Warning,Correctable memory error ; DIMMB2(CPU1)
> > 
> > Voila :)
> > 
> > 
> > Best,
> > Tom Hetmer
> > 
> > CDN77 Operations
> > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > 
> > ----- Original message -----
> > > From: "Albert Chu" <address@hidden>
> > > To: "Tom Hetmer" <address@hidden>, address@hidden
> > > Date: 12/12/18 02:18
> > > Subject: Re: Re[4]: [Freeipmi-users] Decoding ram errors on supermicro
> > >
> > > Hey Tom,
> > >
> > > I have a branch on GitHub with (what I hope is) support for the
> > > X10SLM+-F.  Could you give it a shot?  The branch is called
> > > "supermicro_dimm".
> > >
> > > https://github.com/chu11/freeipmi-mirror/tree/supermicro_dimm
> > >
> > > ./autogen.sh
> > > ./configure
> > > make
> > > ipmi-sel/ipmi-sel --interpret-oem-data
> > > (add remote connection options as needed to ipmi-sel)
> > >
> > > If that doesn't work, could you do the following:
> > >
> > > ipmi-sel/ipmi-sel --debug --display=201
> > >
> > > (I picked 201 because it is one of the DIMM events in the output below.
> > > It doesn't have to be that one; any specific DIMM SEL event will do.)
> > >
> > > Thanks,
> > >
> > > Al
> > >
> > > On Tue, 2018-12-11 at 13:33 +0100, Tom Hetmer wrote:
> > > > Supermicro (after pointing me to the web interface and SNMP...):
> > > > "Sorry, we do not have this Information at our support desk. you can
> > > > request this via your sales channel, but it can be that you would
> > > > need to sign an NDA for such information."
> > > >
> > > > So we're on our own; I don't have any better contact, as we buy from a
> > > > reseller.
> > > > Besides, they'd want an NDA for those 3 lines of code.
> > > >
> > > > Best,
> > > > Tom Hetmer
> > > >
> > > > CDN77 Operations
> > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > >
> > > > ----- Original message -----
> > > > From: "Tom Hetmer" <address@hidden>
> > > > To: "Al Chu" <address@hidden>, address@hidden
> > > > Date: 12/11/18 12:09
> > > > Subject: Re[3]: [Freeipmi-users] Decoding ram errors on supermicro
> > > >
> > > > Hey,
> > > >
> > > > so that was fast - we've got an older X10SLM-F rented by a customer.
> > > >
> > > > IPMI web says
> > > > 201    2018/09/22 00:23:34    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > > > 202    2018/09/29 09:31:25    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > > > 203    2018/10/13 19:31:34    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > > > 204    2018/10/20 01:49:38    OEM    Memory    Correctable Memory ECC @ DIMMB2(CPU1)
> > > >
> > > > freeipmi:
> > > > ID   | Date        | Time     | Name             | Type             | State    | Event
> > > > 7    | Jan-21-2016 | 15:26:16 | FANA             | Fan              | Critical | Lower Critical - going low ; Sensor Reading = 0.00 RPM ; Threshold = 600.00 RPM
> > > > 8    | Jan-21-2016 | 15:26:16 | FANA             | Fan              | Critical | Lower Non-recoverable - going low ; Sensor Reading = 0.00 RPM ; Threshold = 400.00 RPM
> > > > 9    | Jan-21-2016 | 15:26:25 | FANA             | Fan              | Critical | Lower Non-recoverable - going low ; Sensor Reading = 13300.00 RPM ; Threshold = 400.00 RPM
> > > > 10   | Jan-21-2016 | 15:26:25 | FANA             | Fan              | Warning  | Lower Critical - going low ; Sensor Reading = 13300.00 RPM ; Threshold = 600.00 RPM
> > > > 201  | Sep-22-2018 | 00:23:34 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > > > 202  | Sep-29-2018 | 09:31:25 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > > > 203  | Oct-13-2018 | 19:31:34 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > > > 204  | Oct-20-2018 | 01:49:38 | Sensor #0        | Memory           | Warning  | Correctable memory error ; OEM Event Data2 code = 2Bh ; OEM Event Data3 code = 80h
> > > >
> > > > We'll ask the customer for downtime to replace it; all should then be
> > > > correct, as it's official data from Supermicro's own interface.
> > > >
> > > > Best,
> > > > Tom Hetmer
> > > >
> > > > CDN77 Operations
> > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > >
> > > > ----- Original message -----
> > > > From: "Tom Hetmer" <address@hidden>
> > > > To: address@hidden, "Al Chu" <address@hidden>
> > > > Date: 12/11/18 11:59
> > > > Subject: Re[2]: [Freeipmi-users] Decoding ram errors on supermicro
> > > >
> > > > Hi,
> > > >
> > > > it appears we have no ECC errors on the servers we directly own right
> > > > now.
> > > > I can let you know when we get one though.
> > > >
> > > > We rent out some machines to customers as well, so maybe there are some
> > > > errors there => my colleague will check the report today.
> > > >
> > > > I also created a ticket with Supermicro, just to see if they can confirm
> > > > we're looking at the right code or add any official details.
> > > >
> > > > Best,
> > > > Tom Hetmer
> > > >
> > > > CDN77 Operations
> > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > >
> > > > ----- Original message -----
> > > > > From: "Al Chu" <address@hidden>
> > > > > To: "Tom Hetmer" <address@hidden>, freeipmi-users@gnu.org
> > > > > Date: 12/11/18 02:28
> > > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > > >
> > > > > Hey Tom,
> > > > >
> > > > > Is there a specific motherboard (amongst the product IDs you
> > > > > mentioned below) you have with a DIMM error that we can test on?
> > > > > To make sure I don't make a major mistake, I'd like to code to 1
> > > > > motherboard first.
> > > > >
> > > > > Thanks,
> > > > > Al
> > > > >
> > > > >
> > > > > On Wed, 2018-12-05 at 10:48 -0800, Albert Chu wrote:
> > > > > > On Wed, 2018-12-05 at 03:38 +0100, Tom Hetmer wrote:
> > > > > > > Alright, added to github.
> > > > > > >
> > > > > > > Here's the output from bmc-info for that particular board.
> > > > > > > Product ID            : 2201
> > > > > > > [Mon Dec  3 12:08:13 2018] DMI: Supermicro X10DRH LN4/X10DRH-CLN4, BIOS 2.0 01/30/2016
> > > > > > >
> > > > > > >
> > > > > > > I guess you'll support it based on the product ID?
> > > > > >
> > > > > > Yes!  Thanks.  I'll put these in the ticket too.
> > > > > >
> > > > > > Al
> > > > > >
> > > > > > > So if there are any other (X10) boards with a different product ID
> > > > > > > but the same SEL output, I'll have to send it again, correct?
> > > > > > >
> > > > > > >
> > > > > > > I have all kinds of numbers on other machines, e.g.
> > > > > > > X10DRW-E => 2148
> > > > > > > X11SPi-TF => 2369
> > > > > > > X10SLL-F => 2049
> > > > > > > X10DRL-i => 2097
> > > > > > > X11DDW-NT => 2407
> > > > > > > X10SLH-F/X10SLM+-F/X10SLH-F/X10SLM+-F => 2051
> > > > > > >
> > > > > > >
> > > > > > > and so on. I think we have at least 1/4 of the boards they
> > > > > > > manufacture.
> > > > > > > X9s are under 2000, X11s seem to be 23xx. But that's maybe too much
> > > > > > > reverse engineering for you ;)
> > > > > > > I can try to ping them and ask about details, but I have no official
> > > > > > > contact with Supermicro.
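
Since support is keyed on the product ID reported by bmc-info, the gating discussed in this thread could look roughly like the sketch below. It is an illustration only: the IDs and board names are the ones quoted in this thread, the "confirmed" set reflects only the two boards whose decoded output was compared against the web interface here (2201 and 2065), and the function and status names are hypothetical rather than FreeIPMI's.

```python
# X10 product IDs as reported via bmc-info in this thread (not an official list).
X10_PRODUCT_IDS = {
    2201: "X10DRH LN4/X10DRH-CLN4",
    2065: "X10SLM-F",
    2051: "X10SLH-F/X10SLM+-F",
    2049: "X10SLL-F",
    2097: "X10DRL-i",
    2148: "X10DRW-E",
}

# Boards whose decoded DIMM location matched the web interface in this thread.
CONFIRMED_IDS = {2201, 2065}


def dimm_decoding_status(product_id):
    """Classify a product ID as 'confirmed', 'x10-untested', or 'unsupported'."""
    if product_id in CONFIRMED_IDS:
        return "confirmed"
    if product_id in X10_PRODUCT_IDS:
        return "x10-untested"
    return "unsupported"


for pid in (2201, 2051, 2369):
    print(pid, dimm_decoding_status(pid))
```

Keeping an explicit table, instead of relying on numeric ranges such as "X9s are under 2000", avoids decoding events on boards nobody has checked.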
> > > > > > >
> > > > > > >
> > > > > > > Best,
> > > > > > > Tom Hetmer
> > > > > > >
> > > > > > >
> > > > > > > CDN77 Operations
> > > > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > > > >
> > > > > > > ----- Original message -----
> > > > > > > > From: "Albert Chu" <address@hidden>
> > > > > > > > To: "Tom Hetmer" <address@hidden>, freeipmi-users@gnu.org
> > > > > > > > Date: 12/04/18 19:40
> > > > > > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > > > > > >
> > > > > > > > On Tue, 2018-12-04 at 11:39 +0100, Tom Hetmer wrote:
> > > > > > > > > Sure. It seems there's a similar ticket already:
> > > > > > > > > https://github.com/chu11/freeipmi-mirror/issues/19
> > > > > > > >
> > > > > > > > Ahh, if you could, update it with info from ipmitool / ipmiutil.
> > > > > > > > I was reluctant to add support based on reverse engineering.
> > > > > > > > But if other tools have "official" interpretations from Supermicro,
> > > > > > > > I'm more confident in the addition.
> > > > > > > >
> > > > > > > > > Yep, that's the code. ipmitool and a few others decode it too.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > We have a *lot* of Supermicros so I can help with testing if
> > > > > > > > > needed - but we don't get that many CRC errors though :)
> > > > > > > >
> > > > > > > > The one thing I'll need is product ID numbers (you can get them from
> > > > > > > > bmc-info) and the name of the product.  This goes into the
> > > > > > > > documentation and some of the code.
> > > > > > > >
> > > > > > > > Thanks,
> > > > > > > >
> > > > > > > > Al
> > > > > > > >
> > > > > > > > > So I guess we'd have to wait till one pops up. But I hope the
> > > > > > > > > 'ver 2' method from ipmiutil works fine.
> > > > > > > > > We used ipmitool in our monitoring before and it was accurate
> > > > > > > > > but slow, that's why I rewrote it all to use freeipmi.
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Thanks!
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > Best,
> > > > > > > > > Tom Hetmer
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > CDN77 Operations
> > > > > > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > > > > > >
> > > > > > > > > ----- Original message -----
> > > > > > > > > > From: "Albert Chu" <address@hidden>
> > > > > > > > > > To: "Tom Hetmer" <address@hidden>, freeipmi-users@gnu.org
> > > > > > > > > > Date: 12/03/18 21:06
> > > > > > > > > > Subject: Re: [Freeipmi-users] Decoding ram errors on supermicro
> > > > > > > > > >
> > > > > > > > > > Hi Tom,
> > > > > > > > > >
> > > > > > > > > > Thanks for the pointer to ipmiutil's code.  I assume you found
> > > > > > > > > > this comment:
> > > > > > > > > >
> > > > > > > > > > ---
> > > > > > > > > >       /* ver 2 method: 2A 80 = P1_DIMMB1 */
> > > > > > > > > >       /* SuperMicro says:
> > > > > > > > > >        *  pair: %c (data2 >> 4) + 0x40 + (data3 & 0x3) * 3, (='B')
> > > > > > > > > >        *  dimm: %c (data2 & 0xf) + 0x27,
> > > > > > > > > >        *  cpu:  %x (data3 & 0x03) + 1);
> > > > > > > > > >        */
> > > > > > > > > > ---
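
To make the arithmetic in the quoted comment concrete, here is a minimal Python sketch that applies the three expressions literally. It is not ipmiutil's or FreeIPMI's actual code: the function name is hypothetical, and the "DIMMxN(CPUn)" output format is an assumption chosen to match the web-interface strings quoted in this thread.

```python
def decode_supermicro_dimm(data2, data3):
    """Apply the quoted 'ver 2' expressions to OEM Event Data2/Data3 bytes."""
    cpu_index = data3 & 0x03
    # pair: (data2 >> 4) + 0x40 + (data3 & 0x3) * 3, printed as a character
    pair = chr((data2 >> 4) + 0x40 + cpu_index * 3)
    # dimm: (data2 & 0xf) + 0x27, printed as a character
    dimm = chr((data2 & 0x0F) + 0x27)
    # cpu: (data3 & 0x03) + 1
    cpu = cpu_index + 1
    return f"DIMM{pair}{dimm}(CPU{cpu})"


print(decode_supermicro_dimm(0x2A, 0x80))  # DIMMB1(CPU1), the comment's own example
print(decode_supermicro_dimm(0x2B, 0x80))  # DIMMB2(CPU1)
```

The second call uses the 2Bh/80h bytes from the X10SLM-F events elsewhere in this thread and agrees with the web interface there. For the 3Ah/81h event on the X10DRH LN4, the same literal reading gives DIMMF1(CPU2) while the web interface reported DIMMG1(CPU2), so the per-CPU multiplier is probably board-dependent and the CPU2 case should be treated as unverified.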
> > > > > > > > > >
> > > > > > > > > > I can definitely add it to my todo list.
> > > > > > > > > >
> > > > > > > > > > Would you mind writing up an issue on github here?
> > > > > > > > > >
> > > > > > > > > > https://github.com/chu11/freeipmi-mirror
> > > > > > > > > >
> > > > > > > > > > Al
> > > > > > > > > >
> > > > > > > > > > On Mon, 2018-12-03 at 17:55 +0100, Tom Hetmer wrote:
> > > > > > > > > > > Hi, 
> > > > > > > > > > >
> > > > > > > > > > > it'd be good if freeipmi supported decoding the Supermicro
> > > > > > > > > > > ECC errors.
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > Manufacturer: Supermicro
> > > > > > > > > > > Product Name: X10DRH LN4
> > > > > > > > > > > e.g.
> > > > > > > > > > > freeipmi
> > > > > > > > > > > 1,Dec-01-2018,06:37:53,Sensor #0,Memory,Critical,Uncorrectable memory error ; OEM Event Data2 code = 3Ah ; OEM Event Data3 code = 81h
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > web interface
> > > > > > > > > > > 1 | 12/01/2018 | 06:37:53 | Memory | Uncorrectable ECC (@DIMMG1(CPU2)) | Asserted
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > something like this worked for me (stolen from ipmiutil)
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > $cpu = ($data3 & 0x03) + 1;
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > $NPAIRS = 26;
> > > > > > > > > > > $rgpairs = "ABCDEFGHIJKLMNOPQRSTUVWXYZ";
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > $bdata = "0x".$data2.$data3;
> > > > > > > > > > > $bdata = hexdec($bdata);
> > > > > > > > > > > $pair = (($bdata & 0xF0) >> 4) - 1;
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > if ($pair < 0) $pair = 0;
> > > > > > > > > > > if ($pair > $NPAIRS) $pair = $NPAIRS - 1;
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > $pair = $rgpairs[$pair - 1];
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > $dimm = $bdata & 0x0F;
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > $dimm may be incorrect, as the original code subtracts 9, but on
> > > > > > > > > > > that board it was wrong, so I changed it to get the right result -
> > > > > > > > > > > we'll see if it keeps getting the right values.
> > > > > > > > > > >
> > > > > > > > > > > Best,
> > > > > > > > > > > Tom Hetmer
> > > > > > > > > > >
> > > > > > > > > > >
> > > > > > > > > > > CDN77 Operations
> > > > > > > > > > > address@hidden / +44 (0) 20 3514 2399 / www.cdn77.com
> > > > > > > > > > >
> > > > > > > > > > > _______________________________________________
> > > > > > > > > > > Freeipmi-users mailing list
> > > > > > > > > > > address@hidden
> > > > > > > > > > > https://lists.gnu.org/mailman/listinfo/freeipmi-users
> > > > > > > > > >
> > > > > > > > > > -- 
> > > > > > > > > > Albert Chu
> > > > > > > > > > address@hidden
> > > > > > > > > > Computer Scientist
> > > > > > > > > > High Performance Systems Division
> > > > > > > > > > Lawrence Livermore National Laboratory
> > > > > > > > >
> > > > > > > > > _______________________________________________
> > > > > > > > > Freeipmi-users mailing list
> > > > > > > > > address@hidden
> > > > > > > > > https://lists.gnu.org/mailman/listinfo/freeipmi-users
> > > > > > > >
> > > > > > > > -- 
> > > > > > > > Albert Chu
> > > > > > > > address@hidden
> > > > > > > > Computer Scientist
> > > > > > > > High Performance Systems Division
> > > > > > > > Lawrence Livermore National Laboratory
> > > > > > >
> > > > > > > _______________________________________________
> > > > > > > Freeipmi-users mailing list
> > > > > > > address@hidden
> > > > > > > https://lists.gnu.org/mailman/listinfo/freeipmi-users
> > > --
> > > Albert Chu
> > > address@hidden
> > > Computer Scientist
> > > High Performance Systems Division
> > > Lawrence Livermore National Laboratory
> -- 
> Albert Chu
> address@hidden
> Computer Scientist
> High Performance Systems Division
> Lawrence Livermore National Laboratory


