Hi, yesterday one of our systems crashed, and judging by the logs, the problem seems to be the following:
Jul 25 17:49:15 acacia kernel: mca: CPU 0 SAL log contains MCA error record
Decoding the corresponding record in /var/log/salinfo/raw, I get the following data:
[root acacia raw]# salinfo_decode acacia-2007-07-25-17\:46\:18-cpu0-mca0
oemdata_fd[0](6) oemdata_fd[1](5)
BEGIN HARDWARE ERROR STATE from acacia-2007-07-25-17:46:18-cpu0-mca0
Err Record ID: 3 SAL Rev: 0.02
Time: 2007-07-25 17:46:18 Severity 0
Processor Device Error Info Section
UNCORRECTED PROCESSOR ERROR: Bus Check
processor lid : 0x0000000000000000
cpu: A nasid: 0x0
processor state parameter: 0x20000000fff21120
rz [2]=0 rendezvous request unsuccessful
ra [3]=0 rendezvous was not attempted
mn [5]=1 min state registered with PAL
sy [6]=0 storage integrity not synchronized
co [7]=0 not continuable
ci [8]=1 machine check is isolated
mi [12]=1 more info available
pi [13]=0 ip logged is not precise
pm [14]=0 min state is not precise
dy [15]=0 processor dynamic state is not valid
rs [17]=1 rse is valid
cm [18]=0 fault has not been corrected
cr [20]=1 control registers are valid
pc [21]=1 performance counters are valid
dr [22]=1 debug registers are valid
tr [23]=1 translation registers are valid
rr [24]=1 region registers are valid
ar [25]=1 application registers are valid
br [26]=1 branch registers are valid
pr [27]=1 predicate registers are valid
fp [28]=1 floating point registers are valid
b1 [29]=1 bank one general registers are valid
b0 [30]=1 bank zero general registers are valid
gr [31]=1 general registers are valid
bc [61]=1 bus check
PAL recovery status:
error was isolated and contained, continuable if sw can recover
processor error map : 0x0000000001000000
processor code id: 0
logical thread id: 0
processor bus level 1 error
BUS Check Info [0]
Transaction size: 1, External Bus Error:, Type: 10 (I/O space write), Severity: 0, Hierarchy: 0, Status information: 3 (Hard fail)
target identifier : 0x0003fffffc0f23c8
CPUID Regs: 0x49656e69756e6547 0x6c65746e 0 0x1f020104
Processor static data:
xip : 0xe00000000480b380 xfs : 0x0000000000000000
xpsr : 0x0000121008026018
[5:0]=24 User mask
be [1]=0 little endian
up [2]=0 user performance monitor disabled
ac [3]=1 alignment check enabled
mfl [4]=1 lower (f2 .. f31) floating-point registers written
mfh [5]=0 upper (f32 .. f127) floating-point registers not written
[23:0]=155672 System mask
ic [13]=1 interrupt collection enabled
i [14]=1 interrupts enabled
pk [15]=0 protection key disabled
dt [17]=1 data address translation enabled
dfl [18]=0 disabled floating-point low register not set
dfh [19]=0 disabled floating-point high register not set
sp [20]=0 secure performance monitor disabled
pp [21]=0 privileged performance monitor disabled
di [22]=0 disable instruction set transition not set
si [23]=0 secure interval timer disabled
db [24]=0 debug breakpoint fault disabled
lp [25]=0 lower privilege transfer trap disabled
tb [26]=0 taken branch trap disabled
rt [27]=1 register stack translation enabled
cpl [33:32]=0 current privilege level
is [34]=0 IA64 instruction set
mc [35]=0 machine check abort enabled
it [36]=1 instruction address translation enabled
id [37]=0 instruction debug fault enabled
da [38]=0 enable data access and dirty-bit faults
dd [39]=0 data debug fault enabled
ss [40]=0 single step disabled
ri [42:41]=1 restart instruction
ed [43]=0 exception deferral disabled
bn [44]=1 bank 1
ia [45]=0 instruction access-bit faults enabled
iip : 0xe0000000046d21e0 iipa : 0xe0000000046d21e0
ipsr : 0x0000121008026018
[5:0]=24 User mask
be [1]=0 little endian
up [2]=0 user performance monitor disabled
ac [3]=1 alignment check enabled
mfl [4]=1 lower (f2 .. f31) floating-point registers written
mfh [5]=0 upper (f32 .. f127) floating-point registers not written
[23:0]=155672 System mask
ic [13]=1 interrupt collection enabled
i [14]=1 interrupts enabled
pk [15]=0 protection key disabled
dt [17]=1 data address translation enabled
dfl [18]=0 disabled floating-point low register not set
dfh [19]=0 disabled floating-point high register not set
sp [20]=0 secure performance monitor disabled
pp [21]=0 privileged performance monitor disabled
di [22]=0 disable instruction set transition not set
si [23]=0 secure interval timer disabled
db [24]=0 debug breakpoint fault disabled
lp [25]=0 lower privilege transfer trap disabled
tb [26]=0 taken branch trap disabled
rt [27]=1 register stack translation enabled
cpl [33:32]=0 current privilege level
is [34]=0 IA64 instruction set
mc [35]=0 machine check abort enabled
it [36]=1 instruction address translation enabled
id [37]=0 instruction debug fault enabled
da [38]=0 enable data access and dirty-bit faults
dd [39]=0 data debug fault enabled
ss [40]=0 single step disabled
ri [42:41]=1 restart instruction
ed [43]=0 exception deferral disabled
bn [44]=1 bank 1
ia [45]=0 instruction access-bit faults enabled
isr : 0x00000a0200000000
[15:0]=0 Code
[23:16]=0 Vector
w [33]=1 write exception
ei [42:41]=1 excepting instruction
ed [43]=1 exception deferal
pr : 0xa4009650a69a6a29
p0, p3, p5, p9, p11, p13-14, p17, p19-20, p23, p25-26,
p29, p31, p36, p38, p41-42, p44, p47, p58, p61, p63
cr0 (dcr) : 0x0000000000007e04 cr1 (itm) : 0x0000017c35918c0b
cr2 (iva) : 0xe000000004400000 cr8 (pta) : 0x1ffc0000000000c9
cr16 (ipsr) : 0x0000121008026018 cr17 (isr) : 0x00000a0200000000
cr19 (iip) : 0xe0000000046d21e0 cr20 (ifa) : 0xc003fffffc0f23c8
cr21 (itir) : 0x0000000000000660 cr22 (iipa) : 0xe0000000046d21e0
cr23 (ifs) : 0x8000000000000389 cr24 (iim) : 0x0000000000045000
cr25 (iha) : 0xbffc0000000002b0 cr64 (lid) : 0x0000000000000000
cr66 (tpr) : 0x0000000000000000 cr68 (irr0) : 0x0480000000000000
cr69 (irr1) : 0x0000000000000000 cr70 (irr2) : 0x0000000000000000
cr71 (irr3) : 0x0000800000000000 cr72 (itv) : 0x00000000000000ef
cr73 (pmv) : 0x00000000000000ee cr74 (cmcv) : 0x000000000000001f
cr80 (lrr0) : 0x0000000000010000 cr81 (lrr1) : 0x0000000000010000
ar16 (rsc) : 0x0000000000000003 ar17 (bsp) : 0xe00000010f0893a8
ar18 (bspstore) : 0xe00000010f089150 ar19 (rnat) : 0x0000000000000000
ar32 (ccv) : 0x0000000000000001 ar36 (unat) : 0x0000000000000000
ar40 (fpsr) : 0x0009804c0270033f ar64 (pfs) : 0x0000000000000389
ar65 (lc) : 0x0000000000000000 ar66 (ec) : 0x0000000000000000
r0 : 0x0000000000000000 0xe000000004cbbd00
0x0000000000000001 0xe000000004b90760
r4 : 0x60000fffffffaf00 0x20000000003cd378
0x20000000003cdd40 0x0000000000000000
r8 : 0x00000000000000f2 0x0000000000000fff
0xe000000004b90758 0x0000000000000000
r12: 0xe00000010f08fc30 0xe00000010f088000
0x0000000000000006 0xe000000004998060
bk0 r16: 0xc003fffffc0f23c8 0x0010000000000661
0x0000000000000010 0x0013fffffc0f2671
bk0 r20: 0x00000a0200000000 0x00001a1008026018
0x0000000000000000 0x0000000000000000
bk0 r24: 0x0000000000000000 0x0000000000000000
0xc000000000000288 0x000000000000000f
bk0 r28: 0x2000000000307180 0x00001213085a6010
0x0000000000000000 0xa4009650a69a6a29
bk1 r16: 0xe000000004b90758 0x0000000000000000
0xc003fffffc000000 0xe000000004998068
bk1 r20: 0xe00000000499c048 0xe000000004b6fac0
0xe000000004b6fa60 0xe0000000049ef990
bk1 r24: 0xe000000004d75568 0xe000000004adef80
0xe000000004b7d670 0xe000000004b89ab8
bk1 r28: 0x60000fffffffaba0 0x0000000000000001
0x60000fffffffaa90 0xffffffffffffabf7
b0 : 0xe0000000046d21d0 0x400000000031c8e0
0x4000000000129080 0x0000000000000000
b4 : 0x0000000000000000 0x0000000000000000
0xe000000004402f70 0xe00000000480b300
k0 : 0x2000000000000000 0x5eb000ebc0004000
0x0000000000000000 0x0000000000000000
k4 : 0x000000000000010f 0xe0000003fcb48000
0x000000010f088000 0x000000010f080000
rr0 : 0x00000000009eb839 0x00000000009eb839
0x00000000009eb839 0x00000000009eb839
rr4 : 0x00000000009eb839 0x00000000009eb839
0x00000000009eb839 0x00000000009eb839
Platform Specific Error Info Section
Platform Specific Error Detail
Platform PCI Bus Error Info Section
PCI Bus Error Detail
Error Status: 0x121900 Error Type: 0x4 Bus ID: 0xe0 Bus Address: 0xf4040064 Requestor ID: 0xfed2e000 Target ID: 0xf4040064
OEM Specific Data
0x0000 2e 12 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0010 98 00 00 00 00 00 00 00 82 44 e6 9f 2d a0 f7 4e
0x0020 ad e6 c6 63 59 62 53 99 00 00 00 00 00 00 07 00
0x0030 1c 02 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0040 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
0x0050 00 00 00 00 00 00 00 00 64 00 04 f4 00 00 00 00
0x0060 50 1d 00 00 00 00 00 00 48 00 00 00 00 00 00 00
0x0070 3c 10 2e 12 46 01 b0 22 02 00 20 00 37 02 00 0f
0x0080 00 00 00 00 00 00 00 00 07 00 01 00 00 ff 13 00
0x0090 03 24 03 00 1b 3f 02 00 48 00 00 00 00 00 00 00
0x00a0 60 94 c5 3a 2e 75 aa a1 00 00 00 00 00 00 00 00
END HARDWARE ERROR STATE from acacia-2007-07-25-17:46:18-cpu0-mca0
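As a sanity check on the decode, here is a small Python sketch that re-derives a few of the flags from the raw processor state parameter and compares the "Bus Address" field with the bytes at offset 0x58 of the OEM Specific Data. The helper names `bit` and `le_qword` are my own, and reading the OEM bytes as a little-endian 64-bit value is only an assumption; I do not know the documented layout of that OEM block, I only notice that the bytes match.

```python
# Values copied by hand from the salinfo_decode output above.
PSP = 0x20000000FFF21120      # "processor state parameter"
BUS_ADDRESS = 0xF4040064      # "Bus Address" in the PCI Bus Error section
# OEM Specific Data row at offset 0x0050 (16 bytes).
OEM_ROW_0050 = "00 00 00 00 00 00 00 00 64 00 04 f4 00 00 00 00"


def bit(value: int, pos: int) -> int:
    """Return the bit of value at position pos (0 = least significant)."""
    return (value >> pos) & 1


def le_qword(hex_row: str, offset: int) -> int:
    """Read 8 bytes at offset within one hex-dump row as a little-endian integer.

    Assumption: the field is little-endian; the OEM layout is not known to me.
    """
    raw = bytes.fromhex(hex_row)
    return int.from_bytes(raw[offset:offset + 8], "little")


# A few of the flags salinfo_decode printed, keyed by their bit positions.
flags = {name: bit(PSP, pos)
         for name, pos in [("co", 7), ("ci", 8), ("mi", 12), ("cm", 18), ("bc", 61)]}
print(flags)  # {'co': 0, 'ci': 1, 'mi': 1, 'cm': 0, 'bc': 1}

# Bytes 0x58..0x5f of the OEM data, i.e. offset 8 within the 0x0050 row.
oem_addr = le_qword(OEM_ROW_0050, 8)
print(hex(oem_addr), oem_addr == BUS_ADDRESS)  # 0xf4040064 True
```

So the raw values agree with what salinfo_decode printed (bus check set, not continuable, not corrected), and the OEM data appears to repeat the same bus address, 0xf4040064.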
Can someone point me in the right direction for determining the cause of
the error?
Machine data:
HP Integrity rx2620, Itanium 2 1.6 GHz, 16 GB RAM
Red Hat Enterprise Linux ES release 3 (Taroon Update 5)
Linux acacia 2.4.21-32.EL #1 SMP Fri Apr 15 21:02:52 EDT 2005 ia64 ia64 ia64 GNU/Linux
[root acacia raw]# lspci
00:01.0 USB Controller: NEC Corporation USB (rev 41)
00:01.1 USB Controller: NEC Corporation USB (rev 41)
00:01.2 USB Controller: NEC Corporation USB 2.0 (rev 02)
00:02.0 IDE interface: Silicon Image, Inc. SiI 0649 Ultra ATA/100 PCI to ATA Host Controller (rev 02)
00:1c.0 Host bridge: Hewlett-Packard Company zx1 Local Bus Adapter (rev 32)
00:1d.0 Bridge: Hewlett-Packard Company zx1 I/O Controller (rev 23)
00:1e.0 Bridge: Hewlett-Packard Company zx1 System Bus Adapter (rev 23)
20:01.0 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
20:01.1 SCSI storage controller: LSI Logic / Symbios Logic 53c1030 PCI-X Fusion-MPT Dual Ultra320 SCSI (rev 08)
20:02.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
20:02.1 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
20:1e.0 Host bridge: Hewlett-Packard Company zx1 Local Bus Adapter (rev 32)
40:01.0 PCI bridge: IBM PCI-X to PCI-X Bridge (rev 03)
40:1e.0 Host bridge: Hewlett-Packard Company zx1 Local Bus Adapter (rev 32)
41:04.0 RAID bus controller: Compaq Computer Corporation Smart Array 64xx (rev 01)
60:1e.0 Host bridge: Hewlett-Packard Company zx1 Local Bus Adapter (rev 32)
80:01.0 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
80:01.1 Ethernet controller: Intel Corporation 82546GB Gigabit Ethernet Controller (rev 03)
80:1e.0 Host bridge: Hewlett-Packard Company zx1 Local Bus Adapter (rev 32)
c0:1e.0 Host bridge: Hewlett-Packard Company zx1 Local Bus Adapter (rev 32)
e0:01.0 Communication controller: Hewlett-Packard Company Auxiliary Diva Serial Port (rev 01)
e0:01.1 Serial controller: Hewlett-Packard Company Diva Serial [GSP] Multiport UART (rev 03)
e0:02.0 VGA compatible controller: ATI Technologies Inc Radeon RV100 QY [Radeon 7000/VE]
e0:1e.0 Host bridge: Hewlett-Packard Company zx1 Local Bus Adapter (rev 32)
[root acacia raw]#

Thanks a lot.

Ivan Marinkovic
Chile