Re: XP1000 667 mhz locking up.
- From: Jay Estabrook compaq com
- To: axp-list redhat com
- Subject: Re: XP1000 667 mhz locking up.
- Date: Thu, 16 Dec 1999 17:34:38 -0500
>>> "Wes Bauske" said:
> Peter Rival wrote:
> > Greg Lindahl wrote:
> >
> > > > tsunami_machine_check: vector=0x630 la_ptr=0xfffffc0000006000
> > > > pc=0x120230514 size=0x80 procoffset=0x18 sysoffset
> > >
> > > vector=0x630 is the key number. Usually these are hardware problems,
> > > and if you could get Compaq to tell you what 0x630 is for that chipset,
> > > you'd know.
The details for machine checks are in the Alpha Architecture Reference
Manual, if you have the time to dig them out. If not:
Machine checks are of four (4) types, as in the following table:
Code  Reason                    Example or Common Cause
====  ======                    =======================
620   System Correctable        correctable errors in the memory subsystem,
                                eg single bit ECC errors, detected async to
                                processor execution
630   Processor Correctable     correctable cache and TLB errors, detected
                                internally by the processor
660   System Uncorrectable      unrecoverable memory errors
670   Processor Uncorrectable   unrecoverable cache or TLB errors, or
                                read of a non-existent I/O space location
Given a 630, one would suspect either (some part of) the CPU, or
perhaps the motherboard cache, but probably not the memory DIMMs, as
those usually report their errors as 620, I believe.
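
For anyone who wants to sort console logs by machine check type, here is a
minimal sketch of the table as a lookup. It is only an illustration: the
names (MCHECK_TYPES, classify_mcheck) are made up for this example and are
not kernel symbols, and it assumes the log line carries a "vector=0x..."
field like the one quoted earlier in the thread.

```python
import re

# Vector codes from the table: (where the error was detected, correctable?)
# These names are hypothetical helpers, not part of the Linux kernel.
MCHECK_TYPES = {
    0x620: ("System", True),
    0x630: ("Processor", True),
    0x660: ("System", False),
    0x670: ("Processor", False),
}

# Assumes the log line contains a field of the form "vector=0x630".
VECTOR_RE = re.compile(r"vector=(0x[0-9a-fA-F]+)")


def classify_mcheck(line):
    """Return (vector, source, correctable) for a log line, or None
    if the line has no vector field."""
    m = VECTOR_RE.search(line)
    if m is None:
        return None
    vector = int(m.group(1), 16)
    if vector not in MCHECK_TYPES:
        # Not one of the four codes listed in the table.
        return (vector, "unknown", None)
    source, correctable = MCHECK_TYPES[vector]
    return (vector, source, correctable)
```

Fed the line from the original report, this should classify the 0x630 as
processor-detected and correctable, which matches the reading above.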
> If it's a correctable error, the system shouldn't stop.
> It should be just a warning to get your HW checked out.
Well, yes, but if you get a continuous stream of them, it hasn't stopped
but the system is hung for all intents and purposes... :-\
> My system has these occasionally too and does not stop.
Keyword here is "occasionally".
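
To make the "occasionally" versus "continuous stream" distinction concrete,
here is a rough sketch of a sliding-window counter one might run over log
timestamps. Everything in it is an assumption for illustration: the class
name, the 60-second window, and the threshold of 100 are arbitrary and would
need tuning for a real system.

```python
from collections import deque


class McheckRate:
    """Hypothetical helper: flags a burst of correctable machine checks."""

    def __init__(self, window_secs=60.0, threshold=100):
        # Both defaults are illustrative guesses, not measured values.
        self.window_secs = window_secs
        self.threshold = threshold
        self.times = deque()

    def record(self, t):
        """Record one machine check at time t (in seconds). Return True
        if at least `threshold` checks fall inside the current window."""
        self.times.append(t)
        # Drop timestamps that have aged out of the window.
        while self.times and t - self.times[0] > self.window_secs:
            self.times.popleft()
        return len(self.times) >= self.threshold
```

A handful of hits spread over hours never trips it; a tight burst does,
and that is the case where the box is effectively hung.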
> I would suspect the second of your messages:
>
> > TSUNAMI_pci_clr_err: PERROR after read 0x0
>
> Is what your hang is from. Either your PCI bus is
> hosed or a card in it is and is not responding.
Not likely; that routine and message appear during *all* machine checks
on the TSUNAMI-based (EV6) boxes. Also, I've never been able to attribute
correctable errors to a PCI problem... ;-}
> It would be nice if the error message was more specific
> about what it was doing when it failed... (ie., what the
> error was and it was trying to clear what device?)
Indeed, and there are some COMPAQ folks that are looking into it at
this very moment, so the situation should improve. Up to this point it
has been low enough frequency-of-need to stay on our TODO list but
never fully get our attention.
Note that it *is* possible (hardware-wise) to disable the reporting of
correctable errors; I can suggest patches if you get desperate.
However, I wouldn't just blithely go ahead and do it, as it is only
the *reporting* to system software (ie the LINUX kernel) that gets
disabled - the PALcode still takes each machine check interrupt and
then decides to ignore it because it has been disabled.
--Jay++
-----------------------------------------------------------------------------
Jay A Estabrook Alpha Engineering - LINUX Project
Compaq Computer Corp. - MRO1-2/K20 (508) 467-2080
200 Forest Street, Marlboro, MA 01752 Jay.Estabrook@compaq.com
-----------------------------------------------------------------------------