
Re: XP1000 667 MHz locking up.



>>> "Wes Bauske" said:
> Peter Rival wrote:
> > Greg Lindahl wrote:
> > 
> > > > tsunami_machine_check: vector=0x630 la_ptr=0xfffffc0000006000
> > > >                  pc=0x120230514 size=0x80 procoffset=0x18 sysoffset
> > >
> > > vector=0x630 is the key number. Usually these are hardware problems,
> > > and if you could get Compaq to tell you what 0x630 is for that chipset,
> > > you'd know.

The details for machine checks are in the Alpha Architecture Reference
Manual, if you have the time to dig them out. If not:

Machine checks are of four (4) types, as in the following table
(codes are hexadecimal, as in the vector=0x630 above):

Code   Reason                    Example or Common Cause
====   ======                    =======================

620    System Correctable        correctable errors in the memory subsystem,
                                 e.g. single-bit ECC errors, detected
                                 asynchronously to processor execution

630    Processor Correctable     correctable cache and TLB errors, detected
                                 internally by the processor

660    System Uncorrectable      unrecoverable memory errors

670    Processor Uncorrectable   unrecoverable cache or TLB errors, or a
                                 read of a non-existent I/O space location

Given a 630, one would suspect either (some part of) the CPU, or
perhaps the motherboard cache, but probably not the memory DIMMs, as
errors there are usually reported as 620, I believe.
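
For anyone who wants to sort through a log full of these, here is a minimal
sketch (not anything from the kernel itself) that pulls the vector out of a
line like the one quoted above and looks it up in the table. The names
classify_mcheck and MCHECK_TYPES are made up for illustration, and the regular
expression assumes the "..._machine_check: vector=0x..." layout shown in this
thread; other kernels or platforms may print something different.

```python
import re

# Machine check vectors from the table above (hexadecimal).
# Each entry maps to (description, is_correctable).
MCHECK_TYPES = {
    0x620: ("System Correctable", True),
    0x630: ("Processor Correctable", True),
    0x660: ("System Uncorrectable", False),
    0x670: ("Processor Uncorrectable", False),
}

# Assumed log layout: "<platform>_machine_check: vector=0x630 ..."
VECTOR_RE = re.compile(r"machine_check:\s+vector=0x([0-9a-fA-F]+)")


def classify_mcheck(line):
    """Return (vector, description, correctable), or None if no vector is found."""
    match = VECTOR_RE.search(line)
    if match is None:
        return None
    vector = int(match.group(1), 16)
    # Vectors not in the table are reported as unknown rather than guessed at.
    description, correctable = MCHECK_TYPES.get(vector, ("Unknown", None))
    return vector, description, correctable


if __name__ == "__main__":
    sample = "tsunami_machine_check: vector=0x630 la_ptr=0xfffffc0000006000"
    print(classify_mcheck(sample))
```

Feeding each line of the kernel log through classify_mcheck and tallying the
results gives a quick picture of whether you are seeing only 630s or a mix.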

> If it's a correctable error, the system shouldn't stop.
> It should be just a warning to get your HW checked out.

Well, yes, but if you get a continuous stream of them, the system hasn't
stopped, but it is hung for all intents and purposes... :-\
 
> My system has these occasionally too and does not stop.

Keyword here is "occasionally".
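
To put a rough number on "occasionally" versus "continuous stream", one could
count how many correctable machine checks land inside a short time window.
The sketch below assumes you have already extracted event times (in seconds)
from your logs; the function names, the one-second window, and the threshold
of 100 are arbitrary illustrative choices, not values taken from any Alpha
documentation.

```python
from collections import deque


def max_events_in_window(timestamps, window_seconds):
    """Largest number of events that fall within any window of the given length."""
    recent = deque()
    best = 0
    for t in sorted(timestamps):
        recent.append(t)
        # Discard events that fall outside the window ending at time t.
        while t - recent[0] >= window_seconds:
            recent.popleft()
        best = max(best, len(recent))
    return best


def looks_like_storm(timestamps, window_seconds=1.0, threshold=100):
    """Heuristic only: True if any window holds at least `threshold` events."""
    return max_events_in_window(timestamps, window_seconds) >= threshold


if __name__ == "__main__":
    occasional = [0.0, 3600.0, 7200.0]          # one event per hour
    storm = [i * 0.001 for i in range(500)]     # 500 events in half a second
    print(looks_like_storm(occasional), looks_like_storm(storm))
```

A box that trips the threshold is the "hung for all intents and purposes"
case; one that never comes close is probably just nagging you to check the
hardware.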

> I would suspect the second of your messages:
> 
> > TSUNAMI_pci_clr_err: PERROR after read 0x0
> 
> Is what your hang is from. Either your PCI bus is
> hosed or a card in it is and is not responding.

Not likely; that routine and message appear during *all* machine checks
on the TSUNAMI-based (EV6) boxes. Also, I've never been able to attribute
correctable errors to a PCI problem... ;-}

> It would be nice if the error message was more specific
> about what it was doing when it failed... (ie., what the
> error was and it was trying to clear what device?)

Indeed, and there are some COMPAQ folks that are looking into it at
this very moment, so the situation should improve. Up to this point it
has been low enough frequency-of-need to stay on our TODO list but
never fully get our attention.

Note that it *is* possible (hardware-wise) to disable the reporting of
correctable errors; I can suggest patches if you get desperate.

However, I wouldn't just blithely go ahead and do it, as it is only
the *reporting* to system software (i.e. the LINUX kernel) that gets
disabled: the PALcode still takes each machine check interrupt and
then decides to ignore it because reporting has been disabled.

--Jay++

-----------------------------------------------------------------------------
Jay A Estabrook                            Alpha Engineering - LINUX Project
Compaq Computer Corp. - MRO1-2/K20         (508) 467-2080
200 Forest Street, Marlboro, MA  01752     Jay.Estabrook@compaq.com
-----------------------------------------------------------------------------



