
Problems with EDAC module



I am testing the RHEL4 U3 Beta on an Intel EM64T based system. This is the x86-64/EM64T version of the distribution.

The install completed successfully, but upon reboot, the system panics during rc.sysinit around "remounting root" or "No Software RAID found" (from dmraid -ay).

The panic is:

MC0: Uncorrected Error

That's clearly from the new EDAC feature which was added in the release.

I've tried two different motherboard/CPU sets and two completely different sets of RAM. None of this hardware has exhibited any problems in the past. So I'm fairly certain this is a false positive.

I tried several different ways to disable the "panic_on_ue" behavior on the kernel command line, but neither "edac_mc.panic_on_ue=0" nor any of the other variants I tried had any effect.

Ultimately, I had to boot into rescue mode and kill the edac modules with the following in /etc/modprobe.conf:

alias e752x_edac /dev/null
alias edac_mc /dev/null

Then I was able to boot and the system appears to be running without problems.

Further, I am able to "insmod edac_mc panic_on_ue=0" and load e752x_edac without problems. The e752x_edac module does *not* log any memory errors after I manually load the modules.

Now that I had the system up, I changed the /etc/modprobe.conf to read:

options edac_mc panic_on_ue=0

and tried rebooting the system. Now the system boots and runs just fine except that the log is filling up with the attached error message. Clearly something isn't initialized or being read correctly.

But after unloading and reloading the e752x_edac module, everything is fine:

MC0: Removed device 0 for e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)
tolm = 20000, remapbase = ffc000, remaplimit = 0
MC0: Giving out device to e752x_edac E7520: PCI 0000:00:00.0 (0000:00:00.0)


And no further errors are reported.

So it seems that the hotplug loading of e752x_edac in /etc/rc.sysinit (via kmodule) is causing things to be initialized badly. Perhaps there is a race condition of some kind between edac_mc and e752x_edac loading?

What additional information and tests can I run to track down the root of the problem?

I've searched Bugzilla, but I haven't found any bugs *at all* against the RHEL4U3 Beta. Perhaps I'm searching the wrong categories?

I'll file a Bugzilla report if I can get some advice on the right product/release/component to log against.

Thanks!
:v)

PS. I've got a system with a different motherboard but the same chipset that I'll try next.

Fatal Error PCI Express C1
Fatal Error PCI Express C
Fatal Error PCI Express B1
Fatal Error PCI Express B
Fatal Error PCI Express A1
Fatal Error PCI Express A
Fatal Error DMA Controler
Fatal Error HUB Interface
Fatal Error System Bus
Fatal Error DRAM Controler
Non-Fatal Error PCI Express C1
Non-Fatal Error PCI Express C
Non-Fatal Error PCI Express B1
Non-Fatal Error PCI Express B
Non-Fatal Error PCI Express A1
Non-Fatal Error PCI Express A
Non-Fatal Error DMA Controler
Non-Fatal Error HUB Interface
Non-Fatal Error System Bus
Non-Fatal Error DRAM Controler
Non-Fatal Error Internal Buffer
Fatal Error PCI Express C1
Fatal Error PCI Express C
Fatal Error PCI Express B1
Fatal Error PCI Express B
Fatal Error PCI Express A1
Fatal Error PCI Express A
Fatal Error DMA Controler
Fatal Error HUB Interface
Fatal Error System Bus
Fatal Error DRAM Controler
Non-Fatal Error PCI Express C1
Non-Fatal Error PCI Express C
Non-Fatal Error PCI Express B1
Non-Fatal Error PCI Express B
Non-Fatal Error PCI Express A1
Non-Fatal Error PCI Express A
Non-Fatal Error DMA Controler
Non-Fatal Error HUB Interface
Non-Fatal Error System Bus
Non-Fatal Error DRAM Controler
Non-Fatal Error Internal Buffer
Fatal Error HI Address or Command Parity
Fatal Error HI Illegal Access
Fatal Error Out of Range Access
Fatal Error Enhanced Config Access
Non-Fatal Error HI Internal Parity
Non-Fatal Error HI Data Parity
Non-Fatal Error Hub Interface Target Abort
Fatal Error HI Address or Command Parity
Fatal Error HI Illegal Access
Fatal Error Out of Range Access
Fatal Error Enhanced Config Access
Non-Fatal Error HI Internal Parity
Non-Fatal Error HI Data Parity
Non-Fatal Error Hub Interface Target Abort
Fatal Error System Bus PCI Express C1
Fatal Error System Bus PCI Express C
Fatal Error System Bus HUB Interface
Non-Fatal Error System Bus PCI Express B1
Non-Fatal Error System Bus PCI Express B
Non-Fatal Error System Bus PCI Express A1
Non-Fatal Error System Bus PCI Express A
Non-Fatal Error System Bus DMA Controler
Non-Fatal Error System Bus System Bus
Non-Fatal Error System Bus DRAM Controler
Fatal Error System Bus PCI Express C1
Fatal Error System Bus PCI Express C
Fatal Error System Bus HUB Interface
Non-Fatal Error System Bus PCI Express B1
Non-Fatal Error System Bus PCI Express B
Non-Fatal Error System Bus PCI Express A1
Non-Fatal Error System Bus PCI Express A
Non-Fatal Error System Bus DMA Controler
Non-Fatal Error System Bus System Bus
Non-Fatal Error System Bus DRAM Controler
Non-Fatal Error Internal PMWB to DRAM parity
Non-Fatal Error Internal PMWB to System Bus Parity
Non-Fatal Error Internal System Bus or IO to PMWB Parity
Non-Fatal Error Internal DRAM to PMWB Parity
Non-Fatal Error Internal PMWB to DRAM parity
Non-Fatal Error Internal PMWB to System Bus Parity
Non-Fatal Error Internal System Bus or IO to PMWB Parity
Non-Fatal Error Internal DRAM to PMWB Parity
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: CE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: CE - no information available: INTERNAL ERROR
MC0: UE - no information available: e752x UE log memory write
MC0: UE - no information available: e752x UE log memory write
MC0: could not look up page error address ffffff
MC0: CE page 0xffffff, row -1 : Memory read retry
MC0: could not look up page error address ffffff
MC0: CE page 0xffffff, row -1 : Memory read retry
MC0: Memory threshold CE
MC0: Memory threshold CE
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
MC0: could not look up page error address ffffff
MC0: INTERNAL ERROR: row out of range (-1 >= 8)
MC0: UE - no information available: INTERNAL ERROR
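
Since the log above repeats the same handful of messages, a tally is easier to read than the raw output. The following is a minimal sketch in Python, assuming the output has been saved to a file and is piped in on stdin; the function names (mc_lines, summarize_edac_log) are my own illustrative choices and are not part of any EDAC tooling.

```python
import sys
from collections import Counter


def mc_lines(text, controller="MC0"):
    """Return only the lines reported by one memory controller (e.g. "MC0: ...")."""
    prefix = controller + ":"
    return [line.strip() for line in text.splitlines()
            if line.strip().startswith(prefix)]


def summarize_edac_log(lines):
    """Count identical messages; returns (message, count) pairs, most frequent first."""
    return Counter(lines).most_common()


if __name__ == "__main__":
    # Read the saved log from stdin and print one line per distinct message.
    data = sys.stdin.read()
    for message, count in summarize_edac_log(mc_lines(data)):
        print(f"{count:4d}  {message}")
```

Run it as "python summarize.py < edac.log" (both file names are hypothetical). It only groups identical strings, so it does not interpret the error codes or say anything about whether the errors are real.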
