Kernel 2.6.9-55 issues

Troy Knabe knabe at 4j.lane.edu
Mon May 14 18:11:00 UTC 2007


I can't reboot right now.  I am not trying to emulate any kind of RAID.  Its internal disk is just a 250 GB SATA drive.  I have rebooted on the previous kernel and it is working perfectly well.  Attached are the boot logs (from both the clean kernel and the one with errors). The messages file is too large to send to the Red Hat list.

Thanks
-Troy 

-----Original Message-----
From: redhat-list-bounces at redhat.com [mailto:redhat-list-bounces at redhat.com] On Behalf Of George Magklaras
Sent: Sunday, May 13, 2007 11:02 PM
To: General Red Hat Linux discussion list
Subject: Re: Kernel 2.6.9-55 issues

Troy,

I assume you have a backup if this is a production system. Can you try booting the system with the "nodmraid" option and see the outcome? It would help to tell me the disk config, as originally requested. There are issues with some nVidia SATA controllers. If these work essentially as "fake RAID" devices (as far as I know, the lspci output below does not suggest a real hardware RAID controller), the dmraid module could cause hiccups and kernel panics. Disabling it with the nodmraid option in the kernel boot line (from your bootloader) could have varying results, depending on what type of RAID you are trying to emulate. That is the only thing I can suspect, if your hardware works perfectly well on the previous kernel. Any chance of capturing the boot log and your dmesg when your system boots properly (previous kernel)?

GM
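
For reference, the nodmraid suggestion above amounts to appending one token to the kernel line in the bootloader configuration. The following is a minimal Python sketch of that edit, not a tested procedure for this system. The helper name add_kernel_option is made up for illustration, and the sample kernel line (image path and root= value) is hypothetical, not taken from the machine in this thread.

```python
def add_kernel_option(kernel_line: str, option: str) -> str:
    """Return kernel_line with option appended, unless it is already present."""
    tokens = kernel_line.split()
    if option in tokens:
        # Already on the line; leave it untouched.
        return kernel_line
    return kernel_line.rstrip() + " " + option


# Hypothetical GRUB kernel line; the real image path and root= value will differ.
sample = "kernel /vmlinuz-<version> ro root=LABEL=/"
print(add_kernel_option(sample, "nodmraid"))
```

The function is idempotent, so running it twice on the same line does not add the option a second time.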


Troy Knabe wrote:
> The system boots and starts the kernel, then crashes. I wasn't watching the first time, so on a subsequent boot it gets to the point where it does a disk check because the system was not shut down cleanly.  It now crashes and reboots at different points during the disk check.  Thanks for any help you can provide.  
> 
> lspci
> 00:00.0 Memory controller: nVidia Corporation CK804 Memory Controller (rev a3)
> 00:01.0 ISA bridge: nVidia Corporation CK804 ISA Bridge (rev a3)
> 00:01.1 SMBus: nVidia Corporation CK804 SMBus (rev a2)
> 00:02.0 USB Controller: nVidia Corporation CK804 USB Controller (rev a2)
> 00:02.1 USB Controller: nVidia Corporation CK804 USB Controller (rev a3)
> 00:06.0 IDE interface: nVidia Corporation CK804 IDE (rev f2)
> 00:07.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> 00:08.0 IDE interface: nVidia Corporation CK804 Serial ATA Controller (rev f3)
> 00:09.0 PCI bridge: nVidia Corporation CK804 PCI Bridge (rev a2)
> 00:0a.0 Ethernet controller: nVidia Corporation CK804 Ethernet Controller (rev a3)
> 00:0b.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> 00:0c.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> 00:0d.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> 00:0e.0 PCI bridge: nVidia Corporation CK804 PCIE Bridge (rev a3)
> 00:18.0 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] HyperTransport Technology Configuration
> 00:18.1 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Address Map
> 00:18.2 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] DRAM Controller
> 00:18.3 Host bridge: Advanced Micro Devices [AMD] K8 [Athlon64/Opteron] Miscellaneous Control
> 01:05.0 VGA compatible controller: ATI Technologies Inc Rage XL (rev 27)
> 04:00.0 Ethernet controller: Broadcom Corporation NetXtreme BCM5721 Gigabit Ethernet PCI Express (rev 11)
> 
> lsmod
> Module                  Size  Used by
> ipt_state               1985  1 
> ip_conntrack           41077  1 ipt_state
> ipt_multiport           2113  3 
> ipt_LOG                 6593  1 
> iptable_filter          3009  1 
> ip_tables              17601  4 ipt_state,ipt_multiport,ipt_LOG,iptable_filter
> parport_pc             24833  0 
> lp                     12333  0 
> parport                37513  2 parport_pc,lp
> autofs4                25157  0 
> i2c_dev                11585  0 
> i2c_core               22337  1 i2c_dev
> sunrpc                163237  1 
> dm_mirror              30893  0 
> dm_mod                 59989  1 dm_mirror
> button                  6737  0 
> battery                 9029  0 
> ac                      4933  0 
> md5                     4161  1 
> ipv6                  235777  39 
> joydev                 10497  0 
> ohci_hcd               21841  0 
> ehci_hcd               31301  0 
> forcedeth              24001  0 
> tg3                   107077  0 
> ext3                  117193  3 
> jbd                    71385  1 ext3
> sata_nv                 9541  4 
> libata                 66333  1 sata_nv
> sd_mod                 17217  5 
> scsi_mod              122445  2 libata,sd_mod
> 
>  
> 
> -----Original Message-----
> From: redhat-list-bounces at redhat.com 
> [mailto:redhat-list-bounces at redhat.com] On Behalf Of George Magklaras
> Sent: Friday, May 11, 2007 1:27 AM
> To: General Red Hat Linux discussion list
> Subject: Re: Kernel 2.6.9-55 issues
> 
> Troy, what is your disk subsystem on the x2200? At what point does it fail to boot? Does it reach the bootloader and at least start the kernel? Also, could you run 'lspci' and 'lsmod' and show the output from your good kernel?
> 
> 
> ##The following is a guess##
> I don't have that kind of Sun kit, but there are all sorts of references to stability problems with AMD based chipsets. Also, FYI there is a kernel panic report for that kernel here:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=239484
> 
> This bug report concerns the Error Detection And Correction (EDAC) modules (hence the lsmod request). The messages come from the edac kernel module thinking that there is something wrong with the bus or the memory. Your x2200 probably panics (any messages from the console during the boot failure?), as there is an option that makes the kernel panic when EDAC detects parity errors. On your x4100s that are able to boot but give the EDAC messages, run lsmod and grep -i for edac.  The report seems to point to a 'noedac' boot option, but I am not sure.
> 
> On the x4100s that spawn the edac messages, see if /etc/modprobe.conf contains any references to the edac modules. You could try removing them to see if that makes a difference.
> 
> GM
> 
> 
> Troy Knabe wrote:
>> I upgraded from the 2.6.9-42 kernel to 2.6.9-55 over the weekend.  I have had issues with 3 servers.  One server wouldn't boot (x2200, AMD 148 proc).  The other two are x4100s, each with two Dual Core AMD Opteron(tm) Processor 285 CPUs.  The two x4100s are spewing these errors, but if I reboot them with the old 2.6.9-42 kernel then I don't get any of them.  Is anyone else experiencing issues with the new kernel?
>>  
>> thanks
>> -Troy
>>  
>> May  9 16:25:43 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node response), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
>> May  9 16:25:43 hostname kernel: MC0: CE page 0xc, offset 0x108, grain 8, syndrome 0x4b39, row 0, channel 1, label "": k8_edac
>> May  9 16:25:43 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
>> May  9 16:25:43 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
>> May  9 16:25:44 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
>> May  9 16:25:44 hostname kernel: MC0: CE page 0x1f1, offset 0x0, grain 8, syndrome 0x28d8, row 3, channel 1, label "": k8_edac
>> May  9 16:25:44 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
>> May  9 16:25:45 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
>> May  9 16:25:46 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
>> May  9 16:25:46 hostname kernel: MC0: CE page 0x1f1, offset 0x0, grain 8, syndrome 0x28d8, row 3, channel 1, label "": k8_edac
>> May  9 16:25:46 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
>> May  9 16:25:46 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
>> May  9 16:25:47 hostname kernel: EDAC k8 MC0: general bus error: participating processor(local node origin), time-out(no timeout) memory transaction type(generic read), mem or i/o(mem access), cache level(generic)
>> May  9 16:25:47 hostname kernel: MC0: CE page 0x138, offset 0xac0, grain 8, syndrome 0xeeff, row 0, channel 1, label "": k8_edac
>> May  9 16:25:47 hostname kernel: MC0: CE - no information available: k8_edac Error Overflow set
>> May  9 16:25:47 hostname kernel: EDAC k8 MC0: extended error code: ECC chipkill x4 error
>>  
> 
> --
> --
> George Magklaras
> 
> Senior Computer Systems Engineer/UNIX Systems Administrator EMBnet 
> Technical Management Board The Biotechnology Centre of Oslo, 
> University of Oslo http://www.biotek.uio.no/
> 
> EMBnet Norway:	http://www.no.embnet.org/
> 
> 
> --
> redhat-list mailing list
> unsubscribe mailto:redhat-list-request at redhat.com?subject=unsubscribe
> https://www.redhat.com/mailman/listinfo/redhat-list
> 
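
When collecting EDAC messages like the ones quoted in this thread for a bug report, it can help to see which row and channel the corrected errors (CE) cluster on. The following is a rough Python sketch under stated assumptions: the function name tally_corrected_errors and the regular expression are my own, written against the message format shown in the quoted log only, and the sample lines are copied from that log with the mail wrapping undone.

```python
import re
from collections import Counter

# Matches the "MCn: CE page ..., row R, channel C" reports seen in the quoted log.
# Assumption: the field order is fixed, as it is in every CE line in this thread.
CE_PATTERN = re.compile(
    r"MC(?P<mc>\d+): CE page (?P<page>0x[0-9a-f]+), "
    r"offset (?P<offset>0x[0-9a-f]+), grain \d+, "
    r"syndrome (?P<syndrome>0x[0-9a-f]+), "
    r"row (?P<row>\d+), channel (?P<channel>\d+)"
)


def tally_corrected_errors(log_text: str) -> Counter:
    """Count CE reports per (memory controller, row, channel)."""
    counts = Counter()
    for match in CE_PATTERN.finditer(log_text):
        key = (int(match["mc"]), int(match["row"]), int(match["channel"]))
        counts[key] += 1
    return counts


# The four CE lines from the log quoted in this thread.
sample = '''
May  9 16:25:43 hostname kernel: MC0: CE page 0xc, offset 0x108, grain 8, syndrome 0x4b39, row 0, channel 1, label "": k8_edac
May  9 16:25:44 hostname kernel: MC0: CE page 0x1f1, offset 0x0, grain 8, syndrome 0x28d8, row 3, channel 1, label "": k8_edac
May  9 16:25:46 hostname kernel: MC0: CE page 0x1f1, offset 0x0, grain 8, syndrome 0x28d8, row 3, channel 1, label "": k8_edac
May  9 16:25:47 hostname kernel: MC0: CE page 0x138, offset 0xac0, grain 8, syndrome 0xeeff, row 0, channel 1, label "": k8_edac
'''

for (mc, row, channel), count in sorted(tally_corrected_errors(sample).items()):
    print(f"MC{mc} row {row} channel {channel}: {count} corrected error(s)")
```

The tally only summarizes what the driver reported. It does not say whether the memory is actually faulty or whether the new kernel's EDAC module is misreporting, which is the open question in this thread.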



