Re: Need help with Reboot cause

PS = Pete Stieber
PS>> I have a dual opteron system that has been acting as
PS>> the worldly node for a small cluster of computers
PS>> since September, 2004.  The machine is running the
PS>> latest x86_64 Fedora 10 kernel that I recently loaded
PS>> (April 2).  The machine reboots without warning.  I
PS>> can't find the cause in log files (maybe I'm not
PS>> looking in the correct log).
PS>> I'm currently running memtest.  If all of the tests
PS>> pass, could the community suggest other diagnostic
PS>> tasks or information I could post to help diagnose the
PS>> problem?

m> Have you tried going back to the previous kernel?

The machine is still running memtest (no errors so far), but I already removed the prior kernel. I did notice reboots with the prior kernel. BTW my current kernel is

Reboots indicated by information in /var/log/messages...

Sunday    March 29   4:08
Tuesday   March 31   7:02
Thursday  April  2  18:27 Intentional reboot due to new kernel
Friday    April  3   1:36
Sunday    April  5   1:37
Sunday    April  5   2:48
Sunday    April  5   9:43
Sunday    April  5  13:20 as I was typing this email
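
To see whether the reboots cluster around a particular time of day, the gaps between the entries above can be worked out with a short script. This is only a sketch: the year 2009 is my assumption, and the names REBOOTS, reboot_gaps, and format_gap are made up for illustration.

```python
from datetime import datetime

# Reboot times copied from the list above. The year (2009) is an assumption.
REBOOTS = [
    "2009-03-29 04:08",
    "2009-03-31 07:02",
    "2009-04-02 18:27",  # intentional reboot for the new kernel
    "2009-04-03 01:36",
    "2009-04-05 01:37",
    "2009-04-05 02:48",
    "2009-04-05 09:43",
    "2009-04-05 13:20",
]


def reboot_gaps(stamps):
    """Return the time elapsed between each pair of consecutive reboots."""
    times = [datetime.strptime(s, "%Y-%m-%d %H:%M") for s in stamps]
    return [later - earlier for earlier, later in zip(times, times[1:])]


def format_gap(gap):
    """Format a timedelta as hours and minutes, e.g. 7h09m."""
    minutes = int(gap.total_seconds()) // 60
    return f"{minutes // 60}h{minutes % 60:02d}m"


if __name__ == "__main__":
    for stamp, gap in zip(REBOOTS[1:], reboot_gaps(REBOOTS)):
        print(f"{stamp}  {format_gap(gap)} since the previous reboot")
```

The last gap works out to 3h37m, which agrees with the "up 3:36" that top reported just before the 13:20 reboot. The April 3 and April 5 entries at 1:36 and 1:37 are almost exactly 48 hours apart.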

m> Did you check dmesg and /var/log/messages?

Yes.  I can see reboots, but not the cause.

m> Does it boot normally and then just fail at some random
m> interval or is it consistently failing at the same point?

I have had top running during a few of the reboots. I have forced a couple of them by starting my nightly build process. The linker/loader has been running during some of the reboots...

top - 13:19:53 up  3:36,  6 users,  load average: 1.27, 2.70, 2.32
Tasks: 138 total,   6 running, 132 sleeping,   0 stopped,   0 zombie
Cpu(s): 40.8%us, 13.8%sy, 0.0%ni, 42.5%id, 2.7%wa, 0.0%hi, 0.3%si, 0.0%st
Mem:   2060232k total,  1683996k used,   376236k free,   164484k buffers
Swap:  2031608k total,       56k used,  2031552k free,  1230796k cached

 8878 pstieber  20   0 34552  25m 1096 R  7.6  1.3   0:00.23 ld
 8884 pstieber  20   0 48284  27m 1080 R  5.0  1.4   0:00.15 ld
    7 root      15  -5     0    0    0 S  0.3  0.0   0:00.17 ksoftirqd/1
22427 pstieber  20   0 14880 1208  872 R  0.3  0.1   0:03.49 top
    1 root      20   0  4096  876  616 S  0.0  0.0   0:00.71 init

Another instance

top - 06:55:13 up 17:34,  2 users,  load average: 2.83, 2.59, 1.86
Tasks: 127 total,   2 running, 125 sleeping,   0 stopped,   0 zombie
Cpu(s): 45.1%us, 4.7%sy, 0.0%ni, 49.8%id, 0.5%wa, 0.0%hi, 0.0%si, 0.0%st
Mem:   2060232k total,  1763404k used,   296828k free,   177052k buffers
Swap:  2031608k total,       56k used,  2031552k free,  1271964k cached

 5757 pstieber  20   0 79788  69m 1080 R 12.3  3.5   0:00.37 ld
    1 root      20   0  4096  876  616 S  0.0  0.0   0:00.68 init
    2 root      15  -5     0    0    0 S  0.0  0.0   0:00.00 kthreadd

I'm not sure ld is always running when the machine reboots.
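
Since I only catch the top output when I happen to be watching, one option would be to log the load to disk continuously so the last sample survives a sudden reboot. The following is a rough sketch, assuming a Python 3 interpreter is available; append_snapshot and the log path used in the note below are hypothetical names. It reads /proc/loadavg and forces each line to disk with fsync.

```python
import os
import time


def append_snapshot(log_path, source="/proc/loadavg"):
    """Append one timestamped sample from `source` to `log_path`.

    The file is flushed and fsync'ed so the line should reach the disk
    even if the machine resets shortly afterwards.
    """
    with open(source) as src:
        sample = src.read().strip()
    line = f"{time.strftime('%Y-%m-%d %H:%M:%S')} {sample}\n"
    with open(log_path, "a") as log:
        log.write(line)
        log.flush()             # push Python's buffer to the OS
        os.fsync(log.fileno())  # ask the OS to write it to disk
    return line
```

Calling append_snapshot("/var/tmp/loadavg.log") every few seconds from a loop or a cron job would leave a record of the load just before each reboot. It would not capture which processes were running; that would need a different source than /proc/loadavg.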

m> Other things you may consider:
m> CPU type?

Motherboard: Tyan Thunder K8W (S2885ANRF)
CPUs: Dual Opteron 244 (1.8 GHz) processors
Memory: 2 GB (4 x 512 MB CT6472Y40B DDR PC3200 from Crucial)

m> temperature?

Is there a command to monitor this while running the OS?
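
One possibility I am aware of is the kernel's hwmon interface under /sys/class/hwmon, which reports temperatures in millidegrees Celsius when a sensor driver is loaded. I have not confirmed that a driver for this board's sensors is loaded, so treat the following as a sketch only. read_temps is an illustrative name, and the two directory layouts it searches are assumptions about where the files might appear.

```python
import glob
import os


def read_temps(root="/sys/class/hwmon"):
    """Return {path: degrees_C} for every temp*_input file found under root."""
    patterns = [
        os.path.join(root, "hwmon*", "temp*_input"),
        # Assumed alternative layout with the files one level deeper.
        os.path.join(root, "hwmon*", "device", "temp*_input"),
    ]
    readings = {}
    for pattern in patterns:
        for path in sorted(glob.glob(pattern)):
            try:
                with open(path) as f:
                    # Values are in millidegrees Celsius.
                    readings[path] = int(f.read().strip()) / 1000.0
            except (OSError, ValueError):
                continue  # sensor not readable or not a number; skip it
    return readings


if __name__ == "__main__":
    for path, degrees in read_temps().items():
        print(f"{path}: {degrees:.1f} C")
```

If no sensor driver is loaded, the function simply returns an empty dictionary.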

m> potential hard drive issue?

I have 3 SATA drives running. It has been a long time since I last did this; how does one manually run a disk check?

m> any new hardware attached or installed recently?


m> Notice any power surges or brownouts?

The machine is on a UPS that deals with this.

m> any other nodes having issues?

No, and they are not on UPSs.  They also do not carry as large a workload.

The machine in question is used for nightly builds and regression tests. I use distcc with the compute nodes to perform the builds.

The machine also runs samba to provide a network share to Windows users and provides authentication using Windows domain accounts.

m> Recent power surge zapped a board, DSL modem,
m> and the surge protector.

I doubt this is the problem.

Memtest made it through the first pass of all tests successfully.

Thanks for the suggestions, especially considering my vague information.


