[Linux-cluster] Kernel panic

James Chamberlain jamesc at exa.com
Tue Mar 11 16:45:10 UTC 2008


On Mar 11, 2008, at 10:13 AM, Bob Peterson wrote:

> On Mon, 2008-03-10 at 10:28 -0400, James Chamberlain wrote:
>> I have just had my cluster crash yet again, but this time, I was able
>> to capture the full kernel panic.
> <snip>
>> I'm experiencing upwards of 8 crashes a day because of this.  What
>> can I do about it?
>>
>> Thanks,
>>
>> James
>
> Hi James,
>
> The only times I've seen a problem like this is when GFS's resource
> group information somehow got corrupted.  I recommend doing this:
>
> 1. Unmount the file system from all nodes in your cluster

Is there an easy way to determine which filesystem(s) it is?  I have 13.
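One rough way to narrow that down, assuming the captured panic text carries
the "fsid=<cluster>:<fsname>" prefix that GFS normally prints with fatal
errors; the log path and anything it matches are placeholders, not taken
from the original report:

    # Pull the filesystem name out of the captured panic / syslog text
    # (assumes the console output landed in /var/log/messages).
    grep -o 'fsid=[^: ]*:[^.]*' /var/log/messages | sort -u

    # Cross-reference against the GFS filesystems mounted on this node.
    gfs_tool list
    grep gfs /proc/mounts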

> 2. Back up your storage in any way you can without it being mounted
>   (dd it to another storage or tape or something?)
> 3. Run gfs_fsck on the file system.  If this is > 15TB, make sure
>   you run it on a 64-bit node.
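A minimal sketch of those three steps for a single filesystem; the mount
point, logical volume, and backup paths are placeholders:

    # 1. On every node in the cluster, unmount the affected filesystem.
    umount /mnt/gfs01

    #    Double-check that no node still has it mounted before going on.
    grep gfs /proc/mounts

    # 2. Block-level backup to spare storage while it is unmounted.
    dd if=/dev/vg_san/lv_gfs01 of=/backup/lv_gfs01.img bs=4M

    # 3. Offline check/repair on a 64-bit node; -v is verbose, -y answers
    #    yes to repair prompts.
    gfs_fsck -v -y /dev/vg_san/lv_gfs01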

All nodes in this cluster are 64-bit.  Are there any guidelines on how  
much memory I should have in each node?  Right now, they each have 2 GB.

> Hopefully your system isn't too old and you have a relatively recent
> version of gfs_fsck, which has the smarts to repair damaged RGs.

gfs-utils-0.1.12-1.el5

> I'm just guessing about the corruption, but given that, the
> next question is how it got corrupted.  There are a number of ways
> that can happen.  For example hardware problems, or running gfs_fsck
> while the file system is mounted on some node.  BTW, I've only seen
> RG corruption two or three times in the past 2+ years.

Is there a way I can find out for sure whether it's resource group  
corruption before I run gfs_fsck?
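If the gfs_fsck shipped in gfs-utils supports a no-change pass (a -n switch
that answers "no" to every repair question is an assumption here; check
gfs_fsck -h first), a read-only run against the unmounted filesystem would
at least show whether resource group errors are reported before anything is
modified:

    # Read-only pass: -n answers "no" to every question, so nothing should
    # be written; errors (including RG damage) are only reported.
    # The -n flag is assumed -- confirm it with `gfs_fsck -h`.
    gfs_fsck -n -v /dev/vg_san/lv_gfs01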

I have only had this cluster set up since December, and I started
having problems with it not long after that.  At first, I was seeing a
crash a day, and then maybe one crash a week; however, I had a total of
47 reboots within the cluster yesterday.  I have also been somewhat
concerned about the high load average on each node where a service is
running.  For example, one node is serving 5 of those 13 filesystems,
and its load average commonly hovers between 35 and 55, which is where
it is right now.  On nodes that aren't running any services, the load
average is 0.
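A quick check worth noting here: on Linux the load average counts processes
in uninterruptible sleep (state D) as well as runnable ones, so a node whose
services are blocked on storage or cluster locks can show a very high load
with little actual CPU use.  A rough way to tell the two apart:

    # Processes stuck in uninterruptible sleep (state D); a long list with
    # low CPU usage points at I/O or lock waits rather than CPU load.
    ps -eo state,pid,comm | awk '$1 == "D"'

    # Compare against actual CPU utilisation.
    top -b -n 1 | head -20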

Thanks,

James
