[Linux-cluster] Kernel panic

James Chamberlain jamesc at exa.com
Tue Mar 11 16:45:10 UTC 2008


On Mar 11, 2008, at 10:13 AM, Bob Peterson wrote:

> On Mon, 2008-03-10 at 10:28 -0400, James Chamberlain wrote:
>> I have just had my cluster crash yet again, but this time, I was able
>> to capture the full kernel panic.
> <snip>
>> I'm experiencing upwards of 8 crashes a day because of this.  What
>> can I do about it?
>>
>> Thanks,
>>
>> James
>
> Hi James,
>
> The only times I've seen a problem like this is when GFS's resource
> group information somehow got corrupted.  I recommend doing this:
>
> 1. Unmount the file system from all nodes in your cluster

Is there an easy way to determine which filesystem(s) it is?  I have 13.
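One rough way to narrow that down, assuming the captured panic text carries
the "fsid=<cluster>:<fsname>" prefix that GFS normally prints with fatal
errors; the log path and anything it matches are placeholders, not taken
from the original report:

    # Pull the filesystem name out of the captured panic / syslog text
    # (assumes the console output landed in /var/log/messages).
    grep -o 'fsid=[^: ]*:[^.]*' /var/log/messages | sort -u

    # Cross-reference against the GFS filesystems mounted on this node.
    gfs_tool list
    grep gfs /proc/mounts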

> 2. Back up your storage in any way you can without it being mounted
>   (dd it to another storage or tape or something?)
> 3. Run gfs_fsck on the file system.  If this is > 15TB, make sure
>   you run it on a 64-bit node.
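A minimal sketch of those three steps for a single filesystem; the mount
point, logical volume, and backup paths are placeholders:

    # 1. On every node in the cluster, unmount the affected filesystem.
    umount /mnt/gfs01

    #    Double-check that no node still has it mounted before going on.
    grep gfs /proc/mounts

    # 2. Block-level backup to spare storage while it is unmounted.
    dd if=/dev/vg_san/lv_gfs01 of=/backup/lv_gfs01.img bs=4M

    # 3. Offline check/repair on a 64-bit node; -v is verbose, -y answers
    #    yes to repair prompts.
    gfs_fsck -v -y /dev/vg_san/lv_gfs01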

All nodes in this cluster are 64-bit.  Are there any guidelines on how  
much memory I should have in each node?  Right now, they each have 2 GB.

> Hopefully your system isn't too old and you have a relatively recent
> version of gfs_fsck, which has the smarts to repair damaged RGs.

gfs-utils-0.1.12-1.el5

> I'm just guessing about the corruption, but given that, the
> next question is how it got corrupted.  There are a number of ways
> that can happen.  For example hardware problems, or running gfs_fsck
> while the file system is mounted on some node.  BTW, I've only seen
> RG corruption two or three times in the past 2+ years.

Is there a way I can find out for sure whether it's resource group  
corruption before I run gfs_fsck?
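If the gfs_fsck shipped in gfs-utils supports a no-change pass (a -n switch
that answers "no" to every repair question is an assumption here; check
gfs_fsck -h first), a read-only run against the unmounted filesystem would
at least show whether resource group errors are reported before anything is
modified:

    # Read-only pass: -n answers "no" to every question, so nothing should
    # be written; errors (including RG damage) are only reported.
    # The -n flag is assumed -- confirm it with `gfs_fsck -h`.
    gfs_fsck -n -v /dev/vg_san/lv_gfs01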

I have only had this cluster set up since December, and I started
having problems with it not long after that.  At first, I was seeing a
crash a day, and then maybe one crash a week; however, I had a total of
47 reboots within the cluster yesterday.  I have also been somewhat
concerned about the high load average on each node where a service is
running.  For example, one node is serving 5 of those 13 filesystems,
and its load average commonly hovers between 35 and 55, which is where
it is right now.  On nodes that aren't running any services, the load
average is 0.
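A quick check worth noting here: on Linux the load average counts processes
in uninterruptible sleep (state D) as well as runnable ones, so a node whose
services are blocked on storage or cluster locks can show a very high load
with little actual CPU use.  A rough way to tell the two apart:

    # Processes stuck in uninterruptible sleep (state D); a long list with
    # low CPU usage points at I/O or lock waits rather than CPU load.
    ps -eo state,pid,comm | awk '$1 == "D"'

    # Compare against actual CPU utilisation.
    top -b -n 1 | head -20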

Thanks,

James
