Re: [Linux-cluster] Kernel panic

On Mon, 2008-03-10 at 10:28 -0400, James Chamberlain wrote:
> I have just had my cluster crash yet again, but this time, I was able to 
> capture the full kernel panic.
> I'm experiencing upwards of 8 crashes a day because of this.  What can I do 
> about it?
> Thanks,
> James

Hi James,

The only times I've seen a problem like this were when GFS's resource
group (RG) information had somehow been corrupted.  I recommend doing
this:

1. Unmount the file system from all nodes in your cluster
2. Back up your storage any way you can without mounting it
   (for example, dd it to another storage device or to tape)
3. Run gfs_fsck on the file system.  If the file system is larger
   than 15TB, make sure you run it on a 64-bit node.
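
In case it helps to see the three steps laid out in order, here is a
minimal Python sketch that only builds the command lines and prints them;
it does not run anything.  The device path, mount point, and backup
target are made-up placeholders, the function names are mine, and the
dd block size is just an arbitrary choice, so treat this as an outline
under those assumptions rather than a tested procedure.  The 15TB
threshold is the one from step 3, with 1TB taken as 2**40 bytes.

```python
import sys

# Threshold from step 3 above; 1 TB is assumed to mean 2**40 bytes here.
LARGE_FS_BYTES = 15 * 2**40


def needs_64bit_node(fs_size_bytes):
    """True when the file system is larger than the 15TB threshold."""
    return fs_size_bytes > LARGE_FS_BYTES


def running_on_64bit():
    """True when this Python interpreter is a 64-bit build."""
    return sys.maxsize > 2**32


def build_repair_plan(device, mount_point, backup_target):
    """Return the three steps as argv lists, in order, without running them."""
    return [
        # Step 1: unmount (this has to be repeated on every node).
        ["umount", mount_point],
        # Step 2: raw copy of the unmounted device; bs=1M is an arbitrary choice.
        ["dd", "if=" + device, "of=" + backup_target, "bs=1M"],
        # Step 3: check and repair; no options are passed in this sketch.
        ["gfs_fsck", device],
    ]


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    plan = build_repair_plan("/dev/vg_cluster/lv_gfs", "/mnt/gfs", "/backup/gfs.img")
    for argv in plan:
        print(" ".join(argv))
    if not running_on_64bit():
        print("note: this interpreter is 32-bit; see step 3 for large file systems")
```

Printing the plan first gives you a chance to double-check the device
name before anything touches the storage.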

Hopefully your system isn't too old and you have a relatively recent
version of gfs_fsck, which has the smarts to repair damaged RGs.

The corruption is only a guess on my part, but if that is what
happened, the next question is how it got corrupted.  There are a
number of ways that can happen: hardware problems, for example, or
running gfs_fsck while the file system is mounted on some node.
BTW, I've only seen RG corruption two or three times in the past
2+ years.
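
Since running gfs_fsck against a mounted file system is one of the ways
to cause this, a small guard before the fsck may be worth having.  The
sketch below (function names are mine, and the sample device is a
placeholder) reads /proc/mounts-style text and reports where a device is
mounted.  It only sees the node it runs on and only matches the exact
device path as listed, so it would have to be run on every node, and a
device listed under a different alias would be missed.

```python
def mounted_entries(mounts_text, device):
    """Return (mount_point, fs_type) pairs for `device` in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Each line reads: device mount_point fs_type options dump pass
        if len(fields) >= 3 and fields[0] == device:
            found.append((fields[1], fields[2]))
    return found


def safe_to_fsck_locally(device, mounts_path="/proc/mounts"):
    """True when `device` has no entry in the local mount table."""
    with open(mounts_path) as handle:
        return not mounted_entries(handle.read(), device)


if __name__ == "__main__":
    # Hypothetical device name, for illustration only.
    device = "/dev/vg_cluster/lv_gfs"
    if safe_to_fsck_locally(device):
        print(device + " is not mounted on this node")
    else:
        print(device + " is still mounted here; unmount it before gfs_fsck")
```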


Bob Peterson
Red Hat GFS

