
Re: [Linux-cluster] Kernel panic



On Mar 11, 2008, at 10:13 AM, Bob Peterson wrote:

On Mon, 2008-03-10 at 10:28 -0400, James Chamberlain wrote:
I have just had my cluster crash yet again, but this time, I was able to
capture the full kernel panic.
<snip>
I'm experiencing upwards of 8 crashes a day because of this. What can I do
about it?

Thanks,

James

Hi James,

The only times I've seen a problem like this were when GFS's resource
group information somehow got corrupted.  I recommend doing this:

1. Unmount the file system from all nodes in your cluster

Is there an easy way to determine which filesystem(s) it is?  I have 13.

2. Back up your storage in any way you can while it is unmounted
  (for example, dd it to another storage device or to tape)
3. Run gfs_fsck on the file system.  If this is > 15TB, make sure
  you run it on a 64-bit node.
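
The three steps above can be sketched as a small pre-check, offered only as an illustration under stated assumptions. It assumes Linux's /proc/mounts layout (device, mount point, filesystem type, ...) and that GFS mounts show the type "gfs". The function names, device paths and backup path are hypothetical, and the check only covers the node it runs on, so it would have to be repeated on every node in the cluster.

```python
def gfs_mounts(mounts_text):
    """Return (device, mount point) pairs for GFS entries in /proc/mounts-style text."""
    found = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] == "gfs":
            found.append((fields[0], fields[1]))
    return found


def repair_plan(device, backup_path, mounts_text):
    """Return the backup and check commands as argument lists.

    Raises RuntimeError if the device is still mounted on this node.
    Caveat: the comparison is by exact device name, so an alias such as a
    /dev/mapper path for the same volume would not be detected.
    """
    mounted = [mp for dev, mp in gfs_mounts(mounts_text) if dev == device]
    if mounted:
        raise RuntimeError(f"{device} is still mounted at {mounted[0]}")
    return [
        # Step 2: raw copy of the unmounted device (bs=1M is an arbitrary choice)
        ["dd", f"if={device}", f"of={backup_path}", "bs=1M"],
        # Step 3: check the file system, ideally on a 64-bit node
        ["gfs_fsck", device],
    ]
```

On a real node, mounts_text would come from reading /proc/mounts, and the returned lists could be reviewed by hand before running anything.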

All nodes in this cluster are 64-bit. Are there any guidelines on how much memory I should have in each node? Right now, they each have 2 GB.

Hopefully your system isn't too old and you have a relatively recent
version of gfs_fsck, which has the smarts to repair damaged RGs.

gfs-utils-0.1.12-1.el5

I'm just guessing about the corruption, but if that is the cause, the
next question is how it got corrupted.  There are a number of ways
that can happen, for example hardware problems, or running gfs_fsck
while the file system is mounted on some node.  BTW, I've only seen
RG corruption two or three times in the past 2+ years.

Is there a way I can find out for sure whether it's resource group corruption before I run gfs_fsck?

I have only had this cluster set up since December, and I started having problems with it not long after that. At first I was seeing a crash a day, and later maybe one crash a week; yesterday, however, I had a total of 47 reboots within the cluster.

I have also been somewhat concerned about the high load average on each node where a service is running. For example, one node is serving 5 of those 13 filesystems, and its load average commonly hovers between 35 and 55, as it does right now. On nodes that aren't running any services, the load average is 0.
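
To put load figures like these in context, here is a minimal sketch that reads the three load averages from /proc/loadavg-style text and scales one of them by the CPU count. The function names and the sample values are hypothetical, and the CPU count of these nodes is not stated anywhere in this thread. Note that on Linux the load average also counts tasks in uninterruptible sleep (typically waiting on I/O), so a high value does not by itself mean the CPUs are busy.

```python
def parse_loadavg(text):
    """Return the 1-, 5- and 15-minute load averages from /proc/loadavg-style text."""
    # /proc/loadavg looks like: "0.20 0.18 0.12 1/80 11206"
    one, five, fifteen = (float(x) for x in text.split()[:3])
    return one, five, fifteen


def load_per_cpu(load, cpu_count):
    """Scale a load average by the number of CPUs on the node."""
    if cpu_count < 1:
        raise ValueError("cpu_count must be at least 1")
    return load / cpu_count
```

On a node, the text would come from reading /proc/loadavg and the CPU count from os.cpu_count().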

Thanks,

James

