[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Linux-cluster] GFS failure



Anthony wrote:
Hello,

yesterday we had a full GFS system failure:
all partitions were inaccessible from all 32 nodes,
and the whole cluster is still inaccessible now.
Has anyone seen this problem before?


GFS: Trying to join cluster "lock_gulm", "gen:ir"
GFS: fsid=gen:ir.32: Joined cluster. Now mounting FS...
GFS: fsid=gen:ir.32: jid=32: Trying to acquire journal lock...
GFS: fsid=gen:ir.32: jid=32: Looking at journal...
GFS: fsid=gen:ir.32: jid=32: Done

NETDEV WATCHDOG: jnet0: transmit timed out
ipmi_kcs_sm: kcs hosed: Not in read state for error2
NETDEV WATCHDOG: jnet0: transmit timed out
ipmi_kcs_sm: kcs hosed: Not in read state for error2

GFS: fsid=gen:ir.32: fatal: filesystem consistency error
GFS: fsid=gen:ir.32:   function = trans_go_xmote_bh
GFS: fsid=gen:ir.32: file = /usr/src/build/626614-x86_64/BUILD/gfs-kernel-2.6.9-42/smp/src/gfs/glops.c, line = 542
GFS: fsid=gen:ir.32:   time = 1150223491
GFS: fsid=gen:ir.32: about to withdraw from the cluster
GFS: fsid=gen:ir.32: waiting for outstanding I/O
GFS: fsid=gen:ir.32: telling LM to withdraw
Hi Anthony,

This problem could be caused by a couple of things. Basically, it indicates that a filesystem consistency error occurred. In this particular case, a write was done to the file system under a transaction lock, but after the write transaction completed, the journal for the written data was found to be still in use. That means one of two things:

Either (1) some process was writing to the GFS journal when it shouldn't have been (i.e. without the necessary lock), or (2) the journal data was somehow corrupted on disk.

In the past, we've often tracked such problems down to hardware failures. In other words, even with the GFS file system out of the loop entirely, if you use a command like 'dd' to send data to the raw hard disk device and then use dd to retrieve it, the data comes back different from what was written out. That particular scenario is documented as bugzilla bug 175589.

I'm not saying that is your problem, but I'm saying that's what we've seen in the past.

My recommendation is to read the bugzilla, back up your entire file system (or copy it to a different set of drives), and then run the hardware tests described in the bugzilla to see whether your hardware can consistently write data, read it back, and get a match between what was written and what was read. Do this test without GFS in the picture at all, and ideally with only one node accessing that storage at a time.
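The write/read/verify check can be sketched roughly like this. Everything below is a placeholder illustration, not a procedure from the bugzilla: the scratch file stands in for the raw device so the commands are safe to copy. Only point DEV at a real device (e.g. /dev/sdb) after you have backed up everything on it, since writing to the raw device destroys its contents.

```shell
#!/bin/sh
# Sketch of a write/read/verify test for the storage path.
# DEV is a scratch file here; for a real hardware test, substitute the
# raw device under test (DESTRUCTIVE -- back up first) and consider
# adding iflag=direct on the read-back to bypass the page cache.
DEV=/tmp/scratch.img

# Generate 16 MB of pseudo-random test data.
dd if=/dev/urandom of=/tmp/testdata bs=1M count=16 2>/dev/null

# Write it out and force it to stable storage, then read it back.
dd if=/tmp/testdata of="$DEV" bs=1M conv=fsync 2>/dev/null
dd if="$DEV" of=/tmp/readback bs=1M count=16 2>/dev/null

# Compare: any difference points at the storage path, not GFS.
if cmp -s /tmp/testdata /tmp/readback; then
    echo "data matches"
else
    echo "MISMATCH -- suspect hardware"
fi
```

Repeating the loop many times, and with larger sizes, makes an intermittent fault more likely to show up.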

You will probably also want to run gfs_fsck before mounting again to check the consistency of the file system, just in case a rogue process on one of the nodes was doing something destructive.

WARNING: overwriting your GFS file system will of course destroy what was there, so be careful not to lose your data: make a copy before running any write tests.

If the hardware checks out 100% and you can recreate the failure, open a bugzilla against GFS and we'll go from there. In other words, we don't know of any problems in GFS itself, beyond hardware problems, that can cause this.

I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite

