[Linux-cluster] Fwd: GFS volume hangs on 3 nodes after gfs_grow

Bob Peterson rpeterso at redhat.com
Fri Sep 26 15:33:59 UTC 2008


----- "Alan A" <alan.zg at gmail.com> wrote:
| This is worse than I thought. The entire cluster is hanging upon a
| restart command issued from the Conga - lucy box. I tried bringing
| the gfs service down on node2 (lucy) with the command: service gfs
| stop (we are not running rgmanager), and I got:
| FATAL: Module gfs is in use.

Hi Alan,

It sounds like Conga can't reboot the cluster because the GFS file
system is still mounted, or is in use.  I don't know much about Conga,
so forgive my ignorance there.  You may need to unmount the GFS file
system before you reboot.  The dmesg output you sent looked perfectly
normal to me; those are ordinary openais messages.  I'm more interested
in whether there were any "file system withdrawn" messages, general
protection faults, kernel panics, or other serious kernel errors on any
of the nodes in the cluster around the time of the first failure.
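A quick check on each node might look something like this (a rough
sketch; adjust the patterns to taste):

  # Look for GFS withdrawals, oopses or panics in the kernel log
  dmesg | grep -iE 'withdraw|oops|general protection|panic'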

This is just a wild guess, but I suspect some kind of error, like a
kernel panic, occurred a while back.  That caused the node to be
fenced.  Perhaps the SCSI fencing locked up the device somehow, so
none of the nodes can use it.  If that's the case, you should be able
to log in to each of the nodes, manually unmount the GFS file systems
that are mounted, and then reboot them.  If it doesn't let you
unmount them, it might be because some process is still using the GFS
file system.  For example, if you're exporting the GFS file system
over NFS, you probably need to do "service nfs stop" before it will
let you unmount the GFS file system, then reboot.
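A rough sequence might look like this (/mnt/gfs is just a placeholder
here; substitute your actual mount point):

  # Stop anything that may still be holding the GFS mount; NFS is
  # just the most common example
  service nfs stop

  # Show which processes still have files open on that file system
  fuser -vm /mnt/gfs

  # Unmount the file system, stop the gfs service, then reboot
  umount /mnt/gfs
  service gfs stop
  reboot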

So I would comb through /var/log/messages on each node, looking for
messages about the node being fenced or withdrawn, panics, SCSI
errors, or any other serious errors that occurred around the time
when you first had the problem.
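Something along these lines can narrow the search (the pattern list
is only a starting point; add whatever else looks relevant):

  # On each node, pull out fencing, withdraw, panic and SCSI messages
  grep -iE 'fence|withdraw|panic|oops|scsi' /var/log/messages | less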

Regards,

Bob Peterson
Red Hat Clustering & GFS



