[Linux-cluster] failed node causes all GFS systems to hang

David Teigland teigland at redhat.com
Thu Jun 9 03:04:10 UTC 2005


On Wed, Jun 08, 2005 at 05:46:26PM -0400, Dan B. Phung wrote:

> I think I'm doing something terribly wrong here, because if one of my
> nodes goes down, the rest of the nodes connected to GFS are hung in some
> wait state.  Specifically, only those nodes running fenced are hosed.
> These machines are not only blocked on the GFS file system, but the
> local file systems are hung as well, which requires me to reboot
> everybody connected to GFS.  I have one node not running fenced to reset
> the quorum status, so that doesn't seem to be the problem.  
> 
> I updated from the CVS sources -rRHEL4 last Friday, so I have up-to-date
> code.  I'm running kernel 2.6.9 and fence_manual.  I remember a couple
> of weeks back that when a node went down, I simply had to
> fence_ack_manual the node, but that message never comes up anymore...

The joys of manual fencing; we sometimes debate whether it's more
troublesome than helpful for people.

When a node fails, you need to run fence_ack_manual on one of the
remaining nodes; specifically, on whichever remaining node has a
fence_manual notice in /var/log/messages.  So you need to monitor
/var/log/messages on the remaining nodes to figure out where to run
fence_ack_manual (it will generally be the remaining node with the lowest
nodeid; see cman_tool nodes).
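
Roughly, the sequence looks like this (the node name "node2" below is just
a placeholder for the failed node, and the exact fence_ack_manual options
can vary by version; the log notice itself usually tells you what to run):

    # On each surviving node, look for the manual-fencing notice:
    grep fence_manual /var/log/messages

    # Check membership and node ids; the notice is normally on the
    # lowest-numbered surviving node:
    cman_tool nodes

    # On the node that logged the notice, after you have verified that
    # the failed node is really down (reset or powered off):
    fence_ack_manual -n node2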

If the failed node caused the cluster to lose quorum, then it's a
different story.  In that case you need to get some nodes back into your
cluster (cman_tool join) to regain quorum before any kind of fencing will
happen.
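
A minimal sketch of that step, assuming the cluster software is already
configured on the nodes you are bringing back up:

    # On each node that is not currently a cluster member, rejoin:
    cman_tool join

    # Watch the vote counts until the cluster is quorate again:
    cman_tool status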

GFS is going to be blocked everywhere until you run fence_ack_manual for
the failed node.  If there are no manual fencing notices anywhere for the
failed node, then maybe you lost quorum (see cman_tool status), or
something else is wrong.  I don't know why your local fs would be hung.
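
If there are no notices anywhere, checking quorum and the service group
states can show where recovery is stuck (a rough sketch; the output
format varies by version):

    # Is the cluster quorate?
    cman_tool status

    # The fence domain, DLM lock spaces and GFS mount groups; a group
    # sitting in a recovery or wait state points at fencing that has
    # not completed yet:
    cman_tool services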

Dave



