[Linux-cluster] fence_manual node failure clarification

Dan B. Phung phung at cs.columbia.edu
Thu May 12 19:07:15 UTC 2005


My question is about node failures when using fence_manual.
From 'man fenced':

  Node failure
  When a domain member fails, the actual fencing must be completed before
  GFS recovery can begin.  This means any delay in carrying out the 
  fencing operation will also delay the completion of GFS file system
  operations; most file system operations will hang during this period.

So this is what I'm seeing now when a node fails, i.e. the rest of the
nodes notice that the heartbeats of a certain node A have timed out. Node A
is fenced by the remaining nodes, and the file system is hung.  My
questions are:

1) can I call fence_ack_manual right when I see that node A is fenced, or
do I have to wait for node A to reboot, come back, and join the cluster?

2) If I set post_fail_delay to -1, the fence daemon waits indefinitely
for the failed node to rejoin the cluster. That is what it seems to be
doing now, so is -1 the default?  The man page shows:
  <fence_daemon post_fail_delay="0">

So, assuming the delay is 0, I expected the node to be fenced instantly
on timeout, recovery to begin and complete, and the file system to be
usable on the rest of the nodes within a relatively short time.
I guess if the answer to 1) is that this recovery has to be completed
manually with fence_ack_manual, then it all makes sense.

thanks,
dan
