[Linux-cluster] fence_manual node failure clarification
Dan B. Phung
phung at cs.columbia.edu
Thu May 12 19:07:15 UTC 2005
My question concerns node failures when using fence_manual.
From 'man fenced':
Node failure
When a domain member fails, the actual fencing must be completed before
GFS recovery can begin. This means any delay in carrying out the
fencing operation will also delay the completion of GFS file system
operations; most file system operations will hang during this period.
So this is what I'm seeing now when a node fails, i.e. the rest of the
nodes notice that the heartbeats of a certain node A have timed out. Node A
is fenced by the remaining nodes, and the file system hangs. My
questions are:
1) Can I call fence_ack_manual as soon as I see that node A is fenced, or
do I have to wait for node A to reboot, come back, and join the cluster?
2) If post_fail_delay is set to -1, the fence daemon waits indefinitely
for the failed node to rejoin the cluster. That is what it seems to be
doing here, so is -1 the default? The man page shows:
<fence_daemon post_fail_delay="0">
Assuming the delay is 0, I expected the node to be fenced immediately on
timeout, recovery to begin and complete, and the file system to become
usable again on the remaining nodes within a relatively short time.
If the answer to 1) is that recovery has to be acknowledged manually with
fence_ack_manual, then it all makes sense.
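For context, here is a minimal sketch of where the man-page line might sit in a cluster.conf that uses fence_manual. The cluster name "example" and the device name "human" are placeholders chosen for illustration, and the per-node fence sections are left out because their attributes may differ between versions.

```xml
<!-- Illustrative fragment only: names are placeholders, not a tested config -->
<cluster name="example" config_version="1">

  <!-- post_fail_delay="0" is the value shown in 'man fenced' -->
  <fence_daemon post_fail_delay="0"/>

  <!-- clusternodes section omitted: each node would reference the
       device below from its own fence method -->

  <fencedevices>
    <!-- fence_manual as the fencing agent -->
    <fencedevice name="human" agent="fence_manual"/>
  </fencedevices>

</cluster>
```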
thanks,
dan