[Linux-cluster] Node Failure Detection Problems

Sun Mar 19 21:07:03 UTC 2006

Hi,

I have some questions on configuring and tuning heartbeats and 
node-failure detection.

I have a 2-node cluster.  Whenever a node fails, it seems to take a 
while for the failure to be detected.

First question: I have reduced the heartbeat hello_timer to 1 second and 
deadnode_timeout to 5 seconds.  Is there an elegant way to set these in 
cluster.conf?  Currently I'm setting 
/proc/cluster/config/cman/hello_timer with an init script hack.
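
A hack of this kind could be sketched roughly as below. This is only an 
illustration, written in Python rather than as a shell init script.  It 
assumes deadnode_timeout sits in the same /proc directory as hello_timer 
(only the hello_timer path is given above), and set_cman_timers is a 
hypothetical helper name.

```python
from pathlib import Path

# Directory taken from the hello_timer path quoted above.
CMAN_CONFIG_DIR = Path("/proc/cluster/config/cman")


def set_cman_timers(hello_timer, deadnode_timeout, config_dir=CMAN_CONFIG_DIR):
    """Write the two heartbeat settings (in seconds) into the cman config dir.

    Assumption: deadnode_timeout is exposed next to hello_timer. Only the
    hello_timer path is confirmed by the post.
    """
    config_dir = Path(config_dir)
    settings = {
        "hello_timer": hello_timer,
        "deadnode_timeout": deadnode_timeout,
    }
    for name, seconds in settings.items():
        # Each tunable is a single file holding one integer value.
        (config_dir / name).write_text(f"{int(seconds)}\n")


# Example call for the values used here (needs root and a running cman):
# set_cman_timers(hello_timer=1, deadnode_timeout=5)
```

The config_dir parameter exists only so the sketch can be tried against 
a scratch directory instead of the live /proc tree.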

Failure is detected by cman within 5 seconds, no problem, but clustat 
hangs during this time.

Second question: clustat continues to hang for around 10 more seconds 
(about 15 in total) before clurgmgrd does a state change.

Does anyone know where this additional 10 seconds comes from?  Is it 
configurable?

Here is the system log for the transition:
>>>
Mar 19 21:01:33 firthy kernel: CMAN: removing node emsy from the cluster : Missed too many heartbeats
Mar 19 21:01:33 firthy fenced[1878]: emsy not a cluster member after 0 sec post_fail_delay
Mar 19 21:01:33 firthy fenced[1878]: fencing node "emsy"
Mar 19 21:01:35 firthy fenced[1878]: fence "emsy" success
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> Magma Event: Membership Change
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> State change: emsy DOWN
<<<
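
To make the gaps in that log explicit, here is a small sketch that 
parses the timestamps and prints the intervals between the cman removal, 
the fence success and the clurgmgrd state change.  The year 2006 is 
taken from the date of this post, and the helper names are made up for 
illustration.

```python
from datetime import datetime

# Log excerpt from the transition above, with wrapped lines rejoined.
LOG = """\
Mar 19 21:01:33 firthy kernel: CMAN: removing node emsy from the cluster : Missed too many heartbeats
Mar 19 21:01:33 firthy fenced[1878]: emsy not a cluster member after 0 sec post_fail_delay
Mar 19 21:01:33 firthy fenced[1878]: fencing node "emsy"
Mar 19 21:01:35 firthy fenced[1878]: fence "emsy" success
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> Magma Event: Membership Change
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> State change: emsy DOWN
"""


def timestamp(line, year=2006):
    # syslog lines carry no year, so one is supplied explicitly.
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def first_time(log, needle):
    # Timestamp of the first log line containing the given text.
    for line in log.splitlines():
        if needle in line:
            return timestamp(line)
    raise ValueError(f"no log line contains {needle!r}")


removed = first_time(LOG, "CMAN: removing node")
fenced = first_time(LOG, 'fence "emsy" success')
down = first_time(LOG, "State change: emsy DOWN")

removal_to_fence = int((fenced - removed).total_seconds())
fence_to_down = int((down - fenced).total_seconds())
removal_to_down = int((down - removed).total_seconds())

print(f"cman removal  -> fence success: {removal_to_fence} s")  # 2 s
print(f"fence success -> state change:  {fence_to_down} s")     # 9 s
print(f"cman removal  -> state change:  {removal_to_down} s")   # 11 s
```

So of the extra delay after cman removes the node, about 2 seconds are 
spent fencing and about 9 seconds pass between the fence success and the 
clurgmgrd state change.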

Many thanks,
James Firth