
[Linux-cluster] Node Failure Detection Problems



Hi,

I have some questions on configuring and tuning heartbeats and node-failure detection.

I have a 2-node cluster. Whenever a node fails, the failure seems to take a while to be detected.

First question: I have reduced the heartbeat hello_timer to 1 second and deadnode_timeout to 5 seconds. Is there an elegant way to do this in cluster.conf? Currently I'm setting /proc/cluster/config/cman/hello_timer with an init script hack.
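
For reference, the init script hack amounts to roughly the following. This is a minimal Python sketch, not the actual script: the function name set_cman_timers is made up for illustration, and it assumes deadnode_timeout lives in the same /proc/cluster/config/cman directory as hello_timer.

```python
import os

# Directory named in this post; it only exists once cman is running.
CMAN_CONFIG_DIR = "/proc/cluster/config/cman"


def set_cman_timers(settings, config_dir=CMAN_CONFIG_DIR):
    """Write each value to the file of the same name under config_dir.

    settings maps a tunable name (e.g. "hello_timer") to its value in seconds.
    Returns the list of names written, in order.
    Assumption: every tunable is a plain file that accepts a decimal number.
    """
    written = []
    for name, value in settings.items():
        path = os.path.join(config_dir, name)
        with open(path, "w") as handle:
            handle.write(f"{value}\n")
        written.append(name)
    return written
```

Run as root once cman is up, set_cman_timers({"hello_timer": 1, "deadnode_timeout": 5}) would apply the two values mentioned above. The config_dir parameter is only there so the function can be tried against a scratch directory first.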

Failure is detected by cman within 5 seconds, no problem, but clustat hangs during this time.

Second question: clustat continues to hang for around 10 more seconds (15 in total) before clurgmgrd does a state change.

Does anyone know where this additional 10 seconds comes from? Is it configurable?

Here is the system log for the transition:
>>>
Mar 19 21:01:33 firthy kernel: CMAN: removing node emsy from the cluster : Missed too many heartbeats
Mar 19 21:01:33 firthy fenced[1878]: emsy not a cluster member after 0 sec post_fail_delay
Mar 19 21:01:33 firthy fenced[1878]: fencing node "emsy"
Mar 19 21:01:35 firthy fenced[1878]: fence "emsy" success
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> Magma Event: Membership Change
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> State change: emsy DOWN
<<<
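
To make the timing in that log explicit, here is a small Python sketch that prints how many seconds after the first entry each event occurs. The function name seconds_since_first is made up for illustration, and it assumes all entries fall on the same day, so only the time-of-day field is compared.

```python
from datetime import datetime

# A subset of the log lines quoted above.
LOG = """\
Mar 19 21:01:33 firthy kernel: CMAN: removing node emsy from the cluster : Missed too many heartbeats
Mar 19 21:01:33 firthy fenced[1878]: fencing node "emsy"
Mar 19 21:01:35 firthy fenced[1878]: fence "emsy" success
Mar 19 21:01:44 firthy clurgmgrd[3347]: <info> State change: emsy DOWN
"""


def seconds_since_first(log_text):
    """Return (offset_seconds, message) pairs relative to the first log line."""
    events = []
    for line in log_text.splitlines():
        # Fields: month, day, time, host, then the rest of the message.
        fields = line.split(None, 4)
        stamp = datetime.strptime(fields[2], "%H:%M:%S")
        events.append((stamp, fields[4]))
    start = events[0][0]
    return [(int((t - start).total_seconds()), msg) for t, msg in events]


for offset, message in seconds_since_first(LOG):
    print(f"+{offset:2d}s  {message}")
```

On these lines the offsets are 0, 0, 2 and 11 seconds: fencing succeeds 2 seconds after cman removes the node, and clurgmgrd reports the state change 9 seconds after that.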

Many thanks,
James Firth


