[Linux-cluster] RE: Cluster Suite 4 failover problem

Hi Jeff & Lon,

Thanks for the reply.

Regarding the services that did not fail over (the status just displayed "Owner --> unknown" and "State --> started", but in fact none of the services were available): I checked the log and agree that fence_manual is the cause. The log shows that fence_manual was waiting for node2 to rejoin the cluster. As soon as I ran fence_ack_manual -n node2, the failed services failed over to node1 and all of them returned to normal.

I would like to know whether there is any solution or workaround for this situation other than buying a fence device :) Can I remove fence.rpm? Would that cause any other problems? I ask because in a production environment we never know when a machine will go down, so we cannot always run fence_ack_manual immediately.

kernel: CMAN: removing node node2 from the cluster : Missed too many heartbeats
fenced[2447]: node2 not a cluster member after 0 sec post_fail_delay
fenced[2447]: fencing node "node2"
fence_manual: Node node2 needs to be reset before recovery can procede. Waiting for node2 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n node2)
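
Since nobody may be watching the console when this happens, here is a minimal sketch of how the wait state could at least be detected from the log so that an operator gets told. It only matches the fence_manual message text shown above; the function name nodes_awaiting_ack is my own, and reading the log from stdin is an assumption. It deliberately does not run fence_ack_manual itself, because the node has to be confirmed down first.

```python
import re
import sys

# Matches the fence_manual wait message exactly as it appears in the log above.
WAIT_RE = re.compile(r"fence_manual: Node (\S+) needs to be reset")


def nodes_awaiting_ack(log_lines):
    """Return node names (in order of first appearance) that fence_manual is waiting on."""
    seen = []
    for line in log_lines:
        match = WAIT_RE.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


if __name__ == "__main__":
    # Assumption: syslog lines are piped in on stdin.
    for node in nodes_awaiting_ack(sys.stdin):
        print("manual fence pending for %s; once it is confirmed reset, "
              "run: fence_ack_manual -n %s" % (node, node))
```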

Regarding the monitor_link issue: I set monitor_link="1" for both IP resources, i.e. and , then shut down eth0 on node2 and re-enabled it. When I then tried to restart rgmanager on node2 (the failed node), it still showed the message "Shutting down Cluster Service Manager... Waiting for services to stop: ", and I had to kill the rgmanager processes or, even worse, reset the machine. Any ideas?

One more thing: even with monitor_link=0 in cluster.conf, the Monitor Link box under system-config-cluster --> Resource --> IP address is still ticked. Why?
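
To rule out a mistake on my side, this is a small sketch for checking what the file itself says, independent of the GUI. It assumes the IP resources are <ip> elements carrying address and monitor_link attributes; the SAMPLE text and its 192.0.2.x addresses are made up for illustration and are not my real configuration, and the function name ip_monitor_link is my own.

```python
import xml.etree.ElementTree as ET

# Hypothetical cluster.conf fragment; addresses are placeholders, not real ones.
SAMPLE = """<cluster name="example" config_version="1">
  <rm>
    <resources>
      <ip address="192.0.2.10" monitor_link="0"/>
      <ip address="192.0.2.11" monitor_link="1"/>
    </resources>
  </rm>
</cluster>"""


def ip_monitor_link(xml_text):
    """Map each <ip address=...> to its monitor_link value (None if the attribute is absent)."""
    root = ET.fromstring(xml_text)
    return {
        ip.get("address"): ip.get("monitor_link")
        for ip in root.iter("ip")
        if ip.get("address")  # skip <ip ref=...> references that carry no address
    }


if __name__ == "__main__":
    for address, value in sorted(ip_monitor_link(SAMPLE).items()):
        print("%s monitor_link=%s" % (address, value))
```

On a real node the text would be read from the cluster.conf file instead of SAMPLE; if the file shows 0 while the GUI box is ticked, the mismatch is in how the GUI displays the value.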

Many thanks,

