[Linux-cluster] Graceful Degradation


I've got most of my cluster sorted out, apart from removing nodes from the cluster when they fail.

Is there a way to automate the node-kicking? I have 4 nodes sharing 2 GFS file systems: a root FS and a data FS. If I pull the network cable from one of them, or just power it off, the remaining cluster nodes stop responding. The only way to get them responding again is to bring the missing node back, even though there are still enough nodes to maintain quorum (3 nodes out of 4).
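
To spell out the quorum arithmetic behind the "3 out of 4" figure, here is a minimal sketch. It assumes one vote per node and a simple majority rule (more than half of the expected votes); the function names are purely illustrative and not part of any cluster tool.

```python
# Sketch of the majority-quorum arithmetic, assuming one vote per node.
# quorum_threshold and has_quorum are hypothetical helper names.

def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority."""
    return expected_votes // 2 + 1


def has_quorum(live_votes: int, expected_votes: int) -> bool:
    """True if the surviving nodes still hold a majority of votes."""
    return live_votes >= quorum_threshold(expected_votes)


# My case: 4 nodes, 1 vote each (assumed), one node lost.
for live in (4, 3, 2):
    print(live, "live nodes, quorate:", has_quorum(live, 4))
```

By this reckoning, losing one node out of four should leave the cluster quorate, so the hang does not look like a loss of quorum.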

Can anyone suggest a way around this? How can I make the 3 remaining nodes kick the missing node out of the cluster and the DLM group (possibly after some timeout, e.g. 10 seconds) and resume operation until the node rejoins?

This may or may not be related to the fact that I'm running a shared GFS root, but any pointers would be welcome.



