[Linux-cluster] Quorum disks and two-node clusters

Josef Whiter jwhiter at redhat.com
Wed Oct 18 12:10:51 UTC 2006


On Wed, Oct 18, 2006 at 09:32:08AM +0200, Pena, Francisco Javier wrote:
> Hi Lon,
> 
> After doing some additional checks with my test environment, I think I was too fast in assuming this would be absolutely required. I assumed that having the failed node reboot itself would eliminate the need to fence that node, but it looks like this is not the case. Which makes me wonder, why do we want the server to reboot itself, if it is going to be fenced anyway?
>

In the case that you have a separate network for the heartbeat interface: if a
node loses its heartbeat connection, it will assume the rest of the cluster is
down and will go down swinging, so you run the risk of having good nodes fenced.
With qdisk, if you have the heuristics set up correctly, the node that has lost
its heartbeat connection will recognize that it is the bad one and reboot
itself, without trying to fence the other good nodes.
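As a sketch, a qdisk heuristic for this scenario is usually a ping against a
router or other host on the heartbeat network, configured in the quorumd
section of cluster.conf. The addresses, scores, and intervals below are
placeholders, not values from this thread:

```xml
<!-- Hypothetical example: exact attributes may differ by RHCS version -->
<quorumd interval="1" tko="10" votes="1" label="qdisk">
    <!-- A node that cannot ping the heartbeat-network router loses
         this heuristic's score, drops below min_score, and reboots
         itself instead of trying to fence healthy nodes. -->
    <heuristic program="ping -c1 -t1 192.168.1.254"
               score="1" interval="2"/>
</quorumd>
```

The idea is that the heuristic tests the same path the heartbeat uses, so
losing it identifies the isolated node as the one that should go down.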
 
> If we could avoid fencing the failed node, we would be able to solve some problems I found with iLO fencing: if a node loses power completely, the iLO card will not work, so we will never be able to fence the failed node, and the whole cluster will be stopped. If we can assume that an inquorate node will immediately reboot, we might continue working without any manual interaction.
> 

This is a problem.  You may want to consider adding a second fence level with
manual fencing.  If you cannot connect to the iLO interface, chances are the box
is truly down and you don't have to worry about the implications of using manual
fencing.  It will allow the cluster to continue working; you will just have to
remember to run fence_ack_manual on the node that did the fencing.
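For illustration, a two-level fence setup in cluster.conf might look like the
following. The hostnames, logins, and device names are made up for this
sketch, and attribute names can vary between RHCS releases:

```xml
<!-- Hypothetical example: method "1" (iLO) is tried first; if the iLO
     card is unreachable (e.g. total power loss), fenced falls through
     to method "2" (manual), which blocks until an admin acknowledges. -->
<clusternode name="node1" votes="1">
    <fence>
        <method name="1">
            <device name="ilo-node1"/>
        </method>
        <method name="2">
            <device name="manual" nodename="node1"/>
        </method>
    </fence>
</clusternode>
...
<fencedevices>
    <fencedevice agent="fence_ilo" name="ilo-node1"
                 hostname="ilo-node1.example.com"
                 login="admin" passwd="secret"/>
    <fencedevice agent="fence_manual" name="manual"/>
</fencedevices>
```

After verifying the node really is down, you acknowledge the manual fence on
the surviving node with something like fence_ack_manual -n node1 (exact
syntax depends on your version), and the cluster resumes.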

Josef



