[Linux-cluster] What if the fence device doesn't work?

Mon Nov 27 21:49:48 UTC 2006

On Tue, 2006-11-21 at 08:59 +0200, Janne Peltonen wrote:
> Hi!
> 
> I started wondering what happens if my fence device is broken. The
> scenario:
> 
>  -a node (running a service) fails
>  -another node notices the lost heartbeats and tries to fence the failed
>  node
>  -however, the fence device doesn't respond
>  -...what now?

Fencing retries forever.  You can build redundant fencing if you're
worried about it.
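
As a rough illustration of "retries forever" combined with redundant
fencing, here is a minimal Python sketch. It is not the actual fence
daemon code: fence_node, the (name, agent) pairs and the max_rounds
parameter are all hypothetical names made up for this example, and
max_rounds exists only so the loop can be exercised in a test. With
max_rounds=None the loop never gives up, which matches the behaviour
described above.

```python
import itertools
import time


def fence_node(node, methods, retry_delay=0.0, max_rounds=None):
    """Try each fencing method in order, repeating until one succeeds.

    methods is a list of (name, agent) pairs; agent(node) returns True
    when the node is confirmed fenced. All names here are hypothetical.
    Returns (method_name, round_number) on success, or (None, max_rounds)
    if a round limit was given and exhausted.
    """
    if max_rounds is None:
        rounds = itertools.count(1)          # retry forever
    else:
        rounds = range(1, max_rounds + 1)    # bounded, for testing only

    for attempt in rounds:
        for name, agent in methods:
            try:
                if agent(node):
                    return name, attempt
            except Exception:
                # An unreachable fence device counts as a failed attempt;
                # fall through to the next (redundant) method.
                pass
        time.sleep(retry_delay)

    return None, max_rounds
```

With a single method whose device never responds, the unbounded loop
simply spins, and the service stays blocked. A second, independent
method gives the loop something else to succeed with.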

> Not good. If the active node fails, and the fence device fails at the
> same time - for example, if the active node is a Xen guest and the host
> Xen fails, or if the active node loses power because the network power
> switch fails or because the iLO gets confused - the service is lost.
> The Xen scenario doesn't even seem too far-fetched...

Except for VMs (see below), this is an unrecoverable double failure,
because the cluster cannot be certain of the cause.  For example, if
your power switch loses power, it looks exactly the same to the cluster
as unplugging the network cables to both the node and the power switch.

We solve the virtual machine situation by:

(a) requiring that the host nodes where the VM cluster resides be
members of a cluster and have fencing of their own, and
(b) storing the last-known location of the VM in an AIS checkpoint

If the VM crashes, we simply ask the host cluster to fence the VM.  The
owner of the VM responds, and issues the equivalent of 'xm destroy'.

If the physical node has crashed, the physical cluster will notice the
physical node has crashed and kill that node.  When a fencing request
comes in for a VM which was previously running on that node, another
node in the physical cluster can then respond that the VM has also been
fenced (because the cluster knows the last known location of the VM, and
that node has been fenced).
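
The decision described above can be sketched roughly as follows. This
is only an illustration under assumptions: fence_vm, last_known_host,
fenced_hosts and destroy_vm are hypothetical names, the plain dict
stands in for the AIS checkpoint, and destroy_vm stands in for whatever
the owning host runs as the equivalent of 'xm destroy'.

```python
def fence_vm(vm, last_known_host, fenced_hosts, destroy_vm):
    """Answer a fencing request for a VM on behalf of the host cluster.

    last_known_host: dict mapping VM name -> host it was last seen on
                     (stand-in for the checkpoint data).
    fenced_hosts:    set of physical hosts already fenced.
    destroy_vm:      callable(host, vm) -> bool, run by the owning host.
    Returns True only when the VM is known to be stopped.
    """
    host = last_known_host.get(vm)
    if host is None:
        # No recorded location, so nothing can be confirmed.
        return False
    if host in fenced_hosts:
        # The host was already killed, so the VM died with it.
        return True
    # The host is alive: ask it to destroy the guest.
    return destroy_vm(host, vm)
```

The point of recording the last-known location is the second branch:
without it, no surviving node could vouch that the VM is gone.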

-- Lon