[Linux-cluster] What if the fence device doesn't work?

Tue Nov 21 13:26:20 UTC 2006

Janne Peltonen wrote:
> Hi!
>
> I started wondering what happens if my fence device is broken. The
> scenario:
>
>  -a node (running a service) fails
>  -another node notices the lost heartbeats and tries to fence the failed
>  node
>  -however, the fence device doesn't respond
>  -...what now?
>
> I tried to simulate the situation with our test cluster of two HP Blade
> servers, using iLO fencing, by misconfiguring the fencing agent to use a
> wrong username to authenticate to the iLO. What happens is, the fenced
> on the running node tries to fence the failed node over and over again,
> and the service I'm trying to fail over will never leave state "Started"
> on node "Unknown"... that is, the cluster won't fail it over to the
> running node.
>
> Not good.

Actually, it is good. Node failures come in many shapes and sizes,
from a full system failure (where the whole machine is powered off) to
a partial failure (where only the NIC used for heartbeat has failed, but
not the OS or the disk controllers). If only the NIC fails, your service
is still running and still writing to the disk; the node just can't
send heartbeats. From the other node's point of view, that looks
exactly like a dead machine.

    Now, if the other system tries to take over the service and assumes
that the failed node is offline, it will mount the drive and start the
service. With two systems both having the same non-clustered filesystem
mounted read-write, they will corrupt it pretty quickly. That is exactly
what fencing is designed to prevent.
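
To make the corruption scenario concrete, here is a toy Python model. It
is only a sketch: SharedDisk, NonClusteredMount and the node names are
made up for illustration, and a real filesystem is far more complicated.
The one assumption it tries to capture is that a non-clustered filesystem
reads its allocation state once at mount time and then trusts its own
in-memory copy, because it believes it is the only writer.

```python
class SharedDisk:
    """Toy stand-in for a shared LUN: a fixed array of blocks."""

    def __init__(self, nblocks):
        self.blocks = [None] * nblocks


class NonClusteredMount:
    """Toy model of a non-cluster-aware filesystem mount (hypothetical)."""

    def __init__(self, disk, owner):
        self.disk = disk
        self.owner = owner
        # Allocation state is read once, at mount time, and cached.
        self.next_free = disk.blocks.index(None)

    def append(self, data):
        """Write data to what this mount believes is the next free block."""
        block = self.next_free
        self.disk.blocks[block] = (self.owner, data)
        self.next_free += 1
        return block


def simulate_double_mount():
    disk = SharedDisk(4)
    # node-a lost its heartbeat NIC but is still running the service.
    node_a = NonClusteredMount(disk, "node-a")
    # node-b assumed node-a was dead and mounted without fencing it.
    node_b = NonClusteredMount(disk, "node-b")
    a_block = node_a.append("a-record")
    b_block = node_b.append("b-record")
    return a_block, b_block, disk.blocks[a_block]
```

Both mounts pick the same block, so node-b's write silently replaces
node-a's. Neither node sees an error.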

    So to keep that scenario from happening, the cluster software
requires a successful fence before it continues operation. It's a
fail-safe setup: better to take 30 minutes of downtime while an admin
makes the right decision than to corrupt your filesystems and take
8-24 hours of downtime to restore the system.
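
The fence-before-recover rule can be sketched in a few lines of Python.
This is not the actual fenced or rgmanager code; recover_service,
fence_node and start_service are hypothetical names, and max_attempts
exists only so the sketch can be exercised without looping forever (the
behaviour described in the original question is an unbounded retry).

```python
import time


def recover_service(fence_node, start_service, retry_delay=0.0,
                    max_attempts=None):
    """Start the service locally only after the failed node is fenced.

    fence_node:    callable returning True once the fence succeeded
                   (hypothetical wrapper around a fence agent).
    start_service: callable that mounts the storage and starts the service.
    max_attempts:  None means retry forever; a number bounds the loop
                   (an assumption added for demonstration).
    """
    attempts = 0
    while max_attempts is None or attempts < max_attempts:
        attempts += 1
        if fence_node():
            # The failed node is known to be off the shared storage,
            # so taking over is safe.
            start_service()
            return True
        # Fence failed: do NOT fail over, just wait and try again.
        time.sleep(retry_delay)
    return False
```

With a fence agent that always fails (for example, a wrong iLO
username), start_service is never called, which matches the service
staying stuck instead of failing over.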

Thanks,
Eric Kerin
eric at bootseg.com