
Re: [Linux-cluster] What if the fence device doesn't work?

Janne Peltonen wrote:

I started wondering what happens if my fence device is broken. The

 -a node (running a service) fails
 -another node notices the lost heartbeats and tries to fence the failed node
 -however, the fence device doesn't respond
 -...what now?

I tried to simulate the situation with our test cluster of two HP Blade
servers, using iLO fencing, by misconfiguring the fencing agent to use a
wrong username to authenticate to the iLO. What happens is, the fenced
on the running node tries to fence the failed node over and over again,
and the service I'm trying to fail over will never leave state "Started"
on node "Unknown"... that is, the cluster won't fail it over to the
running node.

Not good.

Actually, it is good. Node failures come in many shapes and sizes, from a full system failure (where the whole machine is powered off) to a partial failure (where only the NIC used for heartbeat has failed, but not the OS or disk controllers). If only the NIC fails, your service is still running and still updating the hard drive, but the node can no longer send heartbeats.

Now, if the other system tries to take over the service on the assumption that the failed node is offline, it will mount the drive and start the service. Since both systems then have the same non-clustered filesystem mounted read-write, they will corrupt it pretty quickly. That is exactly what fencing is designed to prevent.

So to keep that scenario from happening, the cluster software requires a successful fence before it continues operation. It's a fail-safe setup: better to take 30 minutes of downtime while an admin makes the right decision than to corrupt your filesystems and take 8-24 hours of downtime to restore the system.
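
To make the ordering concrete, here is a rough sketch of the fail-safe rule in Python. It is not the actual fenced or rgmanager code; recover_service, fence_node, start_service and max_attempts are all hypothetical names made up for illustration. The real daemon, as you saw, keeps retrying indefinitely; the attempt limit is only there so the example terminates.

```python
from typing import Callable


def recover_service(
    fence_node: Callable[[], bool],
    start_service: Callable[[], None],
    max_attempts: int,
) -> bool:
    """Take over a service only after the failed node is confirmed fenced.

    fence_node and start_service are hypothetical callbacks standing in
    for the fence agent and the service start; they are not a real
    cluster API. Returns True if the service was started locally.
    """
    for _ in range(max_attempts):
        if fence_node():
            # Fencing succeeded, so the failed node can no longer write
            # to the shared storage. Only now is takeover safe.
            start_service()
            return True
        # Fencing failed: the other node may still be alive and writing,
        # so do not start the service. Just try the fence again.
    return False
```

The point is that start_service is reachable only through a successful fence. With a broken fence device the function never starts the service, which matches the service being stuck in "Started" on node "Unknown" until an admin steps in.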

Eric Kerin
eric bootseg com
