[Linux-cluster] Re: Using qdisk and a watchdog timer to eliminate power fencing single point of failure?

Lon Hohberger wrote:
I've got a deployment scenario for a two node cluster (where services are configured as active and standby) where the customer is concerned that the external power fencing device I am using (WTI) becomes a single point of failure. If the WTI for the active node dies, taking down the active node, the standby cannot bring up services because it cannot successfully fence the failed node. This leaves the cluster down.

Correct.  If you plug in a serial terminal server, though, I have a
patch to talk to the WTI switch through the terminal server in case the
server gets unjacked.

Actually, I'm more worried about a WTI that blows up, taking the active node with it. A terminal server won't help with that.

In the setup, storage fencing is not feasible as a backup for power fencing.

Not even using fence_scsi (SCSI-3 reservations)?  That's unfortunate :(

Well, it's possible, but this solution may be deployed with many different SAN implementations, so I was hoping to find a way to avoid having to certify that each SAN does SCSI reservations correctly.

I think I've worked out a scenario using qdiskd and the internal hardware watchdog timers in our nodes to use as a backup for power fencing that I hope will eliminate the single point of failure.

Hardware watchdog timers = good stuff.

Here's how I see it working:

2. Create a heuristic (besides the usual network reachability test) for qdisk that resets the node's hardware watchdog timer. (I'll have to do some additional work to ensure that the watchdog gets turned off if I am gracefully shutting down the node's qdisk daemon.)
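
A minimal sketch of such a heuristic, assuming the standard Linux watchdog character device (normally /dev/watchdog), where any write resets the countdown and writing the character 'V' just before closing asks the driver to disarm the timer. The function names are made up for illustration, and what happens on close is driver-dependent (magic close is ignored when the driver is loaded with nowayout), so treat this as a starting point only.

```python
"""Sketch: reset or disarm a Linux hardware watchdog timer."""

WATCHDOG_DEV = "/dev/watchdog"  # assumption: standard Linux watchdog device node


def pet_watchdog(dev=WATCHDOG_DEV):
    """Reset the countdown; any write to the device counts as a keepalive.

    The device is closed again without the magic character, so on drivers
    that support magic close the timer keeps running after we exit.
    """
    with open(dev, "wb", buffering=0) as wd:
        wd.write(b"\0")


def disarm_watchdog(dev=WATCHDOG_DEV):
    """Graceful shutdown: write the magic-close character 'V', then close.

    Only effective if the driver supports magic close and nowayout is off.
    """
    with open(dev, "wb", buffering=0) as wd:
        wd.write(b"V")
```

The heuristic program would call pet_watchdog() and exit with status 0 on each run; the init script that stops qdiskd gracefully would call disarm_watchdog() first.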

There's a watchdog daemon (userspace code) that lets you configure
heuristics for it.  Most are internal to it - and are therefore superior
to how qdiskd does heuristics from a HA / memory-neutrality perspective.
If some heuristic(s) are not met, the daemon can at your option stop
touching the watchdog device.

There's an open bugzilla to provide an integration path between qdiskd
and watchdogd - so that you can configure heuristics for watchdogd and
have qdiskd base its state on those.

For example, if watchdogd says "ok, we're not updating the watchdog
driver because of X", qdiskd can trigger a self-demotion off of that, or
maybe even write a 'If you don't hear from me in X seconds, consider me
dead' message to disk...?

That looks like good stuff; I'll look into it. From what I can see, watchdogd can monitor whether a file gets updated, so it's easy to integrate quorumd and watchdogd in a simple fashion by just having a quorumd heuristic that touches a file.
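
Something along these lines would do as the file-touching heuristic. It is only a sketch: the heartbeat path and function name are hypothetical, and the watchdog daemon would have to be configured separately to watch the same file (I believe that is the `file` / `change` pair in watchdog.conf, but check the version you ship).

```python
"""Sketch: qdiskd heuristic that refreshes a heartbeat file for watchdogd."""
import os

HEARTBEAT_FILE = "/var/run/qdisk-heartbeat"  # hypothetical path


def touch_heartbeat(path=HEARTBEAT_FILE):
    """Create the file if it is missing and set its mtime to the current time.

    Returns the new modification time so a caller can log or check it.
    """
    # Opening in append mode creates the file without truncating it.
    with open(path, "a"):
        pass
    # times=None means "set atime and mtime to now".
    os.utime(path, None)
    return os.stat(path).st_mtime
```

If qdiskd stops running its heuristics for any reason, the file goes stale, watchdogd stops touching the watchdog device, and the hardware timer resets the node.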

3. Create a custom fencing script that is run if power fencing fails that examines qdisk's state to see if the node that needs to be fenced is no longer updating the quorum disk.

I think the easiest thing to do is make a quick, small-footprint API or
utility to talk to qdiskd to get states...

That's what I figured.

(I'm not sure how to do this--I hope that the information stored in qdisk's status_file will be sufficient to determine this; if not, I might have to modify qdisk to supply what I need.)

... because status_file is *sketchy* at best (really, it's a debugging
tool). ;)

I was afraid of that...

The standby node then should be sure that the active node has rebooted itself either by qdiskd's action or via the watchdog timer, or else it is power dead.
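
The timing check in the fallback fencing script could look roughly like this. It is a sketch under assumptions: how the script obtains the peer's last quorum-disk update time is left open, and the function name, parameters and safety margin are hypothetical. Note that it is a pure timing argument, not a positive confirmation that the peer is down.

```python
"""Sketch: decide whether a silent peer can be presumed rebooted or dead."""


def presumed_dead(last_update, now, watchdog_timeout, qdisk_timeout, margin=5.0):
    """Return True once the peer has been silent long enough.

    last_update      -- time (seconds) of the peer's last quorum-disk update
    now              -- current time (seconds, same clock)
    watchdog_timeout -- hardware watchdog period on the peer (seconds)
    qdisk_timeout    -- time after which qdiskd reboots an evicted node
    margin           -- extra slack for clock and scheduling jitter (assumed)

    Conservative choice: wait for the later of the two deadlines plus the
    margin, so both mechanisms have had their chance to fire.
    """
    silent_for = now - last_update
    return silent_for >= max(watchdog_timeout, qdisk_timeout) + margin
```

The fencing script would exit 0 (fenced) only when presumed_dead() returns True, and otherwise exit non-zero so the cluster keeps retrying.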

Can anyone see a weakness in this approach I haven't thought of?

It's good from a best-effort standpoint.  We don't have anything that
does 'best effort' fencing - it's mostly all black/white.

A question that comes up is: if we use the watchdog + watchdog daemon,
do we need qdisk at all?  I mean, if there's an 'eventual timeout'
anyway based on the expectancy that the watchdog timer will fire and we
rely on it - why bother with the intermediate steps?

Hardware watchdog timers are going to be more reliable than just about
anything qdiskd could provide.

Ok, I get it. It's probably a couple of orders of magnitude more reliable, but since it relies only on timing, there's no real *positive* indication that the fencing succeeded, so it's really only best-effort. Even though it would take three failures (network disruption of heartbeat, quorumd failing to reboot the node and the watchdog timer failing as well), there's still a slim, slim chance that the node is still trying to write to the SAN. If I want to guarantee that there's never a split brain, then this isn't good enough.

Thanks for the advice.

Jon Biggar
Floorboard Software
jon floorboard com
jon biggar org

