[Linux-cluster] fencing: external vs watchdog

Fri Aug 17 16:33:34 UTC 2007

On Fri, Aug 17, 2007 at 09:29:05AM +0200, Mark Hlawatschek wrote:
> Hi,
> 
> I'd like to discuss and collect information about the two different fencing 
> approaches.
> 
> external fencing: The failed cluster node is disconnected from the storage 
> device by another node in the cluster. After a failure detection all cluster 
> activities are suspended until the IO fencing of the failed node has been 
> completed successfully.
> 
> watchdog fencing: A failed cluster node has to recognize the failure by itself 
> and will be shut down by a kind of internal watchdog feature.
> 
> Now, I see that theoretically the external fencing method (when configured 
> correctly) is the better approach because of the exactly defined state 
> during a fencing and recovery operation.

> But the question is: What are real world examples of failures when the 
> watchdog fencing would fail and cause data corruption on the storage device ?
> I'd like to collect some real world examples and also theoretical approaches.

Hardware watchdog failure.

Kernel watchdog thread failure.

Wrong watchdog timeout (timer fires after the failover attempt when the
node isn't really hung).
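
To put rough numbers on that last case, here is a minimal Python sketch
of the timing arithmetic. The function name overlap_window and every
number in it are made up for illustration; none of it comes from a real
cluster configuration. It assumes the node loses contact with the
cluster at t=0 but keeps doing I/O until its watchdog resets it, and
that the surviving node takes over after a fixed delay without
verifying anything.

```python
def overlap_window(watchdog_timeout_s: float, failover_delay_s: float) -> float:
    """Seconds during which both nodes may write to the shared storage.

    Hypothetical model: the isolated node keeps writing until its watchdog
    resets it at watchdog_timeout_s, while the surviving node starts
    writing at failover_delay_s. Any positive difference is unprotected.
    """
    return max(0.0, watchdog_timeout_s - failover_delay_s)


# Watchdog fires at 60 s, peer takes over at 30 s: 30 s of overlap.
print(overlap_window(60.0, 30.0))

# Watchdog fires at 10 s, peer takes over at 30 s: no overlap in this model.
print(overlap_window(10.0, 30.0))
```

Even the second case is only safe if the watchdog really fires on time,
which is exactly what nobody checks.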

With linux-cluster, historically, we have required verification from a
3rd party (i.e. the fence device) and have not assumed anything.  That
is, fencing devices are assumed to be set up correctly, and therefore,
when we ask the fence device if this port (node, whatever) is off, it
can say "Yes" or "No".

Fencing goes like this:

  do {
    sleep(5);
    fence_node();
  } while (!node_is_fenced() && !node_rejoined());
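
The same loop can be written as a small runnable Python sketch. This is
not the actual fenced code; fence_until_verified, FakeFenceDevice and
the injectable sleep argument are hypothetical names used only to show
the "keep trying until a third party confirms" idea.

```python
import time
from typing import Callable


def fence_until_verified(
    fence_node: Callable[[], None],
    node_is_fenced: Callable[[], bool],
    node_rejoined: Callable[[], bool],
    retry_delay_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> int:
    """Retry fencing until the device confirms it or the node rejoins.

    Returns the number of fence attempts that were made.
    """
    attempts = 0
    while True:
        sleep(retry_delay_s)
        fence_node()
        attempts += 1
        # Never assume success: only a confirmation (or a rejoin) ends the loop.
        if node_is_fenced() or node_rejoined():
            return attempts


class FakeFenceDevice:
    """Hypothetical power switch that only reports "off" after N requests."""

    def __init__(self, succeed_on: int = 3) -> None:
        self.requests = 0
        self.succeed_on = succeed_on

    def power_off(self) -> None:
        self.requests += 1

    def is_off(self) -> bool:
        return self.requests >= self.succeed_on


device = FakeFenceDevice()
attempts = fence_until_verified(
    fence_node=device.power_off,
    node_is_fenced=device.is_off,
    node_rejoined=lambda: False,
    sleep=lambda seconds: None,  # skip real sleeping in this demo
)
print(attempts)
```

Recovery stays blocked for as long as the device keeps answering "No",
which is the whole point of asking it.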

With a watchdog timer, there's no fencing and no verification.

Red Hat Cluster Suite 3 was the last instance of linux-cluster / RHCS
which supported use of watchdog timers (based on assumptions), but it
wasn't considered fencing. (In RHEL 2.1 you added an actual "watchdog"
agent to the cluster config, which was a no-op, how useful...)

Data integrity after a failure was disclaimed if watchdogs were used.

I can't think of any real-world problems that resulted, but I forget
lots of things ;)

-- 
Lon Hohberger - Software Engineer - Red Hat, Inc.