[Linux-cluster] two fencing problems

Lon Hohberger lhh at redhat.com
Tue Dec 20 20:34:09 UTC 2005


On Tuesday 20 December 2005 14:22, Greg Forte wrote:
> Lon Hohberger wrote:
> >
> >    <device name="FENCE1" option="off" port="1"/>
> >    <device name="FENCE2" option="reboot" port="1"/>
> >    <device name="FENCE1" option="on" port="1"/>
> >
> > ...but that's about as "optimal" as you can get while still being safe.
>
> Thinking about this a bit further, how is the second example any better
> than the first?  If fenced hangs after issuing the "off" to FENCE1 in
> your conf, but before or during issuing the reboot to FENCE2, how is
> that different than it hanging between issuing the two reboots in mine?

Sorry, I was not very clear...

A power switch "rebooting" a port means turning that port off, then on, 
optionally after some delay.

If you hang between two "reboot" operations and recover a few seconds later, 
the second reboot cycle can occur after the first reboot cycle has completed.  
That is - the first power outlet has power restored prior to the second 
outlet being turned off.  Fencing has succeeded as configured (with some 
delay for the hang), but the host has never lost power.  This is dangerous.

In the "off-reboot-on" case, if you hang between "off" (occurring first) and 
"reboot" and recover a few seconds later, the first outlet is still off when 
the second operation ("reboot") occurs.  Fencing has succeeded as configured, 
and the host has lost power.

Similarly, with the "off-off-on-on" case, if you hang between two "off" 
operations, the first port will still be off when the second "off" operation 
occurs, so the host loses power like it should.

Put simply, if we expand the "reboot" operation to what it really is to a 
power switch - "off-on" - we end up with the following:

"reboot-reboot" ==> "(off-on)-(off-on)" ==> bad!
"off-reboot-on" ==> "off-(off-on)-on" ==> good

>  Aside from the fact that mine (in theory) leaves both power outlets on,
> whereas yours leaves one off, isn't the net effect that the node didn't
> get fenced but the cluster thinks it did?  The same argument applies to
> the "off","off","on","on" configuration that I'd just as soon use.

off-off-on-on is the least ambiguous/easiest to explain plainly: both "off" 
operations succeed before either "on" operation occurs, or fencing fails.
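
To make the comparison concrete, here is a small Python sketch. It is not
part of fenced or any cluster tool; the function names and the two-outlet
model are assumptions made purely for illustration. It expands "reboot" into
"off" followed by "on", assumes the worst case in which each operation fully
completes before the next one starts (as it would after a hang), and reports
whether both outlets are ever off at the same time.

```python
# Illustrative model only: names and structure are hypothetical, not a cluster API.

def expand(sequence):
    """Expand (outlet, action) pairs; a "reboot" becomes "off" then "on"."""
    steps = []
    for outlet, action in sequence:
        if action == "reboot":
            steps.append((outlet, "off"))
            steps.append((outlet, "on"))
        else:
            steps.append((outlet, action))
    return steps


def host_loses_power(sequence, outlets=("FENCE1", "FENCE2")):
    """Return True if, at some step, every outlet is off simultaneously.

    Assumes each primitive operation completes before the next one starts,
    which is the worst case described for a hang between operations.
    """
    state = {name: "on" for name in outlets}
    for outlet, action in expand(sequence):
        state[outlet] = action
        if all(s == "off" for s in state.values()):
            return True
    return False


reboot_reboot = [("FENCE1", "reboot"), ("FENCE2", "reboot")]
off_reboot_on = [("FENCE1", "off"), ("FENCE2", "reboot"), ("FENCE1", "on")]
off_off_on_on = [("FENCE1", "off"), ("FENCE2", "off"),
                 ("FENCE1", "on"), ("FENCE2", "on")]

for name, seq in [("reboot-reboot", reboot_reboot),
                  ("off-reboot-on", off_reboot_on),
                  ("off-off-on-on", off_off_on_on)]:
    print(name, "->", "host loses power" if host_loses_power(seq)
          else "host never loses power")
```

In this model only "reboot-reboot" leaves the host powered the whole time;
the other two sequences always pass through a state with both outlets off.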

In the case where fencing fails, the cluster will retry the operation.  If 
fencing hangs, I'm not sure what happens, but someone here knows ;)

> ... it appears to have
> clobbered my "illegal" fence sections that it didn't like.  How would
> one go about controlling (restarting, disabling) cluster services
> without the gui?

Your fencing section *was* illegal :) , but I do not think it should have
clobbered (i.e. removed) it.  If there are other problems, please file a
bugzilla.

See clustat(8), clusvcadm(8) for service control.

-- Lon



