
RE: [Linux-cluster] RE: Fencing quandry



Thanks for the response, James. Unfortunately, it doesn't fully answer
my question, or at least I'm not following the logic. The bug report
seems to indicate a problem with the agent's default "reboot" action.
The workaround simply replaces the single fence device entry ('reboot')
with two entries ('off' followed by 'on') in the same fence method. If
the server fails to power on, then, according to the FAQ, fencing still
fails ("All fence devices within a fence method must succeed in order
for the method to succeed").
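
To make the concern concrete, here is a toy model of the FAQ rule quoted
above. It is only a sketch of that one sentence, not the actual fenced
logic; the names run_method, power_off and power_on are made up for
illustration, and the "off only" method is a hypothetical comparison case.

```python
from typing import Callable, List

# Each device entry is modelled as a callable that returns True on success.
DeviceAction = Callable[[], bool]


def run_method(devices: List[DeviceAction]) -> bool:
    """Model of the FAQ rule: a fence method succeeds only if every
    device entry in it succeeds, evaluated in order."""
    for action in devices:
        if not action():
            return False
    return True


def power_off() -> bool:
    return True   # assumed: the blade powers off cleanly


def power_on() -> bool:
    return False  # assumed: a hardware fault keeps the blade from powering on


workaround_method = [power_off, power_on]  # 'off' followed by 'on'
off_only_method = [power_off]              # hypothetical: 'off' alone

print(run_method(workaround_method))  # False
print(run_method(off_only_method))    # True
```

Under this reading, the 'off' + 'on' method fails in exactly the
situation we hit, even though the node is already safely powered off.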

So I'm back to fenced being a single point of failure (SPoF) whenever a
hardware failure prevents a fenced node from powering on.
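
For what it's worth, the behavior I would expect looks roughly like the
sketch below. This is not how any shipped fence agent works as far as I
know; fence_node, its parameters and the retry count are all
hypothetical, and the power operations are passed in as plain callables.

```python
from typing import Callable


def fence_node(power_off: Callable[[], bool],
               power_on: Callable[[], bool],
               on_attempts: int = 3) -> bool:
    """Report fencing success as soon as power-off is confirmed.

    Power-on is attempted a bounded number of times afterwards, but its
    outcome never changes the result.
    """
    if not power_off():
        # Without a confirmed power-off there is no guarantee the node
        # has released its resources, so fencing must fail.
        return False
    for _ in range(on_attempts):
        if power_on():
            break  # the node is coming back up; stop retrying
    return True


# Power-off works, power-on never does (the failure described above).
print(fence_node(lambda: True, lambda: False))
```

The point of the bounded loop is that a dead blade costs a few retries
rather than stalling the failover indefinitely.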

--Jeff
Performance Engineer

OpSource, Inc.
http://www.opsource.net
"Your Success is Our Success"
  

> -----Original Message-----
> From: linux-cluster-bounces redhat com 
> [mailto:linux-cluster-bounces redhat com] On Behalf Of 
> Hofmeister, James (WTEC Linux)
> Sent: Tuesday, October 14, 2008 1:40 PM
> To: linux clustering
> Subject: [Linux-cluster] RE: Fencing quandry
> 
> Hello Jeff,
> 
> I am working with Red Hat on a RHEL-5 fencing issue with 
> c-class blades...  We have bugzilla 433864 opened for this, 
> and my notes state it is to be resolved in RHEL-5.3.
> 
> We had a workaround in the RHEL-5 cluster configuration:
> 
>   In the /etc/cluster/cluster.conf
> 
>   * Increase the version number by 1.
>   * Then edit the fence device section for "each" node, for example:
> 
>                         <fence>
>                                 <method name="1">
>                                         <device name="ilo01"/>
>                                 </method>
>                         </fence>
>   change to  -->
>                         <fence>
>                                 <method name="1">
>                                         <device name="ilo01" action="off"/>
>                                         <device name="ilo01" action="on"/>
>                                 </method>
>                         </fence>
> 
> Regards,
> James Hofmeister
> Hewlett Packard Linux Solutions Engineer
> 
> 
> 
> |-----Original Message-----
> |From: linux-cluster-bounces redhat com
> |[mailto:linux-cluster-bounces redhat com] On Behalf Of Jeff Stoner
> |Sent: Tuesday, October 14, 2008 8:32 AM
> |To: linux clustering
> |Subject: [Linux-cluster] Fencing quandry
> |
> |We had a "that totally sucks" event the other night involving fencing.
> |In short - Red Hat 4.7, 2 node cluster using iLO fencing with HP blade
> |servers:
> |
> |- passive node determined active node was unresponsive (missed too many
> |heartbeats)
> |- passive node initiates take-over and begins fencing process
> |- fencing agent successfully powers off blade server
> |- fencing agent sits in an endless loop trying to power on the
> |blade, which won't power up
> |- the cluster appears "stalled" at this point because fencing
> |won't complete
> |
> |I was able to complete the failover by swapping out the
> |fencing agent with a shell script that does "exit 0". This
> |allowed the fencing agent to complete so the resource manager
> |could successfully relocate the service.
> |
> |My question becomes: why isn't a successful power off
> |considered sufficient for a take-over of a service? If the
> |power is off, you've guaranteed that all resources are
> |released by that node. By requiring a successful power on
> |(which may never happen due to hardware failure), the fencing
> |agent becomes a single point of failure in the cluster. The
> |fencing agent should make an attempt to power on a down node
> |but it shouldn't hold up the failover process if that attempt fails.
> |
> |
> |
> |--Jeff
> |Performance Engineer
> |
> |OpSource, Inc.
> |http://www.opsource.net
> |"Your Success is Our Success"
> |
> |
> |--
> |Linux-cluster mailing list
> |Linux-cluster redhat com
> |https://www.redhat.com/mailman/listinfo/linux-cluster
> |
> 

