
[Linux-cluster] RE: Fencing quandry

Hello Jeff,

I am working with Red Hat on a RHEL-5 fencing issue with c-class blades. We have Bugzilla 433864 open for this, and my notes say it is to be resolved in RHEL-5.3.

We had a workaround in the RHEL-5 cluster configuration:

  In /etc/cluster/cluster.conf:

  * Increment the version number by 1.
  * Then edit the fence device section for each node. For example, change:

                                <method name="1">
                                        <device name="ilo01"/>
                                </method>
  to:
                                <method name="1">
                                        <device name="ilo01" action="off"/>
                                        <device name="ilo01" action="on"/>
                                </method>
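
For a cluster with many nodes, the same edit could be scripted rather than made by hand. The following is only a rough sketch, not a tested tool: the function name split_power_cycle, the sample configuration, and the cluster name "example" are hypothetical, and it assumes the version number lives in the config_version attribute of the top-level <cluster> element. It increments the version and replaces each listed fence device that has no action with an "off" entry followed by an "on" entry.

```python
import xml.etree.ElementTree as ET

# Hypothetical, cut-down cluster.conf used only for illustration.
SAMPLE = """<cluster name="example" config_version="7">
  <clusternodes>
    <clusternode name="node1">
      <fence>
        <method name="1">
          <device name="ilo01"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>"""


def split_power_cycle(conf_xml, device_names):
    """Return conf_xml with the version bumped and the named devices split.

    Each <device> whose name is in device_names and which has no action
    attribute becomes two entries: action="off" followed by action="on".
    """
    root = ET.fromstring(conf_xml)

    # Assumption: the version is the config_version attribute on <cluster>.
    root.set("config_version", str(int(root.get("config_version")) + 1))

    for method in root.iter("method"):
        # Walk a copy backwards so inserts do not shift pending indices.
        for index, device in reversed(list(enumerate(list(method)))):
            if device.tag != "device":
                continue
            if device.get("name") not in device_names:
                continue
            if device.get("action") is not None:
                continue  # already edited, leave it alone
            device.set("action", "off")
            power_on = ET.Element("device", dict(device.attrib))
            power_on.set("action", "on")
            power_on.tail = device.tail
            method.insert(index + 1, power_on)

    return ET.tostring(root, encoding="unicode")


if __name__ == "__main__":
    print(split_power_cycle(SAMPLE, {"ilo01"}))
```

Run it against a copy of the file and review the result before putting the new cluster.conf into service.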

James Hofmeister
Hewlett Packard Linux Solutions Engineer

|-----Original Message-----
|From: linux-cluster-bounces redhat com
|[mailto:linux-cluster-bounces redhat com] On Behalf Of Jeff Stoner
|Sent: Tuesday, October 14, 2008 8:32 AM
|To: linux clustering
|Subject: [Linux-cluster] Fencing quandry
|We had a "that totally sucks" event the other night involving fencing.
|In short - Red Hat 4.7, 2 node cluster using iLO fencing with HP blade
|- passive node determined active node was unresponsive (missed too many
|- passive node initiates take-over and begins fencing process
|- fencing agent successfully powers off blade server
|- fencing agent sits in an endless loop trying to power on the
|blade, which won't power up
|- the cluster appears "stalled" at this point because fencing
|won't complete
|I was able to complete the failover by swapping out the
|fencing agent with a shell script that does "exit 0". This
|allowed the fencing agent to complete so the resource manager
|could successfully relocate the service.
|My question becomes: why isn't a successful power off
|considered sufficient for a take-over of a service? If the
|power is off, you've guaranteed that all resources are
|released by that node. By requiring a successful power on
|(which may never happen due to hardware failure), the fencing
|agent becomes a single point of failure in the cluster. The
|fencing agent should make an attempt to power on a down node
|but it shouldn't hold up the failover process if that attempt fails.
|Performance Engineer
|OpSource, Inc.
|"Your Success is Our Success"
