[Linux-cluster] RE: Fencing quandary

jim parsons jparsons at redhat.com
Wed Oct 15 21:38:11 UTC 2008


On Wed, 2008-10-15 at 20:45 +0000, Hofmeister, James (WTEC Linux) wrote:
> Hello Jeff,
> 
> RE: [Linux-cluster] RE: Fencing quandary
> 
> The root issue is that the iLO scripts are not up to date with the current firmware revision on the c-class and p-class blades.
> 
> The default '<device name="ilo01"/>' method, which issues a "reboot", is not working with this iLO firmware revision; the workaround is to send two commands to the iLO within a single method: 'action="off"' followed by 'action="on"'.
> 
> I had tested this with my p-class blades and it was successful. I am still waiting for my customer's test results on their c-class blades.
> 
> ...yes, this is the root issue of the iLO problem, but it does not completely address your concern. I believe you are saying that RHCS does not accept a "power off" alone as a completed fence, but requires a "power off" followed by a "power on".
Right. It is failing because the 'power on' portion never completes: the
fence agent is unable to send the correct power-on command.
With all due respect to HP's iLO, along with DRAC, RSA, RSB, etc.,
keeping up with the small deltas between firmware versions of baseboard
management devices is challenging. Please pull down the very latest
version of the agent and try it. For the time being, you could just use
the power off command and walk over and turn the node back on if that is
convenient :). You could also run the agent from the command line with
the verbose output switch set (see man fence_ilo) and see if you can
determine why the command is failing; post what you find here. The agent
is written in Perl and, I think, pretty easy to follow if you are
adventurous.
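
For example, something along these lines should exercise the agent by
hand (the option names are from my reading of the fence_ilo man page and
may differ on your release; the iLO hostname, login and password here
are placeholders):

    # check that the agent can reach the iLO and read the power state
    fence_ilo -a ilo01.example.com -l Administrator -p <password> -o status -v

    # then try the off and on actions separately to see which one fails
    fence_ilo -a ilo01.example.com -l Administrator -p <password> -o off -v
    fence_ilo -a ilo01.example.com -l Administrator -p <password> -o on -v

Whatever the verbose output shows for the failing 'on' action is the
interesting part.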

The upcoming 5.3 iLO agent has been rewritten to include additional
connection types, and is now being heavily tested against many firmware
versions. The beta is close to release. Grab it when you can.
-j
> 
> Regards,
> James Hofmeister
> Hewlett Packard Linux Solutions Engineer
> 
> |-----Original Message-----
> |From: linux-cluster-bounces at redhat.com
> |[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner
> |Sent: Tuesday, October 14, 2008 3:43 PM
> |To: linux clustering
> |Subject: RE: [Linux-cluster] RE: Fencing quandary
> |
> |Thanks for the response, James. Unfortunately, it doesn't fully
> |answer my question, or at least I'm not following the logic. The bug
> |report would seem to indicate a problem with using the default
> |"reboot" method of the agent. The workaround simply replaces the
> |single fence device ('reboot') with two fence devices ('off' followed
> |by 'on') in the same fence method. If the server fails to power on,
> |then, according to the FAQ, fencing still fails ("All fence devices
> |within a fence method must succeed in order for the method to
> |succeed").
> |
> |I'm back to fenced being a SPoF if hardware failures prevent a
> |fenced node from powering on.
> |
> |--Jeff
> |Performance Engineer
> |
> |OpSource, Inc.
> |http://www.opsource.net
> |"Your Success is Our Success"
> |
> |
> |> -----Original Message-----
> |> From: linux-cluster-bounces at redhat.com
> |> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Hofmeister,
> |> James (WTEC Linux)
> |> Sent: Tuesday, October 14, 2008 1:40 PM
> |> To: linux clustering
> |> Subject: [Linux-cluster] RE: Fencing quandary
> |>
> |> Hello Jeff,
> |>
> |> I am working with Red Hat on a RHEL-5 fencing issue with c-class
> |> blades... We have bugzilla 433864 opened for this, and my notes
> |> state it is to be resolved in RHEL-5.3.
> |>
> |> We had a workaround in the RHEL-5 cluster configuration:
> |>
> |>   In the /etc/cluster/cluster.conf
> |>
> |>   *Update version number by 1.
> |>   *Then edit the fence device section for "each" node for example:
> |>
> |>                         <fence>
> |>                                 <method name="1">
> |>                                         <device name="ilo01"/>
> |>                                 </method>
> |>                         </fence>
> |>   change to  -->
> |>                         <fence>
> |>                                 <method name="1">
> |>                                         <device name="ilo01"
> |> action="off"/>
> |>                                         <device name="ilo01"
> |> action="on"/>
> |>                                 </method>
> |>                         </fence>
> |>
> |> Regards,
> |> James Hofmeister
> |> Hewlett Packard Linux Solutions Engineer
> |>
> |>
> |>
> |> |-----Original Message-----
> |> |From: linux-cluster-bounces at redhat.com
> |> |[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Jeff Stoner
> |> |Sent: Tuesday, October 14, 2008 8:32 AM
> |> |To: linux clustering
> |> |Subject: [Linux-cluster] Fencing quandary
> |> |
> |> |We had a "that totally sucks" event the other night involving
> |> |fencing. In short - Red Hat 4.7, 2-node cluster using iLO fencing
> |> |with HP blade servers:
> |> |
> |> |- passive node determined active node was unresponsive (missed too
> |> |  many heartbeats)
> |> |- passive node initiates take-over and begins fencing process
> |> |- fencing agent successfully powers off blade server
> |> |- fencing agent sits in an endless loop trying to power on the
> |> |  blade, which won't power up
> |> |- the cluster appears "stalled" at this point because fencing
> |> |  won't complete
> |> |
> |> |I was able to complete the failover by swapping out the fencing
> |> |agent with a shell script that does "exit 0". This allowed the
> |> |fencing agent to complete so the resource manager could
> |> |successfully relocate the service.
> |> |
> |> |My question becomes: why isn't a successful power off considered
> |> |sufficient for a take-over of a service? If the power is off,
> |> |you've guaranteed that all resources are released by that node. By
> |> |requiring a successful power on (which may never happen due to
> |> |hardware failure), the fencing agent becomes a single point of
> |> |failure in the cluster. The fencing agent should make an attempt
> |> |to power on a down node, but it shouldn't hold up the failover
> |> |process if that attempt fails.
> |> |
> |> |
> |> |
> |> |--Jeff
> |> |Performance Engineer
> |> |
> |> |OpSource, Inc.
> |> |http://www.opsource.net
> |> |"Your Success is Our Success"
> |> |
> |> |
> |> |--
> |> |Linux-cluster mailing list
> |> |Linux-cluster at redhat.com
> |> |https://www.redhat.com/mailman/listinfo/linux-cluster
> |> |
> |>
> |> --
> |> Linux-cluster mailing list
> |> Linux-cluster at redhat.com
> |> https://www.redhat.com/mailman/listinfo/linux-cluster
> |>
> |>
> |
> |--
> |Linux-cluster mailing list
> |Linux-cluster at redhat.com
> |https://www.redhat.com/mailman/listinfo/linux-cluster
> |
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster



