Re: [Linux-cluster] Fenced failing continuously

Thanks for the info, I will be sure to have our monitor watch the iLO ports...

I've done some testing with fence_ilo and haven't seen a lengthy failover time. I'm running the Python script that is part of the Clustering group. Is that the user-contributed one? In my testing so far, failover completes in a few seconds.

On Mon, Apr 13, 2009 at 2:08 PM, Robert Hurst <rhurst bidmc harvard edu> wrote:
You're right that there is no such thing as fail-safe ... but I would worry more if I just hard-coded a return value of SUCCESS in my scripts.  Management cards are supposed to work even if the servers are powered down -- unless there is a loss of power to both lines.  In that case, no electricity == no servers == no cluster, which means you are doing a cold boot regardless.

We have both fence_ilo and fence_bladecenter in effect.  As good as the iLO cards have performed to date, we are still moving off HP DL385s into IBM BladeCenter because its management processors are closer to fault tolerant than anything else we have experienced.  I have had HP iLO cards "crash" and not reset themselves -- although later firmware revisions have reduced those outages greatly.  Monitoring the iLO's https and ssh ports for availability is a requirement!
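
A minimal sketch of the kind of port check described above, assuming the iLO answers on the standard https (443) and ssh (22) TCP ports. The function names are made up for illustration, and a successful TCP connect only shows that the port accepts connections, not that the iLO is otherwise healthy.

```python
import socket


def port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_ilo(host, ports=(443, 22)):
    """Map each port to whether it accepted a TCP connection.

    The default ports are an assumption: https on 443 and ssh on 22.
    """
    return {port: port_open(host, port) for port in ports}
```

Calling check_ilo() with the address of an iLO card returns something like {443: True, 22: False}, which a monitoring job could turn into an alert.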

There is a user-contributed fence_ilo patch listed somewhere in this list worth investigating -- it runs A LOT FASTER than the stock one.  AFAIK, fence_ilo does not use ssh, but a sort of web SOAP services call via https.  We have seen in production and testing that a typical fencing operation using fence_ilo takes 42 seconds, and a good percentage of the time it takes up to twice as long as that.  The bladecenter fencing operations we have seen complete in under 7 seconds.
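
For anyone who wants to compare timings like these on their own hardware, here is a rough sketch of a wall-clock timer around an arbitrary command. The helper name is hypothetical, and the fence agent's command line is deliberately left to the caller: pass whatever arguments you already use.

```python
import subprocess
import time


def time_command(argv):
    """Run argv (a list of strings) and return (exit_code, elapsed_seconds)."""
    start = time.monotonic()
    result = subprocess.run(
        argv,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode, time.monotonic() - start
```

Running it several times and looking at the spread, not just one run, is the better comparison, since the iLO numbers above vary by a factor of two.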

