
Re: [Linux-cluster] Problems with Cluster

On 6/12/07, Marc Grimme <grimme atix de> wrote:
On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote:
> On 6/11/07, Robert Gil <Robert Gil americanhm com> wrote:
> > If ilo itself is off, fencing doesn't work.
> Isn't there any timeout setting such that if the ILO doesn't respond
> for a certain amount of time, it is treated as fenced and the node is
> considered to be dead and the failover takes place?
As far as I remember there is only a TCP timeout when establishing the
connection to the ilo-card, and it takes a very long time to occur (that's a
default setting and takes minutes). I'm not sure how or where to set it.

We did wait for quite some time and followed the messages appearing in
/var/log/messages. It kept on trying to contact the ILO of the node
which was powered off.

But we've had this discussion (especially with ILO cards) nearly every time
we used them, and for that and other reasons we had to build our own
fence_ilo agent. I'm quite sure that we solved the timeout problem in
the end. It is set to 10 sec by default (Config.timeout).
You can find it at
or directly use the yum/up2date-channel as described here:
then install "comoonics-bootimage-fenceclient-ilo" and there you go.
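
To illustrate the timeout point above: a minimal sketch, in Python, of probing
a management card with an explicit connect timeout instead of waiting for the
operating system's default TCP timeout. This is not the code of the fence_ilo
agent mentioned above; the function name, the port 443 default and the host
address are assumptions for illustration, and the 10 second value simply
mirrors the default quoted in this thread.

```python
import socket


def ilo_reachable(host, port=443, timeout=10.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: the name and the port default are illustrative only.
    """
    try:
        # create_connection gives up after `timeout` seconds instead of
        # waiting for the much longer OS-level TCP connect timeout.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

A caller could then decide quickly that the card is not answering, for example
`ilo_reachable("192.0.2.10")` with a placeholder address, and move on to
another fence method rather than retrying for minutes.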

Thanks, I will try and see if they agree to use this version.

> > Did you add ilo as a fence device? And create a user? You create a user
> > in the ilo for that blade, not on the chassis. You have to reboot the
> > blade to get to the ilo manager.
> Yes, had added respective ILOs as fence devices for both the servers
> and created users also.
We are doing so as well. Always a power user for ilo devices.
We are also automating this with the ilo client.
There is an undocumented switch -x in the fence_ilo client referenced above
where you reference a file that might look as follows and you'll have your
> I just want to make sure that automatic fencing happens and failover
> takes place even when there is a complete power failure for one node
If the timeout thing works you'll also need a second fence mechanism.
You might think about using fence_manual as a last resort, to bring the cluster
back online after a power failure once someone has intervened manually.
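
The idea of a second fence mechanism can be sketched as an ordered fallback:
try the first method, and if it fails or times out, try the next one. This is
only an illustration of the logic, written in Python; it is not how the
cluster's fence daemon is implemented, and the function and method names are
hypothetical.

```python
def fence_node(node, methods):
    """Try each (name, callable) fence method in order.

    Returns the name of the first method that reports success, or None if
    every method failed. All names here are illustrative assumptions.
    """
    for name, method in methods:
        try:
            if method(node):
                return name
        except Exception:
            # A failing method (e.g. an unreachable ILO) must not stop
            # the remaining methods from being tried.
            continue
    return None
```

With a list such as `[("ilo", fence_via_ilo), ("manual", fence_via_operator)]`,
where both callables are placeholders, the manual step is only reached when
the ILO attempt has failed.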

Regards Marc.

Just wondering if there is any undocumented option / switch which will
force an automatic failover to one node if the ILO on the other one
fails to respond within a certain time period (maybe a few minutes).

