
Re: [Linux-cluster] Problems with Cluster



On 6/12/07, Marc Grimme <grimme atix de> wrote:
On Tuesday 12 June 2007 03:29:00 Manish Kathuria wrote:
> On 6/11/07, Robert Gil <Robert Gil americanhm com> wrote:
> > If ilo itself is off, fencing doesn't work.
>
> Isn't there any timeout setting such that if the ILO doesn't respond
> for a certain amount of time, it is treated as fenced and the node is
> considered to be dead and the failover takes place?
As far as I remember there is only a TCP timeout when establishing the
connection to the ILO card, and it takes a very long time to occur (it is a
default setting and takes minutes). I'm not sure how or where to set it.

We did wait for quite some time and followed the messages appearing in
/var/log/messages. It kept on trying to contact the ILO of the node
which was powered off.


But we've had this discussion (especially with ILO cards) nearly every time
we used them, and for that and other reasons we had to build our own
fence_ilo agent. I'm quite sure that we solved the timeout problem in
the end. The timeout is set to 10 seconds by default (Config.timeout).
You can find it at
http://download.atix.de/yum/comoonics/productive/noarch/RPMS/comoonics-bootimage-fenceclient-ilo-0.1-16.noarch.rpm
or directly use the yum/up2date-channel as described here:
http://www.open-sharedroot.org/faq/can-i-use-yum-or-up2date-to-install-the-software/
then install "comoonics-bootimage-fenceclient-ilo" and there you go.
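
To illustrate the difference between the operating system's default TCP
timeout and a short, explicit one like the 10 seconds mentioned above, here is
a minimal Python sketch. It is not code from the comoonics fence_ilo agent;
the function name ilo_reachable and the port 443 default are assumptions made
for illustration only.

```python
import socket


def ilo_reachable(host: str, port: int = 443, timeout: float = 10.0) -> bool:
    """Return True if a TCP connection to host:port opens within `timeout` seconds.

    Hypothetical helper: the name and the port default are assumptions,
    not part of any fence agent.
    """
    try:
        # create_connection applies `timeout` to the connect attempt itself,
        # so a powered-off card fails after `timeout` seconds instead of
        # after the operating system's much longer default.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers timeouts, refused connections and unreachable hosts.
        return False
```

A check like this only shows whether the card answers at all. Deciding what
to do when it does not answer is still up to the fence configuration.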

Thanks, I will try and see if they agree to use this version.

>
> > Did you add ilo as a fence device? And create a user? You create a user
> > in the ilo for that blade, not on the chassis. You have to reboot the
> > blade to get to the ilo manager.
>
> Yes, had added respective ILOs as fence devices for both the servers
> and created users also.
We are doing so as well. Always a power user for ilo devices.
We are also automating this with the ilo client.
There is an undocumented switch -x in the fence_ilo client referenced above
where you reference a file that might look as follows and you'll have your
user.
> I just want to make sure that automatic fencing happens and failover
> takes place even when there is a complete power failure for one node
Once the timeout works, you'll also need a second fence mechanism to fall
back on. You might think about using fence_manual as a last resort, to bring
the cluster back online after a power failure, following manual intervention.
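
The idea of a second fence mechanism as a last resort can be sketched as an
ordered list of methods that are tried one after another. This is only an
illustration of the concept in Python, not the fence daemon's actual code;
fence_with_fallback and the method names in the usage note are hypothetical.

```python
from typing import Callable, Optional, Sequence, Tuple

# A fence method is a (name, callable) pair; the callable returns True on success.
FenceMethod = Tuple[str, Callable[[str], bool]]


def fence_with_fallback(node: str, methods: Sequence[FenceMethod]) -> Optional[str]:
    """Try each fence method in order; return the name of the first that succeeds.

    Returns None if every method fails, in which case the node must not be
    treated as fenced.
    """
    for name, fence in methods:
        try:
            if fence(node):
                return name
        except Exception:
            # A method that raises (for example, the ILO does not answer)
            # counts as failed; move on to the next one.
            continue
    return None
```

With a list such as [("ilo", try_ilo), ("manual", wait_for_operator)], the
manual step is reached only after the ILO attempt has failed or timed out,
which is why a short timeout on the first method matters.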

Regards Marc.

Just wondering if there is any undocumented option / switch which will
force an automatic failover to one node if the ILO on the other one
fails to respond within a certain time period (maybe a few minutes).

Regards,
--
Manish

