[Linux-cluster] Graceful Degradation
gordan at bobich.net
Fri Dec 14 17:42:02 UTC 2007
On Fri, 14 Dec 2007, Roger Peña wrote:
>
> --- gordan at bobich.net wrote:
>
>> On Fri, 14 Dec 2007, Roger Peña wrote:
>>
>>> I think this is question #1 in the FAQs and on this
>>> list :-)
>>>
>>> The short answer, and the first places to look, are:
>>> 1- fencing not configured or configured as manual
>>> 2- fencing problems, the devices not working as they
>>> should
>>
>> The problem is that I don't have any devices I could
>> do fencing with. Is
>
> Do you not have:
> 1- shared storage? Usually the "server" of the shared
> storage has a way to cut off the storage to a client, so
> this can serve as a fencing device
> 2- what kind of servers do you have? HP servers have
> iLO, and Sun and Dell servers have something similar, so
> those interfaces can act as fencing devices
I have Dell servers, but nothing that can be used to monitor them.
I'm really only looking for something simple - if a node fails 10 pings in
a row or fails to respond to a ping in 10 seconds, kick it off. If it
rejoins (on boot-up), then it should be allowed to join.
If all nodes monitor all other nodes, and kick the ones they can't
contact, they'll either fence the dead node, or the dead node will fence
itself off if there's a NIC failure. Or if the switch fails they'll all
fence themselves off, but, in that case, so what...
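To make the rule above concrete, here is a rough sketch of the bookkeeping in Python. It is not an existing tool: the PeerMonitor class, its method names and the action tuples are all made up for illustration, and the 10-miss threshold is simply the figure mentioned above. It counts consecutive missed pings per peer, fences a peer once it reaches the threshold, lets it rejoin when it answers again, and treats "every peer is unreachable" as a sign that the local NIC or the switch has failed.

```python
from dataclasses import dataclass, field

MAX_MISSES = 10  # consecutive failed pings before a peer is kicked


@dataclass
class PeerMonitor:
    """Hypothetical helper: tracks missed pings and decides whom to fence."""
    peers: list
    max_misses: int = MAX_MISSES
    misses: dict = field(default_factory=dict)
    fenced: set = field(default_factory=set)

    def record(self, results):
        """results maps peer name -> True if the last ping was answered.

        Returns a list of actions: ("fence", peer), ("rejoin", peer)
        or ("fence_self", None).
        """
        actions = []
        for peer in self.peers:
            if results.get(peer, False):
                # Peer answered: reset its counter and let it back in.
                self.misses[peer] = 0
                if peer in self.fenced:
                    self.fenced.discard(peer)
                    actions.append(("rejoin", peer))
            else:
                self.misses[peer] = self.misses.get(peer, 0) + 1

        dead = [p for p in self.peers
                if self.misses.get(p, 0) >= self.max_misses]
        if self.peers and len(dead) == len(self.peers):
            # Nobody answers: more likely our own NIC or the switch.
            return [("fence_self", None)]
        for peer in dead:
            if peer not in self.fenced:
                self.fenced.add(peer)
                actions.append(("fence", peer))
        return actions
```

Note that with only two nodes the single peer is also "all peers", so this sketch cannot tell a dead peer from a local NIC failure and would self-fence; that case would need a different tie-breaker.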
>> there a way to achieve this without external
>> monitoring?
> Not that I know of,
> but I don't want to :-). I would like to be sure that
> a node with problems gets kicked from the cluster so
> it does not mess things up; that is why I would decline to
> start a cluster without at least a first level of
> fencing.
Except I don't have any fail-over services per se. All nodes run all
services. If a node fails, it won't respond and the load-balancer will
just stop directing TCP traffic to it.
At the moment, I'm thinking about the fencing console in the OSR tools,
and writing a small monitoring daemon in Perl to use it to kick out the
nodes that aren't responding. It's just that it'd be nice if there was
already something out there that'll do this...
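For what it's worth, the daemon I have in mind would be roughly the following. This is sketched in Python rather than Perl, and it rests on several assumptions: the host names are invented, FENCE_CMD is a placeholder path and not the actual OSR fencing console interface, and the ping flags are the Linux iputils ones (-c for count, -W for the reply timeout in seconds).

```python
import subprocess
import time

PEERS = ["node1", "node2", "node3"]          # hypothetical host names
FENCE_CMD = ["/usr/local/sbin/fence-node"]   # placeholder, NOT a real OSR command
MAX_MISSES = 10                              # pings in a row before kicking
INTERVAL = 1.0                               # seconds between rounds


def ping(host, timeout=1):
    """Return True if host answers a single ICMP echo (Linux iputils flags)."""
    cmd = ["ping", "-c", "1", "-W", str(timeout), host]
    return subprocess.call(cmd, stdout=subprocess.DEVNULL,
                           stderr=subprocess.DEVNULL) == 0


def fence(host):
    """Hand the host to the placeholder fencing command."""
    subprocess.call(FENCE_CMD + [host])


def poll_once(peers, misses, probe=ping, kick=fence, max_misses=MAX_MISSES):
    """Ping every peer once; kick a peer exactly when it hits max_misses."""
    for host in peers:
        if probe(host):
            misses[host] = 0
        else:
            misses[host] = misses.get(host, 0) + 1
            if misses[host] == max_misses:
                kick(host)


def main():
    """Poll forever; run this under an init script or similar."""
    misses = {}
    while True:
        poll_once(PEERS, misses)
        time.sleep(INTERVAL)
```

Calling main() starts the loop. The probe and kick arguments are there so the counting can be exercised with stand-in functions, without touching the network or fencing anything.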
Gordan