[Linux-cluster] Graceful Degradation

gordan at bobich.net gordan at bobich.net
Fri Dec 14 17:42:02 UTC 2007


On Fri, 14 Dec 2007, Roger Peña wrote:

>
> --- gordan at bobich.net wrote:
>
>> On Fri, 14 Dec 2007, Roger Peña wrote:
>>
>>> I think this is question #1 in the FAQ and on this
>>> list :-)
>>>
>>> The short answer, and the first places to look:
>>> 1- fencing not configured, or configured as manual
>>> 2- fencing problems: the devices not working as they
>>> should
>>
>> The problem is that I don't have any devices I could
>> do fencing with. Is
>
> you do not have:
> 1- shared storage? Usually the "server" of the shared
> storage has a way to cut off the storage from a client, so
> this can serve as a fencing device
> 2- what kind of servers do you have? HP servers have
> iLO; Sun and Dell servers have something similar, so
> those interfaces can act as fencing devices

I have Dell servers, but nothing that can be used to monitor them.

I'm really only looking for something simple - if a node fails 10 pings in 
a row or fails to respond to a ping in 10 seconds, kick it off. If it 
rejoins (on boot-up), then it should be allowed to join.
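
A minimal sketch of that rule, in Python purely for illustration. The class 
and parameter names are hypothetical, and the actual ping is left to the 
caller; this only tracks the two thresholds (10 consecutive misses, or 10 
seconds without a reply) and resets when the node answers again.

```python
import time


class PeerHealth:
    """Tracks one peer's ping history (illustrative names throughout)."""

    def __init__(self, max_misses=10, max_silence=10.0, now=time.monotonic):
        self.max_misses = max_misses    # consecutive failed pings allowed
        self.max_silence = max_silence  # seconds allowed without any reply
        self._now = now                 # injectable clock, handy for testing
        self.misses = 0
        self.last_seen = now()

    def record(self, replied):
        """Feed in the outcome of one ping attempt."""
        if replied:
            # A reply clears the failure count, so a rebooted node
            # is treated as healthy again and may rejoin.
            self.misses = 0
            self.last_seen = self._now()
        else:
            self.misses += 1

    def should_kick(self):
        """True once either threshold has been reached."""
        silent_for = self._now() - self.last_seen
        return self.misses >= self.max_misses or silent_for >= self.max_silence
```

The caller would invoke record() after every ping and check should_kick() 
before deciding to fence.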

If all nodes monitor all other nodes, and kick the ones they can't
contact, they'll either fence the dead node, or the dead node will fence
itself off if there's a NIC failure. Or if the switch fails they'll all
fence themselves off, but, in that case, so what...
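
The decision each node would make can be sketched as below. This is only an 
illustration under an assumption not spelled out above: a node that cannot 
reach any of its peers concludes that its own link (or the switch) is the 
problem and fences itself. The function name and return values are made up.

```python
def decide(self_name, reachable):
    """Decide what this node should do.

    reachable maps each peer's name to True/False as seen from this node.
    Returns a (action, nodes) tuple; the action strings are illustrative.
    """
    if not reachable:
        return ("ok", [])

    dead = sorted(peer for peer, up in reachable.items() if not up)

    if len(dead) == len(reachable):
        # Nobody answers: assume our own NIC or the switch has failed.
        # Caveat: with only one peer this cannot be told apart from
        # that peer simply being down.
        return ("fence_self", [self_name])
    if dead:
        # Some peers answer, so the silent ones are presumed dead.
        return ("kick", dead)
    return ("ok", [])
```

For example, decide("n1", {"n2": True, "n3": False}) asks for n3 to be 
kicked, while a node that sees every peer as unreachable fences itself.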

>> there a way to achieve this without external
>> monitoring?
> not that I know of,
> but I would not want to :-). I would like to be sure that
> a node with problems gets kicked out of the cluster so
> it does not mess things up. That is why I will decline to
> start a cluster without at least a first level of
> fencing.

Except I don't have any fail-over services per se. All nodes run all 
services. If a node fails, it won't respond and the load-balancer will 
just stop directing TCP traffic to it.

At the moment, I'm thinking about the fencing console in the OSR tools,
and writing a small monitoring daemon in Perl to use it to kick out the
nodes that aren't responding. It's just that it'd be nice if there was
already something out there that'll do this...
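
The rough shape of such a daemon, sketched in Python rather than Perl. Both 
probe() and kick() are hypothetical callables supplied by the caller: probe 
would wrap a ping, and kick would wrap whatever the fencing console actually 
exposes, which is not assumed here.

```python
import time


def monitor(peers, probe, kick, max_misses=10, interval=1.0,
            rounds=None, sleep=time.sleep):
    """Poll every peer and kick the ones that stop answering.

    probe(peer) -> bool and kick(peer) are supplied by the caller.
    rounds=None loops forever; a number makes the loop finite.
    """
    misses = {peer: 0 for peer in peers}
    kicked = set()
    done = 0
    while rounds is None or done < rounds:
        for peer in peers:
            if probe(peer):
                misses[peer] = 0
                kicked.discard(peer)  # the node is back, let it rejoin
            else:
                misses[peer] += 1
                if misses[peer] >= max_misses and peer not in kicked:
                    kick(peer)        # kick once, not on every round
                    kicked.add(peer)
        done += 1
        sleep(interval)
    return kicked
```

Passing rounds and sleep as arguments is only there so the loop can be 
exercised without waiting or running forever.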

Gordan

