[Linux-cluster] Fence device, how it works

Michael Will mwill at penguincomputing.com
Wed Nov 9 00:46:40 UTC 2005


I was thinking more along these lines:

1. node A fails
2. node B reboots node A
3. node A fails again because it has not been fixed.

Now we could have a 2-3-2 loop. The worst case is that step 3 is actually:
3.1 node A comes up and starts reacquiring its resources
3.2 node A fails again because it has not been fixed
3.3 goto 2

Your recommendations (f)/(g) are exactly what I was wondering about
as an alternative. I know it is possible, but I am trying to understand
why it is not the default behavior.

In active/passive heartbeat-style setups I set the nice_failback
option so a recovered node does not try to reclaim resources unless
the other node fails, but I wonder what the best path is in a
multi-node active/active setup.
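
For reference, this is roughly what I mean in heartbeat's ha.cf
(heartbeat 1.x syntax from memory; later releases replaced it with
auto_failback):

    # /etc/ha.d/ha.cf (excerpt)
    # leave resources where they are when a failed node returns;
    # they only move back if the surviving node itself fails
    nice_failback on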

Michael

Lon Hohberger wrote:
> On Tue, 2005-11-08 at 07:52 -0800, Michael Will wrote:
>   
>>> Power-cycle. 
>>>       
>> I always wondered about this. If the node has a problem, chances are
>> that rebooting does not fix it. Now if the node comes up semi-functional
>> and attempts to regain control over the resources that it owned before,
>> then that could be bad. Should it not rather be shut down so a human
>> can fix it before it is made operational again?
>>     
>
> This is a bit long, but maybe it will clear some things up a little.  As
> far as a node taking over a resource it thinks it still has after a
> reboot (without notifying the other nodes of its intentions), that would
> be a bug in the cluster software, and a really *bad* one too!
>
> A couple of things to remember when thinking about failures and fencing:
>
> (a) Failures are rare.  A decent PC has something like 99.95% uptime
> (I wish I knew where I heard/read this long ago) - with no
> redundancy at all.  A server with ECC RAM, RAID for internal disks, etc.
> probably has a higher uptime.
>
> (b) The hardware component most likely to fail is a hard disk (moving
> parts).  If that's the root hard disk, the machine probably won't boot
> again.  If it's the shared RAID set, then the whole cluster will likely
> have problems.
>
> (c) I hate to say this, but the kernel is probably more likely to fail
> (panic, hang) than any single piece of hardware.
>
> (d) Consider this (I think this is an example of what you said?):
>     1. Node A fails
>     2. Node B reboots node A
>     3. Node A correctly boots and rejoins cluster
>     4. Node A mounts a GFS file system correctly
>     5. Node A corrupts the GFS file system
>
> What is the chance that 5 will happen without data corruption occurring
> before 1?  Very slim, but nonzero - which brings me to my next
> point...
>
> (e) Always make backups of critical data, no matter what sort of block
> device or cluster technology you are using.  A bad RAM chip (e.g. a
> parity RAM chip missing a double-bit error) can cause periodic, quiet
> data corruption.  Chances of this happening are also very slim, but
> again, nonzero.  Probably at least as likely to happen as (d).
>
> (f) If you're worried about (d) and are willing to take the expected
> uptime hit for a given node when that node fails, even given (c), you
> can always change the cluster configuration to turn "off" a node instead
> of rebooting it. :)
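>
> For example, something along these lines in cluster.conf (the device
> and attribute names here are only illustrative and depend on your fence
> agent and cluster version; the point is an "off" action instead of the
> default reboot):
>
>     <clusternode name="nodeA">
>       <fence>
>         <method name="1">
>           <device name="apc-switch" port="1" option="off"/>
>         </method>
>       </fence>
>     </clusternode>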
>
> (g) You can chkconfig --del the cluster components so that they don't
> automatically start on reboot; same effect as (f): the node won't
> reacquire the resources if it never rejoins the cluster...
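>
> e.g. (RHEL4-era init script names; adjust to whatever your install has):
>
>     chkconfig --del ccsd
>     chkconfig --del cman
>     chkconfig --del fenced
>     chkconfig --del rgmanager
>
> Once the node is fixed, chkconfig --add them again (or start them by
> hand) to let it rejoin.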
>
>
>   
>> I/O fencing instead of power fencing kind of works like this: you undo
>> the I/O block once you know the node is fine again.
>>     
>
> Typically, we refer to that as "fabric-level fencing" vs. "power-level
> fencing"; both fit the I/O fencing paradigm of preventing a node from
> flushing buffers after it has misbehaved.
>
> Note that typically the only way to be 100% positive a node has no
> buffers waiting after it has been fenced at the fabric level is a hard
> reboot.
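>
> (As an illustration, on a Brocade FC switch the fabric-level fence boils
> down to the fence agent disabling the failed node's switch port, and the
> "undo" is re-enabling it once the node is known to be healthy; command
> names are from memory and vary by firmware:
>
>     portdisable 4    # cut the failed node off from shared storage
>     portenable  4    # let it back in only after it has been fixed
>
> The same idea applies to SAN zoning changes, SCSI reservations, etc.)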
>
> Many administrators will reboot a failed node as a first attempt to fix
> it anyway - so we're just saving them a step :)  (Again, if you want,
> you can always do (f) or (g) above...)
>
> -- Lon
>


-- 
Michael Will
Penguin Computing Corp.
Sales Engineer
415-954-2822
415-954-2899 fx
mwill at penguincomputing.com 




