Re: [Linux-cluster] Re: More CS4 fencing fun

On Fri, 2006-03-24 at 11:06 +0100, Matteo Catanese wrote:
> Hi Lon,
> your mail is "music" to my ears :D
> I will try your /sbin/fence_dontcare immediately.

Best wishes!  If it breaks, all of the pieces are yours to keep.

> I don't want to be interrupted on weekends when I play my
> favourite video game (WoW) just because ONE component broke and the whole
> cluster hung :-)

Great game.

> Sure, our hardware configuration can also sustain some multi-point
> failures, but NSPOF is our main goal

Remember that a redundant remote power switch doesn't obviate the need
for iLO.  iLO is *much* more than a power button.  It has remote console
abilities and other management stuff -- all of which is very useful for
system administration and maintenance.

In my opinion, the power-button feature of iLO is the *least* useful.

> In my case, WTI should be useful only in case of multiple failures, for
> example if both network switches fail, so heartbeat fails and iLO fails
> too, and with /sbin/fence_dontcare I will have corruption. Is this
> correct?

With the dontcare hack, you can have corruption if the node stops
heartbeating (for any reason) and iLO does not respond at the time
fence_ilo is called.

Examples - Live-hang of the node with the iLO disconnected, too much
system load to get out heartbeats, network congestion/saturation, bad
cables, routing problems, internal problem in the switch, ARP storms,
power surges, iLO bugs/failure, too many people logged in to iLO, etc.
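
To make the risk concrete, here is a rough sketch of what a "dontcare"
style wrapper does. This is not the actual /sbin/fence_dontcare script;
the function name, its arguments, and the timeout value are assumptions
made up for illustration. The only point is that the wrapper reports
success to the caller even when the real agent failed or never answered.

```python
import subprocess


def fence_dontcare(agent_cmd, agent_input, timeout=30):
    """Run the real fence agent, but report success no matter what.

    agent_cmd   -- hypothetical command line for the real agent
    agent_input -- text passed to the agent on stdin
    Returns (exit_code, really_fenced). exit_code is always 0.
    """
    try:
        result = subprocess.run(
            agent_cmd,
            input=agent_input,
            text=True,
            capture_output=True,
            timeout=timeout,
        )
        really_fenced = result.returncode == 0
    except (OSError, subprocess.TimeoutExpired):
        # Agent missing, not executable, or it never answered in time.
        really_fenced = False

    # The hack: claim success regardless. If really_fenced is False, the
    # cluster proceeds with recovery while the other node may still be
    # writing to shared storage.
    return 0, really_fenced
```

Every case where really_fenced comes back False while the exit code is
still 0 is a window for corruption.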

I do not know all of the possible failure cases.  That is why the
last cluster I set up has a remote power controller, even though all of
the nodes individually have iLO as well.  Call me paranoid if you want,
but please, think about these two points:

(1) Uptime with corrupt data does not equal availability

... and, more importantly ...

(2) It *really* sucks to have to restore from backup when you could be
playing WoW...
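
For contrast, the iLO-plus-power-switch setup mentioned above amounts to
trying fencing methods in order and only calling the node fenced when
one of them really worked. A minimal sketch, assuming each method is a
plain callable that returns True on success (fence_node and the method
names are hypothetical, not anything from the cluster software):

```python
def fence_node(node, methods):
    """Try each fencing method in order; stop at the first that succeeds.

    methods -- ordered list of (name, attempt) pairs, where attempt(node)
               returns True only if the node was really powered off/reset.
    Returns the name of the method that worked, or None if all failed.
    """
    for name, attempt in methods:
        try:
            if attempt(node):
                return name
        except Exception:
            # A broken or unreachable device counts as a failed attempt;
            # move on to the next method.
            continue
    # Nothing worked: do NOT pretend the node is fenced.
    return None


# Example with stand-in callables: iLO is unreachable, the switch works.
def ilo_down(node):
    raise ConnectionError("iLO not responding")


def power_switch_ok(node):
    return True


used = fence_node("node2", [("ilo", ilo_down), ("power_switch", power_switch_ok)])
```

When fence_node returns None, recovery should wait rather than proceed,
which is the opposite of what the dontcare hack does.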

> I will need a supplemental NIC for every server to connect to WTI,  

Actually, it should be on the same network as the cluster uses for
communications, especially in two-node CMAN/DLM clusters; check out:


-- Lon

