
[Linux-cluster] More CS4 fencing fun

Hi, I'm doing failover tests on a CS4 cluster.

I have 2 HP DL380 servers + an HP MSA1000 (aka DL380 packaged cluster).

I already read this post

I'm clustering a single Oracle instance in active/passive mode. I don't use GFS.

I use fence_ilo

I have a fully working clustered Oracle. I tried migrating the Oracle instance from one node to the other using system-config-cluster, and everything works perfectly.

I tried some more "rude" failover tests with this setup:

node1 = active node
node2 = passive node

and these are the results:

Situation 1:

I rudely disconnect the power cable(s) from node1, so that node1 is _completely_ turned off and no current flows into it. iLO is down.

I have redundant power supplies, but I wanted to simulate a short circuit or motherboard failure.

Node2, using fence, tries to power off node1.

fence_ilo tries to connect to node1_ilo_ip_address, but iLO is down because of the power failure, so fencing fails and starts retrying forever.

Result: one node is perfectly up, but the cluster service is stalled.


Situation 2:

I press the on/off button on node1. It shuts off in 4 seconds, but power is still connected, so iLO is up and working.

Node2, using fence, tries to power off node1.

iLO is working, so fence_ilo correctly connects to node1_ilo_ip_address. It tries for some time to power off the already powered-off server, then finally decides that the server is off.

Oracle is STILL down: no virtual IP, no storage mounted, and so on.

Now node2 tries to wake up the turned-off-but-still-powered node1.

Node1 wakes up, boots (the cluster is still stalled), then joins fence_domain. Fence on node2 completes successfully and unlocks the cluster, and everything is up again.

Switch time: 55 seconds (+ oracle startup time).

Situation 3:

This is not a real failover test.

Everything is off. I turn on the MSA1000 and wait for it to boot. Then I turn on node1, but node2 is still electrically disconnected.

Node1 tries to turn on node2 to complete the fence_domain, but node2 is disconnected from power, so it will never wake up.

Cluster is stalled

Can you change fence behaviour to be less "radical"?

If iLO is unreachable, that means the machine is already off and cannot be powered on, so fence should print a warning and let the failover happen.

If iLO is reachable, then fence should check the power status first, to avoid a pointless poweroff/poweron.
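
To make the request concrete, here is a rough sketch of the decision logic I have in mind. It is only an illustration in Python, not how fence_ilo works today; the names `Power`, `Action` and `decide` are made up for this example, and it takes for granted my assumption that an unreachable iLO means the machine has no power.

```python
from enum import Enum


class Power(Enum):
    """Hypothetical result of probing the iLO of the node to be fenced."""
    ON = "on"
    OFF = "off"
    UNREACHABLE = "unreachable"  # iLO did not answer at all


class Action(Enum):
    """Hypothetical actions the fence agent could take."""
    WARN_AND_FAIL_OVER = "warn, then let failover proceed"
    FAIL_OVER = "already off, let failover proceed"
    POWER_OFF_THEN_FAIL_OVER = "power off, then let failover proceed"


def decide(ilo_state: Power) -> Action:
    """Map the observed iLO state to the proposed fencing action."""
    if ilo_state is Power.UNREACHABLE:
        # Assumption from this post: no iLO means no power to the machine.
        return Action.WARN_AND_FAIL_OVER
    if ilo_state is Power.OFF:
        # Skip the pointless poweroff/poweron cycle.
        return Action.FAIL_OVER
    # Node is still powered on: it really has to be switched off first.
    return Action.POWER_OFF_THEN_FAIL_OVER


if __name__ == "__main__":
    for state in Power:
        print(f"{state.value:12} -> {decide(state).value}")
```

With this logic, Situation 1 would take the first branch and Situation 2 the second, so in both cases node2 could take over without waiting for node1.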

As of today fence is really dangerous in a production environment, so for now I will turn it off.

