
Re: [Linux-cluster] manual intervention 1 node when fencing fails due to complete power outage

On 07/02/14 11:13 AM, Benjamin Budts wrote:

We're not all gents. ;)

I have a 2 node setup (with quorum disk), redhat 6.5 & a luci mgmt console.

Everything has been configured and we’re doing failover tests now.

Couple of questions I have :

· When I simulate a complete power failure of a server's PDUs (no more
access to iDRAC fencing or APC PDU fencing), I can see that the fencing
of the node that was running the application fails. I noticed that unless
fencing returns an OK, I'm stuck and my application won't start on my
2nd node. Which is OK I guess, because no fencing could mean there is
still I/O on my SAN.

This is expected. If a lost node can't be put into a known state, there is no safe way to proceed. To do so would be to risk a split brain at least, and data loss/corruption at worst.

The way I deal with this is to have nodes with redundant power supplies and use two PDUs and two UPSes. This way, the failure of one circuit / UPS / PDU doesn't knock out the power to the mainboard of the nodes, so you don't lose IPMI.

Clustat also shows on the active node that the 1st node is still
running the application.

That's likely because rgmanager uses DLM, and DLM blocks until the fence succeeds, so it can't update its view.

How can I intervene manually, so as to force a start of the application
on the node that is still alive ?

If you are *100% ABSOLUTELY SURE* that the lost node has been powered off, then you can run 'fence_ack_manual'. Please be super careful about this though. If you do this in the heat of the moment, with clients or bosses yelling at you, and the peer isn't really off (i.e., it's only hung), you risk serious problems.

I cannot emphasize strongly enough the caution needed when using this command.
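
To make the "be careful" part concrete, here is a rough sketch of a guard you could put around the command. The function name confirm_manual_fence and the DRY_RUN variable are made up for illustration; they are not part of the cluster tools. It also assumes the RHEL 6 form of fence_ack_manual, which takes the node name as its argument, so check the man page on your version before relying on it.

```shell
#!/bin/sh
# Hypothetical wrapper (not a cluster tool): forces the operator to type
# the node name a second time before acknowledging the fence by hand.
# DRY_RUN defaults to 1, so by default it only prints what it would do.
confirm_manual_fence() {
    node="$1"    # node you have physically verified is powered off
    typed="$2"   # the same name, typed again as confirmation

    if [ -z "$node" ] || [ "$typed" != "$node" ]; then
        echo "refusing: confirmation does not match node name" >&2
        return 1
    fi

    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: fence_ack_manual $node"
    else
        # Assumed RHEL 6 syntax: node name as the only argument.
        fence_ack_manual "$node"
    fi
}
```

With DRY_RUN unset, 'confirm_manual_fence node1 node1' only prints the command it would run. Set DRY_RUN=0 once you have walked over to the machine and confirmed it is really off.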

Is there a way to tell the cluster, don’t take into account node 1
anymore and don’t try to fence anymore, just start the application on
the node that is still ok ?

No. That would risk a split brain and data corruption. The only safe option for the cluster, in the face of a failed fence, is to hang. As bad as it is to hang, it's better than risking corruption.

I can’t possibly wait until power returns to that server. Downtime could
be too long.

See the solution I mentioned earlier.

· If I tell a node to leave the cluster in Luci, I would like it to
remain a non-cluster member after the reboot of that node. It rejoins
the cluster automatically after a reboot. Any way to prevent this ?


Don't let cman and rgmanager start on boot. This is always my policy. If a node failed and got fenced, I want it to reboot, so that I can log into it and figure out what happened, but I do _not_ want it back in the cluster until I've determined it is healthy.
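
On RHEL 6 that means turning the two init scripts off with chkconfig. A minimal sketch follows, assuming the stock SysV service names cman and rgmanager. The run helper and the DRY_RUN variable are only there for illustration, so that the snippet prints the commands instead of changing anything until you set DRY_RUN=0.

```shell
#!/bin/sh
# Illustrative helper: print the command unless DRY_RUN=0 is set.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Keep the cluster stack from starting at boot.
run chkconfig rgmanager off
run chkconfig cman off

# Later, once the node has been checked, join the cluster by hand,
# starting cman before rgmanager:
#   service cman start
#   service rgmanager start
```

If you run other cluster-aware services on top (clustered LVM, GFS2 and so on), you would probably want to treat their init scripts the same way, but that depends on your setup.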


Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without access to education?
