[Linux-cluster] Re: Power based fencing in cluster causes single point of failure that can take down a cluster

Josef Whiter wrote:
You can either have redundant fence devices, or look into qdisk.

Thanks for the reply. Can you explain how qdisk would solve the problem? It seems to me that qdisk wouldn't help in the case where the fencing device fails and takes the cluster member down with it.

Does qdisk have some feedback mechanism that tells the cluster that it's OK to restart the failed services on another node even though fencing has not succeeded? I can't see how that can work reliably and still prevent split-brain problems.

On Tue, Jan 09, 2007 at 10:50:53AM -0800, Jonathan Biggar wrote:
If we set up a cluster and use network power switches for fencing, won't a failure of the power switch attached to a cluster member prevent all services that were running on that node from migrating to other cluster members?

This seems to happen to us in practice, because fencing the offline member fails due to the power switch being unavailable, so rgmanager never migrates the failed service(s) to another member.
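
To make the dependency concrete, here is a rough Python sketch of the gating described above. It is only a simplified model, not rgmanager or fenced code: the names fence_node, recover_services, dead_power_switch and backup_device are all hypothetical, and a real cluster would keep retrying rather than return a status string. It assumes recovery is allowed only after some fence method confirms the node is off, and it also shows how a second, independent fence device (the "redundant fence devices" suggestion) would unblock recovery.

```python
from typing import Callable, List

# Hypothetical stand-in for a fence agent: returns True only if the node
# was confirmed powered off.
FenceMethod = Callable[[str], bool]


def fence_node(node: str, methods: List[FenceMethod]) -> bool:
    """Try each fence method in order; stop at the first success."""
    for method in methods:
        try:
            if method(node):
                return True
        except OSError:
            # An unreachable fence device counts as a failed attempt.
            continue
    return False


def recover_services(node: str, methods: List[FenceMethod]) -> str:
    """Relocate services only after the failed node is confirmed fenced."""
    if fence_node(node, methods):
        return "relocated"
    # Without confirmed fencing the node might still be writing to shared
    # storage, so recovery has to wait.
    return "blocked"


def dead_power_switch(node: str) -> bool:
    # Models the failure case: the switch died along with the node.
    raise ConnectionError("power switch unreachable")


def backup_device(node: str) -> bool:
    # Models a second, independent fence device that still works.
    return True


if __name__ == "__main__":
    print(recover_services("node1", [dead_power_switch]))
    print(recover_services("node1", [dead_power_switch, backup_device]))
```

With only the dead power switch configured, the sketch reports the services as blocked; adding the backup device lets fencing succeed and recovery proceed.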

Is there a general solution to this problem that I'm missing?

Jon Biggar
jon levanta com

