
Re: [Linux-cluster] Starter Cluster / GFS



On 10-11-10 10:29 PM, Jankowski, Chris wrote:
> Digimer,
> 
> 1.
> Digimer wrote:
>>>> Both partitions will try to fence the other, but the slower will lose and get fenced before it can fence.
> 
> Well, this is certainly not my experience in dealing with modern rack mounted or blade servers where you use iLO (on HP) or DRAC (on Dell).
> 
> What actually happens in two node clusters is that both servers issue the fence request to the iLO or DRAC. It gets processed and *both* servers get powered off.  Ouch!!  Your 100% HA cluster becomes 100% dead cluster.

That is somewhat frightening. My experience is limited to stock IPMI and
Node Assassin. I've not seen a situation where both die. I'd strongly
suggest that a bug be filed.

> 2.
> Your comment did not explain what role the quorum disk plays in the cluster.  Also, if there are any useful cluster quorum disk heuristics that can be used in this case.
> 
> Thanks and regards,
> 
> Chris Jankowski

Ah, the idea is that, with the quorum disk (ignoring heuristics for the
moment), if only one node is left alive, the quorum disk will contribute
sufficient votes for quorum to be achieved. Of course, this depends on
the surviving node(s) still having access to the qdisk.

Now for heuristics. Consider this:

You have a 7-node cluster:
- Each node gets 1 vote.
- The qdisk gets 6 votes.
- Total votes is 13, quorum then is >= 7.

Your cluster partitions, say from a network failure. Six nodes separate
from a core switch, while one happens to still have access to a critical
route (say, to the Internet). The heuristic test (i.e., pinging an
external server) will pass for the one node and fail for the six others.

The one node with access to the critical route will be the one to get
the votes of the quorum disk (1 + 6 = 7, quorum!) while the other six
will get six votes (1 + 1 + 1 + 1 + 1 + 1 = 6, no quorum). The six nodes
will lose and be fenced and will not be able to rejoin the cluster until
they regain access to that critical route.
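
The vote counting above can be sketched in a few lines of Python. This
is only an illustration of the arithmetic, not how cman or qdiskd are
implemented. The function names are made up for the example, and it
assumes quorum is a simple majority of the total votes
(total // 2 + 1), which matches the ">= 7 of 13" figure above.

```python
def quorum_threshold(total_votes: int) -> int:
    """Smallest vote count that is a strict majority of total_votes."""
    return total_votes // 2 + 1


def partition_votes(node_count: int, holds_qdisk: bool, qdisk_votes: int) -> int:
    """One vote per node, plus the qdisk's votes if this partition
    passes the heuristic and so is granted the quorum disk."""
    return node_count + (qdisk_votes if holds_qdisk else 0)


NODES = 7
QDISK_VOTES = 6                    # as in the 7-node example
TOTAL = NODES + QDISK_VOTES        # 13
NEEDED = quorum_threshold(TOTAL)   # 7

# The lone node passes the heuristic (it can still ping out);
# the other six fail it and so never get the qdisk votes.
lone = partition_votes(1, holds_qdisk=True, qdisk_votes=QDISK_VOTES)
rest = partition_votes(6, holds_qdisk=False, qdisk_votes=QDISK_VOTES)

for name, votes in (("lone node", lone), ("other six", rest)):
    state = "quorate" if votes >= NEEDED else "inquorate"
    print(f"{name}: {votes} of {NEEDED} needed -> {state}")
```

Giving the qdisk one vote fewer than the node count is what lets a
single surviving node reach quorum while still keeping the six-node
partition one vote short without it.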

-- 
Digimer
E-Mail: digimer alteeve com
AN!Whitepapers: http://alteeve.com
Node Assassin:  http://nodeassassin.org

