I am currently using an n-node configuration with a qdiskd process to sustain an (n-1)-node failure.
The simplest case is a two-node cluster:
<cluster config_version="79" name="xxx">
  <cman expected_votes="3" two_node="0"/>
  <clusternodes>
    <clusternode name="n1" nodeid="1" votes="1"/>
    <clusternode name="n2" nodeid="2" votes="1"/>
  </clusternodes>
  <quorumd cman_label="qdisk1" device="/dev/yyy" interval="2" tko="10" votes="1" reboot="0" allow_kill="0" status_file="/qdiskstat"/>
</cluster>
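For reference, here is the vote arithmetic implied by the config above (a sketch; all the numbers come from the cluster.conf attributes):

```shell
# Vote arithmetic for the two-node + qdisk config (sketch).
node_votes=2                              # n1 + n2, votes="1" each
qdisk_votes=1                             # quorumd votes="1"
expected=$((node_votes + qdisk_votes))    # matches expected_votes="3"
quorum=$((expected / 2 + 1))              # simple majority of 3 is 2
echo "expected=$expected quorum=$quorum"
# A node holding only its own vote has 1 < 2, so it is inquorate.
```

So the surviving node stays quorate only as long as it holds its own vote plus the qdisk vote; losing either one drops it below quorum.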
I sometimes experience a loss of quorum on the other node when I gracefully shut down a node using the following:
# service rgmanager stop
# service qdiskd stop
# service cman stop
After looking into the problem more closely, I discovered that the node I shut down is the qdisk master node. So when I stop qdiskd and cman on the first node, the second node experiences a loss of the qdisk vote (it sees that the qdisk master is no longer available and starts the election of a new master) and, almost simultaneously, a loss of the first node's vote, because that node has left the cluster.
The effect is that the second node loses quorum for about 20 seconds, the time it takes to elect itself as the new qdisk master. The problem is that rgmanager sees the loss of quorum and shuts down all the virtual machines under its control!
If I wait 20 seconds between "service qdiskd stop" and "service cman stop", the problem does not occur, because the second node has time to elect itself master.
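As a stopgap, the whole shutdown can be staggered in a small script (a sketch, not a fix; it assumes the init-script names used above, and the 20 s wait matches the observed re-election time, which is roughly interval * tko = 2 * 10 from the quorumd line):

```shell
#!/bin/sh
# Staggered graceful shutdown (workaround sketch).
WAIT=$((2 * 10))        # ~ interval * tko from cluster.conf; observed 20s

service rgmanager stop  # stop managed services first
service qdiskd stop     # the peer now starts a qdisk master election
sleep "$WAIT"           # give the peer time to elect itself master
service cman stop       # only now withdraw this node's cman vote
```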
I thought qdiskd was supposed to be a process that maintains quorum independently of cman communication.
Either I am making a mistake or misusing qdiskd, or something needs to change in the handling of qdiskd votes.
One solution might be for a node that was not the qdiskd master, and was contributing a vote to cman, to keep advertising that vote until the new master election succeeds, instead of withdrawing it for the duration of the re-election?