[Linux-cluster] Two-node cluster: Node attempts stateful merge after clean reboot

Wed Sep 11 17:31:47 UTC 2013

On 11/09/13 08:50, Pascal Ehlert wrote:
>> The problem is that, if you enable cman on boot, the fenced node will
>> try to join the cluster, fail to reach its peer after post_join_delay
>> (default 6 seconds, iirc) and fence its peer. That peer reboots,
>> starts cman, tries to connect, fences its peer...
>>
>> The easiest way to avoid this in 2-node clusters is to not let
>> cman/rgmanager start automatically. That way, if a node is fenced, it
>> will boot back up and you can log into it remotely (assuming it's not
>> totally dead). When you know things are fixed, manually start cman.
>>
> In my case however, the node which is trying to join is fully operational
> and has network access. Also if you look at the configuration that I had
> in my original email, my post_join_delay is 360 (for testing purposes),
> so there is no way that a timeout occurs.
>
> I might be wrong here, but judging from corosync's log file, the other
> node even joins the cluster successfully, before being marked for
> fencing by dlm_controld:
>
>     Sep 11 11:14:09 corosync [CLM   ] CLM CONFIGURATION CHANGE
>     Sep 11 11:14:09 corosync [CLM   ] New Configuration:
>     Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.1)
>     Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
>     Sep 11 11:14:09 corosync [CLM   ] Members Left:
>     Sep 11 11:14:09 corosync [CLM   ] Members Joined:
>     Sep 11 11:14:09 corosync [CLM   ]     r(0) ip(10.xx.xx.2)
>     Sep 11 11:14:09 corosync [TOTEM ] A processor joined or left the membership and a new membership was formed.
>     Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2
>     Sep 11 11:14:09 corosync [QUORUM] Members[2]: 1 2

Setting post_join_delay to 360 will buy you 6 minutes from the start of 
cman until the fence occurs.

That log message does show the node joining. Can you reliably reproduce 
this? If so, can you please 'tail -f -n 0 /var/log/messages' on both 
nodes, break the cluster and wait for the node to restart, 'tail' the 
rebooted node's /var/log/messages, wait the six minutes and then, after 
the second fence occurs, post both nodes' logs?
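
If it helps, here is a rough sketch of how the capture could be scripted so 
that only the lines written during the test end up in a file. The function 
names, the LOG_SRC variable and the output file name are my own placeholders, 
not part of any cluster tool; it only wraps the same 'tail -f -n 0' as above.

```shell
#!/bin/sh
# Sketch only: capture the syslog lines written during one test run.
# LOG_SRC, start_capture and stop_capture are hypothetical names.

LOG_SRC="${LOG_SRC:-/var/log/messages}"

# start_capture OUTFILE
# Follows LOG_SRC from its current end (-n 0), so older entries are skipped.
# Stores the PID of the background tail in CAPTURE_PID.
start_capture() {
    tail -f -n 0 "$LOG_SRC" > "$1" 2>/dev/null &
    CAPTURE_PID=$!
}

# stop_capture
# Ends the background tail started by start_capture.
stop_capture() {
    kill "$CAPTURE_PID" 2>/dev/null || true
    wait "$CAPTURE_PID" 2>/dev/null || true
}

# Example (run on each node before breaking the cluster):
#   start_capture "/root/fence-test-$(hostname -s).log"
#   ... break the cluster, wait for the reboot and the second fence ...
#   stop_capture
```

The capture on the fenced node dies with the reboot, so it would need to be 
started again by hand once that node is back up.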

-- 
Digimer
Papers and Projects: https://alteeve.ca/w/
What if the cure for cancer is trapped in the mind of a person without 
access to education?