[Linux-cluster] [cman] can't join cluster after reboot

Yuriy Demchenko demchenko.ya at gmail.com
Fri Nov 8 08:16:43 UTC 2013


Thanks, the problem is indeed in multicast. Switching to udpu brought the 
cluster back to normal operation.
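
(For the record, the switch was a one-line change in 
/etc/cluster/cluster.conf - roughly adding

  <cman transport="udpu"/>

inside the <cluster> element, bumping the config version and restarting 
cman on all nodes.)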

Any tips on how to fix multicast operation? IGMP snooping on the switch 
is disabled, and the firewall is disabled too.
In fact, what confuses me is that the node can't join the cluster after a 
reboot no matter how long I wait and no matter how many times I run 
"service cman restart" on that node - it just doesn't work until cman is 
restarted on some other node.
Another strange thing - I used tcpdump to capture UDP traffic, and there 
was no UDP traffic at all from node-1 after the reboot and none after the 
service restarts. But as soon as the service was restarted on another 
node, UDP traffic to the multicast address appeared from node-1.
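
(The capture was roughly

  tcpdump -n -i eth0 udp and host 239.192.8.19

on each node - eth0 standing in for the cluster interface, 239.192.8.19 
being the multicast address from cman_tool status.)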

I've also tried switching IGMP snooping on, but that caused the cluster 
to stop working entirely - each node saw only itself. On the switch I saw 
that the multicast group was created and each corresponding port became a 
member of that group, but the packet statistics showed only a few 
"report v3" packets, and no query/leave/error packets.
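
(As far as I understand, a snooping switch also needs an active IGMP 
querier on the segment, otherwise it ages the group out and stops 
forwarding - which would match the "only report v3, no query packets" 
picture above. Something like

  tcpdump -n -i eth0 igmp

on a node should show periodic membership queries if a querier is 
present; again, eth0 is just a placeholder for the cluster interface.)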


Yuriy Demchenko

On 11/07/2013 05:47 PM, Christine Caulfield wrote:
> On 07/11/13 12:04, Yuriy Demchenko wrote:
>> Hi,
>>
>> I'm trying to set up a 3-node cluster (2 nodes + 1 standby node for
>> quorum) with the cman+pacemaker stack, everything according to this
>> quickstart article: http://clusterlabs.org/quickstart-redhat.html
>>
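>> (For reference, /etc/cluster/cluster.conf follows the minimal layout
>> from that article, with the fencing parts left out here - cluster name,
>> config version and node names match the cman_tool output below:
>>
>> <cluster name="ocluster" config_version="10">
>>   <clusternodes>
>>     <clusternode name="node-1.spb.stone.local" nodeid="1"/>
>>     <clusternode name="node-2.spb.stone.local" nodeid="2"/>
>>     <clusternode name="vnode-3.spb.stone.local" nodeid="3"/>
>>   </clusternodes>
>> </cluster>
>> )
>>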
>> The cluster starts, all nodes see each other, quorum is gained, stonith
>> works, but I've run into a problem with cman: a node can't join the
>> cluster after a reboot - cman starts, and cman_tool nodes reports only
>> that node as a cluster member, while on the other 2 nodes it reports 2
>> nodes as cluster members and the 3rd as offline. cman
>> stop/start/restart on the problem node has no effect - it still sees
>> only itself, but if I do a cman restart on one of the working nodes,
>> everything goes back to normal: all 3 nodes join the cluster, and
>> subsequent cman service restarts on any node work fine - the node
>> leaves the cluster and rejoins successfully. But again - only until the
>> node's OS is rebooted.
>>
>> For example:
>> [1] Working cluster:
>>> [root at node-1 ~]# cman_tool nodes
>>> Node  Sts   Inc   Joined               Name
>>>    1   M    592   2013-11-07 15:20:54  node-1.spb.stone.local
>>>    2   M    760   2013-11-07 15:20:54  node-2.spb.stone.local
>>>    3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
>>> [root at node-1 ~]# cman_tool status
>>> Version: 6.2.0
>>> Config Version: 10
>>> Cluster Name: ocluster
>>> Cluster Id: 2059
>>> Cluster Member: Yes
>>> Cluster Generation: 760
>>> Membership state: Cluster-Member
>>> Nodes: 3
>>> Expected votes: 3
>>> Total votes: 3
>>> Node votes: 1
>>> Quorum: 2
>>> Active subsystems: 7
>>> Flags:
>>> Ports Bound: 0
>>> Node name: node-1.spb.stone.local
>>> Node ID: 1
>>> Multicast addresses: 239.192.8.19
>>> Node addresses: 192.168.220.21
>> The picture is the same on all 3 nodes (except for node name and ID) -
>> same cluster name, cluster ID, multicast address.
>>
>> [2] I rebooted node-1. After the reboot completed, "cman_tool nodes"
>> on node-2 and vnode-3 shows this:
>>> Node  Sts   Inc   Joined Name
>>>    1   X    760                        node-1.spb.stone.local
>>>    2   M    588   2013-11-07 15:11:23  node-2.spb.stone.local
>>>    3   M    760   2013-11-07 15:20:54  vnode-3.spb.stone.local
>>> [root at node-2 ~]# cman_tool status
>>> Version: 6.2.0
>>> Config Version: 10
>>> Cluster Name: ocluster
>>> Cluster Id: 2059
>>> Cluster Member: Yes
>>> Cluster Generation: 764
>>> Membership state: Cluster-Member
>>> Nodes: 2
>>> Expected votes: 3
>>> Total votes: 2
>>> Node votes: 1
>>> Quorum: 2
>>> Active subsystems: 7
>>> Flags:
>>> Ports Bound: 0
>>> Node name: node-2.spb.stone.local
>>> Node ID: 2
>>> Multicast addresses: 239.192.8.19
>>> Node addresses: 192.168.220.22
>> But on the rebooted node-1 it shows this:
>>> Node  Sts   Inc   Joined Name
>>>    1   M    764   2013-11-07 15:49:01  node-1.spb.stone.local
>>>    2   X      0                        node-2.spb.stone.local
>>>    3   X      0                        vnode-3.spb.stone.local
>>> [root at node-1 ~]# cman_tool status
>>> Version: 6.2.0
>>> Config Version: 10
>>> Cluster Name: ocluster
>>> Cluster Id: 2059
>>> Cluster Member: Yes
>>> Cluster Generation: 776
>>> Membership state: Cluster-Member
>>> Nodes: 1
>>> Expected votes: 3
>>> Total votes: 1
>>> Node votes: 1
>>> Quorum: 2 Activity blocked
>>> Active subsystems: 7
>>> Flags:
>>> Ports Bound: 0
>>> Node name: node-1.spb.stone.local
>>> Node ID: 1
>>> Multicast addresses: 239.192.8.19
>>> Node addresses: 192.168.220.21
>> So, same cluster name, cluster ID, multicast address - but it can't
>> see the other nodes. And there is nothing in /var/log/messages or
>> /var/log/cluster/corosync.log on the other two nodes - they don't seem
>> to notice node-1 coming back online at all; the last records are about
>> node-1 leaving the cluster.
>>
>> [3] If I now do "service cman restart" on node-2 or vnode-3,
>> everything goes back to normal operation as in [1].
>> In the logs it shows up as node-2 leaving the cluster (service stop)
>> and both node-2 and node-1 joining simultaneously (service start):
>>> Nov  7 11:47:06 vnode-3 corosync[26692]: [QUORUM] Members[2]: 2 3
>>> Nov  7 11:47:06 vnode-3 corosync[26692]:   [TOTEM ] A processor joined
>>> or left the membership and a new membership was formed.
>>> Nov  7 11:47:06 vnode-3 kernel: dlm: closing connection to node 1
>>> Nov  7 11:47:06 vnode-3 corosync[26692]:   [CPG   ] chosen downlist:
>>> sender r(0) ip(192.168.220.22) ; members(old:3 left:1)
>>> Nov  7 11:47:06 vnode-3 corosync[26692]:   [MAIN  ] Completed service
>>> synchronization, ready to provide service.
>>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [QUORUM] Members[1]: 3
>>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [TOTEM ] A processor joined
>>> or left the membership and a new membership was formed.
>>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [CPG   ] chosen downlist:
>>> sender r(0) ip(192.168.220.14) ; members(old:2 left:1)
>>> Nov  7 11:53:28 vnode-3 corosync[26692]:   [MAIN  ] Completed service
>>> synchronization, ready to provide service.
>>> Nov  7 11:53:28 vnode-3 kernel: dlm: closing connection to node 2
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [TOTEM ] A processor joined
>>> or left the membership and a new membership was formed.
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[2]: 1 3
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [QUORUM] Members[3]: 1 2 3
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [CPG   ] chosen downlist:
>>> sender r(0) ip(192.168.220.21) ; members(old:1 left:0)
>>> Nov  7 11:53:30 vnode-3 corosync[26692]:   [MAIN  ] Completed service
>>> synchronization, ready to provide service.
>>
>> I've set up such a cluster before in much the same configuration and
>> never had any problems, but now I'm completely stuck.
>> So, what is wrong with my cluster and how do I fix it?
>>
>> OS: CentOS 6.4 with the latest updates, firewall disabled, SELinux
>> permissive, all 3 nodes on the same network. Multicast is working -
>> checked with omping.
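>> (Checked roughly like this, run on every node:
>>   omping node-1.spb.stone.local node-2.spb.stone.local vnode-3.spb.stone.local
>> with multicast responses coming back from all peers.)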
>> cman.x86_64                   3.0.12.1-49.el6_4.2 @centos6-updates
>> corosync.x86_64               1.4.1-15.el6_4.1 @centos6-updates
>> pacemaker.x86_64              1.1.10-1.el6_4.4 @centos6-updates
>>
>
>
>
> omping working doesn't guarantee that multicast is working. If it
> fails then multicast doesn't work, but the converse is not necessarily
> true.
>
> I suspect that multicast is causing the trouble here, so try broadcast 
> or udpu as transport and see if that helps. Often the switch will take 
> some time to recognise the multicast addresses when a node starts up 
> and that will break the join protocol.
>
> Chrissie
>



