[Linux-cluster] CMAN: sending membership request, unable to join cluster.

Wed Feb 11 09:17:30 UTC 2009

thijn wrote:
> Hi,
> 
> I have the following problem.
> CMAN: removing node [server1] from the cluster : Missed too many
> heartbeats
> When the server comes back up:
> Feb 10 14:43:58 server1 kernel: CMAN: sending membership request
> after which it will keep trying to join until the end of time.
> 
> In the current case, server2 is active and server1 is the one that is
> unable to join the cluster.
> 
> The setup is a two-server cluster.
> We have had the problem on several clusters.
> We usually "fixed" it by rebooting the other node, after which the cluster
> would repair itself and everything ran smoothly from then on.
> Naturally this disrupts any services running on the cluster, and it's
> not really a solution that will win prizes.
> The problem is that server1 (the problem one) is in an inquorate state and
> we are unable to get it to a quorate state, nor do we see why this
> is the case.
> We tried to reproduce the problem on a test setup, but were unable to.
> 
> So we decided to try to find a way to fix the state of the cluster using
> the tools the system provides.
> 
> The problem presents itself after a fence action by either node.
> When we bring down both nodes to stabilize things, the cluster
> becomes healthy, and after that we can reboot either node and it
> will rejoin the cluster.
> It seems the problem presents itself when "pulling the plug" out of the
> server.
> We run on IBM Xservers using the SA-adapter as a fence device.
> The fence device is in a different subnet than the subnet on which the
> cluster communicates.
> Both fence devices are on the same subnet/vlan.
> 
> CentOS release 4.6 (Final)
> Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686
> i686 i386 GNU/Linux
> cman_tool 1.0.17 (built Mar 20 2007 17:10:52)
> Copyright (C) Red Hat, Inc.  2004  All rights reserved.
> 
> The versions of all libraries, packages, kernel modules and everything
> else the GFS cluster depends on are identical on both nodes.
> 
> Cluster.conf
> [root@server1 log]# cat /etc/cluster/cluster.conf
> <?xml version="1.0"?>
> <cluster config_version="3" name="NAME_cluster">
> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> <clusternodes>
> <clusternode name="server1.production.loc" votes="1">
> <fence>
> <method name="1">
> <device name="saserver1"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="server2.production.loc" votes="1">
> <fence>
> <method name="1">
> <device name="saserver2"/>
> </method>
> </fence>
> </clusternode>
> </clusternodes>
> <cman expected_votes="1" two_node="1"/>
> <fencedevices>
> <fencedevice agent="fence_rsa" ipaddr="10.13.110.114" login="saadapter"
> name="saserver1" passwd="XXXXXXX"/>
> <fencedevice agent="fence_rsa" ipaddr="10.13.110.115" login="saadapter"
> name="saserver2" passwd="XXXXXXX"/>
> </fencedevices>
> <rm>
> <failoverdomains/>
> <resources/>
> </rm>
> </cluster>
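
The two_node settings in a config like the one quoted above can be sanity-checked with a short script. This is only a sketch in Python: the function name check_two_node and the trimmed sample config are made up for illustration, and it checks nothing beyond the rule that two_node="1" goes together with expected_votes="1" and exactly two nodes.

```python
import xml.etree.ElementTree as ET

# Trimmed-down sample modelled on the quoted cluster.conf (fence sections omitted).
SAMPLE_CONF = """<cluster config_version="3" name="NAME_cluster">
<clusternodes>
<clusternode name="server1.production.loc" votes="1"/>
<clusternode name="server2.production.loc" votes="1"/>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
</cluster>"""


def check_two_node(conf_xml):
    """Return a list of problems with the two_node settings (empty if none)."""
    root = ET.fromstring(conf_xml)
    cman = root.find("cman")
    if cman is None:
        return ["no <cman> element found"]
    problems = []
    if cman.get("two_node") == "1":
        # Two-node mode only makes sense with a single expected vote.
        if cman.get("expected_votes") != "1":
            problems.append('two_node="1" needs expected_votes="1"')
        node_count = len(root.findall("./clusternodes/clusternode"))
        if node_count != 2:
            problems.append('two_node="1" needs exactly 2 nodes, found %d' % node_count)
    return problems


if __name__ == "__main__":
    print(check_two_node(SAMPLE_CONF) or "two_node settings look consistent")
```

The quoted config passes this check, so the settings themselves do not look like the cause here.
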
> 
> [root@server1 log]# cat /etc/hosts
> 127.0.0.1 localhost.localdomain localhost
> 
> Both servers are able to ping each other and also the broadcast address,
> so there is no firewall filtering UDP packets.
> When I tcpdump the line I see traffic going both ways.
> 
> Both servers are in the same vlan
> 14:51:28.703240 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> 17, length: 56) server2.production.loc.6809 >
> broadcast.production.loc.6809: UDP, length 28
> 14:51:28.703277 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> 17, length: 140) server1.production.loc.6809 >
> server2.production.loc.6809: UDP, length 112
> 14:51:33.703240 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> 17, length: 56) server2.production.loc.6809 >
> broadcast.production.loc.6809: UDP, length 28
> 14:51:33.703310 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
> 17, length: 140) server1.production.loc.6809 >
> server2.production.loc.6809.6809: UDP, length 112
> 
> Is this normal network behavior when a cluster is inquorate?
> I see that server1 is talking to server2, but server2 is only talking in
> broadcasts.
> 
> When I start or try to join the cluster:
> Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed
> 
> [root@server1 ~]# cman_tool status
> Protocol version: 5.0.1
> Config version: 3
> Cluster name: NAME_cluster
> Cluster ID: 64692
> Cluster Member: No
> Membership state: Joining
> 
> [root@server2 log]# cman_tool status
> Protocol version: 5.0.1
> Config version: 3
> Cluster name: RWSEems_cluster
> Cluster ID: 64692
> Cluster Member: Yes
> Membership state: Cluster-Member
> Nodes: 1
> Expected_votes: 1
> Total_votes: 1
> Quorum: 1   
> Active subsystems: 7
> Node name: server2.production.loc
> Node ID: 2
> Node addresses: server1.production.loc
> 
> [root@server1 ~]# cman_tool nodes
> Node  Votes Exp Sts  Name
> 
> [root@server2 log]# cman_tool nodes
> Node  Votes Exp Sts  Name
>    1    1    1   X   server1.production.loc
>    2    1    1   M   server2.production.loc
> 
> When I start cman:
> service cman start
> 
> Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a
> Linux-cluster
> Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture
> via: CMAN/SM Plugin v1.1.7.4
> Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate
> 
> 
> It seems to me that this should be fixable with the tools provided
> with the Red Hat Cluster Suite, without disturbing the running cluster.
> It seems quite insane if I need to restart my cluster to get it all
> working again; that kind of spoils the idea of running a cluster.
> This setup is running in an HA environment and we can afford next to no
> downtime.
> 
> The logs on the healthy server (server2) do not mention or complain
> about any errors when rebooting, restarting cman or when server1 wants
> to join the cluster.
> We see no "disallowed", "refused" or anything else suggesting that
> server2 is not willing to play with server1.
> 
> I have been looking at this thing for a while now. Am I missing
> anything?
> 

This is a known bug, see

https://bugzilla.redhat.com/show_bug.cgi?id=475293

It's fixed in 4.7 or you can run a program to set up a workaround.
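
As a rough way to tell whether a box predates that release, the release line can be compared against 4.7. This is a sketch in Python, not a definitive test: the helper names release_tuple and predates_fix are made up here, and the distribution release string is only a proxy, since what actually matters is the version of the installed cluster packages.

```python
import re


def release_tuple(release_line):
    """Extract (major, minor) from a line such as 'CentOS release 4.6 (Final)'."""
    match = re.search(r"release\s+(\d+)\.(\d+)", release_line)
    if match is None:
        raise ValueError("no release number found in: %r" % release_line)
    return int(match.group(1)), int(match.group(2))


def predates_fix(release_line, fixed_in=(4, 7)):
    """True for a 4.x release older than the one said to carry the fix.

    Other major versions return False, because nothing is claimed about them here.
    """
    major, minor = release_tuple(release_line)
    return major == fixed_in[0] and minor < fixed_in[1]


if __name__ == "__main__":
    # Release line taken from the original report.
    print(predates_fix("CentOS release 4.6 (Final)"))
```
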

Having said that, I have heard reports of it still happening in some
circumstances, but I don't have any more detail.
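
If it helps to spot the stuck state from a script rather than by eye, something along these lines could parse the `cman_tool status` output quoted above. It is a minimal sketch in Python: the function names parse_cman_status and stuck_joining are hypothetical, and it assumes the output keeps the plain "Key: value" layout shown in the report.

```python
# Sample text copied from the server1 output in the original report.
SERVER1_STATUS = """Protocol version: 5.0.1
Config version: 3
Cluster name: NAME_cluster
Cluster ID: 64692
Cluster Member: No
Membership state: Joining"""


def parse_cman_status(text):
    """Turn 'Key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def stuck_joining(fields):
    """True if the node is not a member and is still in the Joining state."""
    return (fields.get("Cluster Member") == "No"
            and fields.get("Membership state") == "Joining")


if __name__ == "__main__":
    print(stuck_joining(parse_cman_status(SERVER1_STATUS)))
```

A node that reports this state for much longer than a normal join takes is a candidate for the bug above, although the check on its own cannot tell that apart from a join that is merely slow.
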

-- 

Chrissie