[Linux-cluster] CMAN: sending membership request, unable to join cluster.

Wed Feb 11 08:57:24 UTC 2009

Hi,

I have the following problem. One node gets removed from the cluster:
CMAN: removing node [server1] from the cluster : Missed too many heartbeats
When the server comes back up, it logs:
Feb 10 14:43:58 server1 kernel: CMAN: sending membership request
after which it keeps trying to join indefinitely.

In the current case, server2 is active and server1 is unable to join
the cluster.

The setup is a two-node cluster.
We have had the problem on several clusters.
We usually "fixed" it by rebooting the other node, after which the cluster
would repair itself and everything ran smoothly from then on.
Naturally this disrupts any services running on the cluster, and it's
not really a solution that will win any prizes.
The problem is that server1 (the problem node) is in an inquorate state and
we are unable to get it into a quorate state, nor do we see why this
is the case.
We tried to reproduce the problem on a test setup, but were unable to.

So we decided to try to find a way to fix the state of the cluster using
the tools the system provides.

The problem presents itself after a fence action by either node.
When we bring down both nodes to clear the issue, the cluster
becomes healthy, and after that we can reboot either node and it
will rejoin the cluster.
It seems the problem presents itself when "pulling the plug" out of the
server.
We run on IBM Xservers using the SA-adapter as a fence device.
The fence devices are in a different subnet than the subnet on which the
cluster communicates.
Both fence devices are on the same subnet/VLAN.

CentOS release 4.6 (Final)
Linux server2 2.6.9-67.ELsmp #1 SMP Fri Nov 16 12:48:03 EST 2007 i686
i686 i386 GNU/Linux
cman_tool 1.0.17 (built Mar 20 2007 17:10:52)
Copyright (C) Red Hat, Inc.  2004  All rights reserved.

The versions of all libraries, packages, kernel modules and everything
else the GFS cluster depends on are identical on both nodes.

Cluster.conf
[root@server1 log]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="3" name="NAME_cluster">
<fence_daemon post_fail_delay="0" post_join_delay="3"/>
<clusternodes>
<clusternode name="server1.production.loc" votes="1">
<fence>
<method name="1">
<device name="saserver1"/>
</method>
</fence>
</clusternode>
<clusternode name="server2.production.loc" votes="1">
<fence>
<method name="1">
<device name="saserver2"/>
</method>
</fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_rsa" ipaddr="10.13.110.114" login="saadapter"
name="saserver1" passwd="XXXXXXX"/>
<fencedevice agent="fence_rsa" ipaddr="10.13.110.115" login="saadapter"
name="saserver2" passwd="XXXXXXX"/>
</fencedevices>
<rm>
<failoverdomains/>
<resources/>
</rm>
</cluster>
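
As a sanity check on the configuration above, here is a minimal Python sketch that parses a cluster.conf and looks for two kinds of inconsistency. The function name check_cluster_conf and the trimmed SAMPLE_CONF are hypothetical, and the rule that two_node="1" belongs with exactly two nodes and expected_votes="1" is an assumption about how cman's two-node mode is meant to be configured, not something stated in this post.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf shown above (login/passwd attributes omitted).
SAMPLE_CONF = """<cluster config_version="3" name="NAME_cluster">
<clusternodes>
<clusternode name="server1.production.loc" votes="1">
<fence><method name="1"><device name="saserver1"/></method></fence>
</clusternode>
<clusternode name="server2.production.loc" votes="1">
<fence><method name="1"><device name="saserver2"/></method></fence>
</clusternode>
</clusternodes>
<cman expected_votes="1" two_node="1"/>
<fencedevices>
<fencedevice agent="fence_rsa" ipaddr="10.13.110.114" name="saserver1"/>
<fencedevice agent="fence_rsa" ipaddr="10.13.110.115" name="saserver2"/>
</fencedevices>
</cluster>"""


def check_cluster_conf(xml_text):
    """Return a list of problems found in a cluster.conf string (empty if none)."""
    root = ET.fromstring(xml_text)
    problems = []
    nodes = root.findall("clusternodes/clusternode")
    cman = root.find("cman")

    # Assumption: two_node="1" only makes sense with two nodes and expected_votes="1".
    if cman is not None and cman.get("two_node") == "1":
        if len(nodes) != 2:
            problems.append("two_node is set but %d nodes are defined" % len(nodes))
        if cman.get("expected_votes") != "1":
            problems.append("two_node is set but expected_votes is not 1")

    # Every fence device a node refers to should be defined under <fencedevices>.
    defined = {dev.get("name") for dev in root.findall("fencedevices/fencedevice")}
    for node in nodes:
        used = [dev.get("name") for dev in node.findall("fence/method/device")]
        if not used:
            problems.append("%s has no fence device" % node.get("name"))
        for name in used:
            if name not in defined:
                problems.append(
                    "%s uses undefined fence device %s" % (node.get("name"), name)
                )
    return problems


if __name__ == "__main__":
    print(check_cluster_conf(SAMPLE_CONF))
```

For the configuration in this post the check comes back empty, which suggests the join problem is not caused by an obvious mismatch in the file itself.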

[root@server1 log]# cat /etc/hosts
127.0.0.1 localhost.localdomain localhost

Both servers are able to ping each other and also the broadcast address,
so as far as we can tell no firewall is filtering the UDP packets.
Both servers are in the same VLAN.
When I tcpdump the line I see traffic going both ways:

14:51:28.703240 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
17, length: 56) server2.production.loc.6809 >
broadcast.production.loc.6809: UDP, length 28
14:51:28.703277 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
17, length: 140) server1.production.loc.6809 >
server2.production.loc.6809: UDP, length 112
14:51:33.703240 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
17, length: 56) server2.production.loc.6809 >
broadcast.production.loc.6809: UDP, length 28
14:51:33.703310 IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto
17, length: 140) server1.production.loc.6809 >
server2.production.loc.6809: UDP, length 112

Is this normal network behavior when a cluster is inquorate?
I see that server1 is talking to server2, but server2 is only talking in
broadcasts.
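
To make the "who talks to whom" observation easier to repeat on a longer capture, here is a rough Python sketch that tallies source/destination pairs from tcpdump output. It assumes the wrapped lines have first been joined into one line per packet; the names CAPTURE, FLOW_RE and tally_flows are made up for illustration.

```python
import re
from collections import Counter

# The four packets from the capture above, one packet per entry
# (the mail wrapping has been undone by hand).
HEADER = "IP (tos 0x0, ttl  64, id 0, offset 0, flags [DF], proto 17, length: %d) "
CAPTURE = [
    "14:51:28.703240 " + HEADER % 56
    + "server2.production.loc.6809 > broadcast.production.loc.6809: UDP, length 28",
    "14:51:28.703277 " + HEADER % 140
    + "server1.production.loc.6809 > server2.production.loc.6809: UDP, length 112",
    "14:51:33.703240 " + HEADER % 56
    + "server2.production.loc.6809 > broadcast.production.loc.6809: UDP, length 28",
    "14:51:33.703310 " + HEADER % 140
    + "server1.production.loc.6809 > server2.production.loc.6809: UDP, length 112",
]

# Matches "<src host>.<port> > <dst host>.<port>: UDP"
FLOW_RE = re.compile(r"(\S+)\.(\d+) > (\S+)\.(\d+): UDP")


def tally_flows(lines):
    """Count UDP packets per (source host, destination host) pair."""
    flows = Counter()
    for line in lines:
        match = FLOW_RE.search(line)
        if match:
            src, _sport, dst, _dport = match.groups()
            flows[(src, dst)] += 1
    return flows


if __name__ == "__main__":
    for (src, dst), count in sorted(tally_flows(CAPTURE).items()):
        print("%s -> %s: %d" % (src, dst, count))
```

On this capture the tally has no entry for server2 sending directly to server1, which is the asymmetry the question is about.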

When I start cman or try to join the cluster:
Feb 10 09:36:06 server1 cman: cman_tool: Node is already active failed

[root@server1 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 3
Cluster name: NAME_cluster
Cluster ID: 64692
Cluster Member: No
Membership state: Joining

[root@server2 log]# cman_tool status
Protocol version: 5.0.1
Config version: 3
Cluster name: RWSEems_cluster
Cluster ID: 64692
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1   
Active subsystems: 7
Node name: server2.production.loc
Node ID: 2
Node addresses: server1.production.loc

[root@server1 ~]# cman_tool nodes
Node  Votes Exp Sts  Name

[root@server2 log]# cman_tool nodes
Node  Votes Exp Sts  Name
   1    1    1   X   server1.production.loc
   2    1    1   M   server2.production.loc
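
Comparing the two `cman_tool status` outputs by eye is error-prone, so here is a small Python sketch that parses the "key: value" lines and reports the fields that differ between the two nodes. The helper names parse_status and diff_status are hypothetical, and the sample strings are simply the outputs pasted above.

```python
# Outputs of "cman_tool status" as pasted above.
SERVER1_STATUS = """Protocol version: 5.0.1
Config version: 3
Cluster name: NAME_cluster
Cluster ID: 64692
Cluster Member: No
Membership state: Joining"""

SERVER2_STATUS = """Protocol version: 5.0.1
Config version: 3
Cluster name: RWSEems_cluster
Cluster ID: 64692
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 7
Node name: server2.production.loc
Node ID: 2
Node addresses: server1.production.loc"""


def parse_status(text):
    """Turn 'key: value' lines into a dict; lines without a colon are skipped."""
    fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def diff_status(a, b):
    """Return {key: (value_a, value_b)} for keys present in both dicts that differ."""
    return {key: (a[key], b[key]) for key in a if key in b and a[key] != b[key]}


if __name__ == "__main__":
    diff = diff_status(parse_status(SERVER1_STATUS), parse_status(SERVER2_STATUS))
    for key in sorted(diff):
        print("%s: %s | %s" % (key, diff[key][0], diff[key][1]))
```

Only fields reported by both nodes are compared, since a node that is still joining prints a much shorter status.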

When I start cman:
service cman start

Feb 10 14:06:30 server1 kernel: CMAN: Waiting to join or form a
Linux-cluster
Feb 10 14:06:30 server1 ccsd[21964]: Connected to cluster infrastruture
via: CMAN/SM Plugin v1.1.7.4
Feb 10 14:06:30 server1 ccsd[21964]: Initial status:: Inquorate

It seems to me that this should be fixable with the tools provided
with the Red Hat Cluster Suite, without disturbing the running cluster.
It seems quite insane if I need to restart my cluster to get it all
working again; that kind of spoils the idea of running a cluster.
This setup is running in an HA environment and we can afford next to no
downtime.

The logs on the healthy server (server2) do not show any errors during
a reboot, a cman restart, or when server1 tries to join the cluster.
We see no "disallowed", "refused" or any other indication that server2
is unwilling to play with server1.

I have been looking at this for a while now. Am I missing anything?

Thank you in advance

-- 
with kind regards,

E.Novation Hosting Center
Thijn van der Schoot
Operations: Unix & Network