[Linux-cluster] Node can't join already quorated cluster

Wed Jun 20 12:18:17 UTC 2012

Hi, I have a very strange problem, and after searching through a lot of
forums I haven't found a solution. This is the scenario:

A two-node cluster on Red Hat 5.7 with HA-LVM and a quorum disk, but no
fencing. I start qdiskd, cman and rgmanager on one node. After 5 minutes
the fencing step finally finishes and the cluster becomes quorate with 2
votes:

[root@node2 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 05:56:39 2012
Member Status: Quorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node1-hb                                   1 Offline
 node2-hb                                   2 Online, Local, rgmanager
 /dev/mapper/vg_qdisk-lv_qdisk              0 Online, Quorum Disk

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:postgres               node2                          started

Now I start the second node. When cman reaches the fencing step, it hangs
for approximately 5 minutes and finally fails. clustat says:

[root@node1 ~]# clustat
Cluster Status for test_cluster @ Wed Jun 20 06:01:12 2012
Member Status: Inquorate

 Member Name                             ID   Status
 ------ ----                             ---- ------
 node1-hb                                   1 Online, Local
 node2-hb                                   2 Offline
 /dev/mapper/vg_qdisk-lv_qdisk              0 Offline
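
For reference, the vote arithmetic behind the two clustat outputs above can
be sketched as follows. This is only a minimal sketch: it assumes cman
computes the quorum threshold as expected_votes / 2 + 1 with integer
division, the function names are hypothetical, and the vote counts are the
ones from the cluster.conf shown later in this message (one vote per node,
one for the quorum disk, expected_votes="3").

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum (assumed rule: expected_votes // 2 + 1)."""
    return expected_votes // 2 + 1


def is_quorate(votes_present, expected_votes):
    """True if the votes currently visible reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)


EXPECTED_VOTES = 3  # <cman expected_votes="3"/>
NODE_VOTES = 1      # votes="1" on each clusternode
QDISK_VOTES = 1     # votes="1" on quorumd

# node2 sees itself plus the quorum disk; node1 sees only itself.
node2_votes = NODE_VOTES + QDISK_VOTES
node1_votes = NODE_VOTES

print("threshold:", quorum_threshold(EXPECTED_VOTES))
print("node2 quorate:", is_quorate(node2_votes, EXPECTED_VOTES))
print("node1 quorate:", is_quorate(node1_votes, EXPECTED_VOTES))
```

Under that assumption, node1 stays inquorate for as long as it counts
neither node2's vote nor the quorum disk's vote, which matches the output
above.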

And in /var/log/messages I can see these errors:

Jun 20 06:02:12 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:12 node1 openais[6098]: [CLM  ] got nodejoin message 15.15.2.10
Jun 20 06:02:13 node1 dlm_controld[5386]: connect to ccs error -111, check ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:13 node1 ccsd[6090]: Initial status:: Inquorate
Jun 20 06:02:13 node1 gfs_controld[5392]: connect to ccs error -111, check ccsd or cluster status
Jun 20 06:02:13 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:13 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:14 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:14 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:14 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:15 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:15 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:16 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:16 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:17 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:17 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 0.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Creating commit token because I am the rep.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Storing new sequence id for ring 15c
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering COMMIT state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering RECOVERY state.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] position [0] member 15.15.2.10:
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] previous ring seq 344 rep 15.15.2.10
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] aru e high delivered e received flag 1
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Did not need to originate any messages in recovery.
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] Sending initial ORF token
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering OPERATIONAL state.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.
Jun 20 06:02:18 node1 ccsd[6090]: Error while processing connect: Connection refused
Jun 20 06:02:18 node1 openais[6098]: [TOTEM] entering GATHER state from 9.
Jun 20 06:02:18 node1 ccsd[6090]: Cluster is not quorate.  Refusing connection.

And the quorum disk:

[root@node2 ~]# mkqdisk -L -d
mkqdisk v0.6.0
/dev/mapper/vg_qdisk-lv_qdisk:
/dev/vg_qdisk/lv_qdisk:
        Magic:                eb7a62c2
        Label:                cluster_qdisk
        Created:              Thu Jun  7 09:23:34 2012
        Host:                 node1
        Kernel Sector Size:   512
        Recorded Sector Size: 512

Status block for node 1
        Last updated by node 2
        Last updated on Wed Jun 20 06:17:23 2012
        State: Evicted
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.000500 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c
Status block for node 2
        Last updated by node 2
        Last updated on Wed Jun 20 07:09:38 2012
        State: Master
        Flags: 0000
        Score: 0/0
        Average Cycle speed: 0.001000 seconds
        Last Cycle speed: 0.000000 seconds
        Incarnation: 4fe1a06c4fe1a06c

On the other node I don't see any errors in /var/log/messages. One strange
thing is that if I start cman on both nodes at the same time, everything
works fine and both nodes become quorate (until I reboot one node and the
problem appears again). I've checked that multicast is working properly:
with iperf I can send and receive multicast packets. Moreover, with tcpdump
I've seen the packets that openais sends when cman is trying to start. I've
read about a bug in RH 5.3 with the same behaviour, but it was fixed in
RH 5.4.

I don't have SELinux enabled, and iptables is also disabled. Here is the
cluster.conf, simplified (with fewer services and resources). I want to
point out one thing: I have set allow_kill="0" to avoid fencing errors when
qdiskd tries to fence a failed node. As <fence/> is empty, before adding
this attribute I got a lot of failed-fencing messages in /var/log/messages.

<?xml version="1.0"?>
<cluster alias="test_cluster" config_version="15" name="test_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="-1"/>
        <clusternodes>
                <clusternode name="node1-hb" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="node2-hb" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman two_node="0" expected_votes="3"/>
        <fencedevices/>

        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="etest_cluster_fo" nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="node1-hb" priority="1"/>
                                <failoverdomainnode name="node2-hb" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                <service autostart="1" domain="test_cluster_fo" exclusive="0" name="postgres" recovery="relocate">
                        <ip address="172.24.119.44" monitor_link="1"/>
                        <lvm name="vg_postgres" vg_name="vg_postgres" lv_name="postgres"/>

                        <fs device="/dev/vg_postgres/postgres" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="postgres" self_fence="0"/>

                        <script file="/etc/init.d/postgresql" name="postgres">
                        </script>
                </service>
        </rm>
        <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
        <quorumd allow_kill="0" interval="1" label="cluster_qdisk" tko="10" votes="1">
                <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
        </quorumd>
</cluster>
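
One detail in the simplified config above: the failover domain is named
"etest_cluster_fo" while the service refers to "test_cluster_fo". That may
only be a typo introduced while trimming the file for this message, and it
is not necessarily related to the join problem. A minimal sketch of a check
for this kind of mismatch, using only Python's standard library, could look
like the following. The function name and the trimmed SAMPLE fragment are
hypothetical; only the element and attribute names come from the config
above.

```python
import xml.etree.ElementTree as ET


def undefined_failover_domains(conf_xml):
    """Return 'service -> domain' entries whose domain is not defined."""
    root = ET.fromstring(conf_xml)
    # Names of all <failoverdomain> elements declared in the config.
    defined = {fd.get("name") for fd in root.iter("failoverdomain")}
    missing = []
    for svc in root.iter("service"):
        domain = svc.get("domain")
        if domain is not None and domain not in defined:
            missing.append("%s -> %s" % (svc.get("name"), domain))
    return missing


# Trimmed, hypothetical fragment that mirrors the names used above.
SAMPLE = """<cluster name="test_cluster">
  <rm>
    <failoverdomains>
      <failoverdomain name="etest_cluster_fo"/>
    </failoverdomains>
    <service name="postgres" domain="test_cluster_fo"/>
  </rm>
</cluster>"""

print(undefined_failover_domains(SAMPLE))
# prints ['postgres -> test_cluster_fo']
```

To check a real file, the contents of /etc/cluster/cluster.conf could be
read and passed to the same function.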

The /etc/hosts:
172.24.119.10 node1
172.24.119.34 node2
15.15.2.10 node1-hb node1-hb.localdomain
15.15.2.11 node2-hb node2-hb.localdomain

And the versions:
Red Hat Enterprise Linux Server release 5.7 (Tikanga)
cman-2.0.115-85.el5
rgmanager-2.0.52-21.el5
openais-0.80.6-30.el5

I don't know what else to try, so I would be very grateful for any ideas.

Regards, Javi.