[Linux-cluster] Strange behaviours in two-node cluster

Mon Jul 16 16:03:34 UTC 2012

Hi, two weeks ago I asked for some help building a two-node cluster with
HA-LVM. After some e-mails, I finally got my cluster working. The problem
now is that sometimes, and only in some clusters (I have three clusters with
the same configuration), I see very strange behaviour.

#1 Openais detects some problem and shuts itself down. The network is OK: it
is a virtual device in VMware, shared with the heartbeat networks of the
other clusters, and the problem only happens in one cluster. The error
messages:

Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from 6.
Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0

Do you know what I can check in order to solve the problem? I don't know
where I should start. What makes Openais fail to receive messages?
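
One way to put numbers on how often this happens is to pull the timestamps
of the FAILED TO RECEIVE lines out of syslog and look at the gaps between
them. The following is only a minimal sketch in Python: the function names
are made up for illustration, the year is an assumption (syslog lines carry
no year), and the regular expression is based only on the log lines quoted
above.

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
PATTERN = re.compile(
    r"^(\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+\S+\s+"
    r"openais\[\d+\]:\s+\[TOTEM\] FAILED TO RECEIVE"
)


def failure_times(lines, year=2012):
    """Return the timestamp of every '[TOTEM] FAILED TO RECEIVE' line."""
    times = []
    for line in lines:
        match = PATTERN.match(line)
        if match:
            # Syslog omits the year, so the caller supplies it (assumption).
            stamp = f"{year} {match.group(1)}"
            times.append(datetime.strptime(stamp, "%Y %b %d %H:%M:%S"))
    return times


def gaps_in_minutes(times):
    """Return the minutes elapsed between consecutive failures."""
    return [(b - a).total_seconds() / 60 for a, b in zip(times, times[1:])]
```

Feeding it the lines of the node's syslog file (for example
/var/log/messages, which is an assumption about where openais logs on this
system) gives the spacing between failures, which makes it easier to
correlate them with whatever else is happening on the network at those
times.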

#2 I'm getting a lot of rgmanager errors when rgmanager tries to change the
service status, e.g. with clusvcadm -d service. It always happens when both
nodes are up. If I shut down one node, the command finishes successfully.
Before executing the command I always check the status with clustat, and
everything is OK:

clurgmgrd[5667]: <err> #52: Failed changing RG status

Again, what can I check in order to detect problems with rgmanager that
clustat and cman_tool don't show?

#3 Sometimes, not always, a node that has been fenced cannot join the
cluster after the reboot. With clustat I can see that there is quorum:

clustat:
[root@node2 ~]# clustat
Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
Member Status: Quorate

 Member Name                              ID   Status
 ------ ----                              ---- ------
 node1-hb                                    1 Offline
 node2-hb                                    2 Online, Local, rgmanager
 /dev/disk/by-path/pci-0000:02:01.0-scsi-    0 Online, Quorum Disk

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:test                   node2-hb                       started

The log shows how node2 fenced node1:

node2 messages
Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0 sec post_fail_delay
Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be fenced
Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing

But the node that tries to join the cluster says that there isn't quorum.
In the end it stays inquorate, without seeing the other node or the quorum
disk.

node1 messages
Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect: Connection refused
Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing connection.
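
For reference, the arithmetic behind the "not quorate" message can be
sketched like this. It assumes cman applies the usual majority rule
(expected_votes / 2 + 1, integer division) when two_node is 0, and it takes
the vote counts from the attached cluster.conf; the function names are
illustrative only.

```python
def quorum_threshold(expected_votes):
    """Votes needed for quorum: a strict majority of expected_votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_seen, expected_votes):
    """True when the votes a node can see reach the quorum threshold."""
    return votes_seen >= quorum_threshold(expected_votes)


# Values from the attached cluster.conf: two nodes with votes="1" each,
# a quorum disk with votes="1", and expected_votes="3".
EXPECTED_VOTES = 3

print(quorum_threshold(EXPECTED_VOTES))  # 2
print(is_quorate(1, EXPECTED_VOTES))     # False: one node alone, no qdisk
print(is_quorate(2, EXPECTED_VOTES))     # True: node plus qdisk, or both nodes
```

So a rebooted node that sees neither its peer nor the quorum disk holds only
its own vote and stays below the threshold of 2.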

Do the three errors have something in common? What should I check? I've
ruled out the cluster configuration because the cluster is working, and the
errors don't appear on all the nodes. The most annoying error currently is
#1: every 10-15 minutes Openais fails and the nodes get fenced. I attach
the cluster.conf.

Thanks in advance.

Regards, Javi
-------------- next part --------------
<?xml version="1.0"?>
<cluster alias="test_cluster" config_version="3" name="test_cluster">
        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="6"/>
        <clusternodes>
                <clusternode name="node1" nodeid="1" votes="1">
                        <fence>
                                <method name="fence">
                                        <device name="fence-vmware" uuid="77777777777777777777777"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node2" nodeid="2" votes="1">
                        <fence>
                                <method name="fence">
                                        <device name="fence-vmware" uuid="777777777777777777777771"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <cman two_node="0" expected_votes="3"/>
        <fencedevices>
                <fencedevice agent="fence_vmware_soap" ipaddr="XX.XX.XX.XX" login="XXX" passwd="XXXXX" ssl="1" action="reboot" name="fence-vmware"/>
        </fencedevices>
        <rm log_facility="local4" log_level="7">
                <failoverdomains>
                        <failoverdomain name="test_cluster_fo" nofailback="1" ordered="1" restricted="1">
                                <failoverdomainnode name="node1" priority="1"/>
                                <failoverdomainnode name="node2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources/>
                <service autostart="1" domain="test_cluster_fo" exclusive="0" name="web_service" recovery="relocate">
                        <ip address="192.168.1.1" monitor_link="1"/>
                        <lvm name="vg_www" vg_name="vg_www" lv_name="www"/>
                        <lvm name="vg_mysql" vg_name="vg_mysql" lv_name="mysql"/>
                        <fs device="/dev/vg_www/www" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/www" name="www" self_fence="0"/>
                        <fs device="/dev/vg_mysql/mysql" force_fsck="1" force_unmount="1" fstype="ext3" mountpoint="/var/lib/pgsql" name="mysql" self_fence="0"/>
                        <script file="/etc/init.d/mysql" name="mysql"/>
                        <script file="/etc/init.d/httpd" name="httpd"/>
                </service>
        </rm>
        <totem consensus="4000" join="60" token="20000" token_retransmits_before_loss_const="20"/>
        <quorumd interval="1" label="test_qdisk" tko="10" votes="1">
                <heuristic program="/usr/share/cluster/check_eth_link.sh eth0" score="1" interval="2" tko="3"/>
        </quorumd>
</cluster>
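
The timing values in this cluster.conf can be compared with a short Python
sketch like the one below. It reads the totem token (milliseconds) and the
quorumd interval (seconds) and tko (cycles), and reports the ratio between
them. A commonly quoted guideline, stated here as an assumption to be
checked against qdisk(5), is that the token timeout should be at least about
twice interval * tko. The function name and the returned keys are made up
for illustration.

```python
import xml.etree.ElementTree as ET


def timing_summary(conf_xml):
    """Extract the totem token and qdisk timeout from cluster.conf text."""
    root = ET.fromstring(conf_xml)
    totem = root.find("totem")
    quorumd = root.find("quorumd")

    token_ms = int(totem.get("token"))
    # qdiskd declares a node dead after `tko` missed cycles of `interval` s.
    qdisk_timeout_ms = int(quorumd.get("interval")) * int(quorumd.get("tko")) * 1000

    return {
        "token_ms": token_ms,
        "qdisk_timeout_ms": qdisk_timeout_ms,
        "ratio": token_ms / qdisk_timeout_ms,
    }
```

With the values above (token="20000", interval="1", tko="10") this gives a
token of 20000 ms against a qdisk timeout of 10000 ms, a ratio of 2.0.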