On 07/16/2012 12:03 PM, Javier Vela wrote:
> Hi, two weeks ago I asked for some help building a two-node cluster with
> HA-LVM. After some e-mails, I finally got my cluster working. The
> problem now is that sometimes, and in some clusters (I have three
> clusters with the same configuration), I get very strange behaviour.
> #1 Openais detects some problem and shuts itself down. The network is OK:
> it is a virtual device in VMware, shared with the heartbeat networks of
> the other clusters, and the problem only happens in one cluster. The
> error messages:
> Jul 16 08:50:32 node1 openais: [TOTEM] FAILED TO RECEIVE
> Jul 16 08:50:32 node1 openais: [TOTEM] entering GATHER state from 6.
> Jul 16 08:50:36 node1 openais: [TOTEM] entering GATHER state from 0
> Do you know what I can check in order to solve the problem? I don't know
> where to start. What makes Openais fail to receive messages?
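One way to confirm how often the failures recur is to pull the timestamps of the TOTEM messages out of syslog and look at the gaps between them. The following is only a minimal sketch, assuming the log format quoted above; the function names and the `year` argument are hypothetical (syslog lines carry no year, so one has to be supplied).

```python
import re
from datetime import datetime

# Matches syslog lines such as:
#   Jul 16 08:50:32 node1 openais: [TOTEM] FAILED TO RECEIVE
LINE_RE = re.compile(
    r"^(?P<ts>[A-Z][a-z]{2}\s+\d{1,2} \d{2}:\d{2}:\d{2}) "
    r"(?P<host>\S+) openais: \[TOTEM\] FAILED TO RECEIVE"
)


def totem_failures(lines, year=2012):
    """Return a list of (host, datetime) for each TOTEM receive failure."""
    events = []
    for line in lines:
        # Tolerate mail-quoted input by dropping leading '>' and spaces.
        line = line.lstrip("> ").rstrip()
        match = LINE_RE.match(line)
        if match:
            ts = datetime.strptime(
                f"{year} {match.group('ts')}", "%Y %b %d %H:%M:%S"
            )
            events.append((match.group("host"), ts))
    return events


def gaps_in_seconds(events):
    """Return the time between consecutive failures, in seconds."""
    times = sorted(ts for _, ts in events)
    return [(b - a).total_seconds() for a, b in zip(times, times[1:])]
```

Feeding it the lines of the system log from each node (for example `totem_failures(open(path))`, where `path` is wherever syslog writes on that machine) and comparing the gaps between nodes may show whether the failures are periodic or line up with something else happening on the host.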
> #2 I'm getting a lot of rgmanager errors when rgmanager tries to change
> the service status, e.g. with clusvcadm -d service. It always happens
> when both nodes are up. If I shut down one node, the command finishes
> successfully. Before running the command, I always check the status
> with clustat, and everything is OK:
> clurgmgrd: <err> #52: Failed changing RG status
> Again, what can I check in order to detect problems with
> rgmanager that clustat and cman_tool don't show?
> #3 Sometimes, not always, a node that has been fenced cannot join the
> cluster after the reboot. With clustat I can see that there is quorum:
> [root node2 ~]# clustat
> Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
> Member Status: Quorate
> Member Name                               ID   Status
> ------ ----                               ---- ------
> node1-hb                                     1 Offline
> node2-hb                                     2 Online, Local, rgmanager
> /dev/disk/by-path/pci-0000:02:01.0-scsi-     0 Online, Quorum Disk
>
> Service Name        Owner (Last)        State
> ------- ----        ----- ------        -----
> service:test        node2-hb            started
> The log shows how node2 fenced node1:
> node2 messages
> Jul 13 04:00:31 node2 fenced: node1 not a cluster member after 0 sec post_fail_delay
> Jul 13 04:00:31 node2 fenced: fencing node "node1"
> Jul 13 04:00:36 node2 clurgmgrd: <info> Waiting for node #1 to be
> Jul 13 04:01:04 node2 fenced: fence "node1" success
> Jul 13 04:01:06 node2 clurgmgrd: <info> Node #1 fenced; continuing
> But the node that tries to join the cluster says that there isn't
> quorum. In the end it stays inquorate, without seeing node1 and the
> quorum disk.
> node1 messages
> Jul 16 05:48:19 node1 ccsd: Error while processing connect: Connection refused
> Jul 16 05:48:19 node1 ccsd: Cluster is not quorate. Refusing
> Do the three errors have something in common? What should I check? I've
> ruled out the cluster configuration because the cluster is working, and
> the errors don't appear on all the nodes. The most annoying error
> currently is #1. Every 10-15 minutes Openais fails and the nodes
> get fenced. I attach the cluster.conf.
> Thanks in advance.
> Regards, Javi