[Linux-cluster] Strange behaviours in two-node cluster

Javier Vela jvdiago at gmail.com
Mon Jul 16 18:46:27 UTC 2012


Hi,

I set two_node=0 on purpose, because I use a quorum disk with one
additional vote. If one node fails, I still have two votes and the cluster
remains quorate, avoiding a split-brain situation. Is this approach
wrong? In my tests, this aspect of quorum worked well.
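
For what it's worth, this is roughly how I sanity-check the vote arithmetic on a
running node (just a quick sketch; it assumes the quorum disk is registered with
cman and that mkqdisk/cman_tool are available from the cman package):

# Two nodes at 1 vote each plus the 1-vote quorum disk should give
# "Expected votes: 3" and "Quorum: 2", so losing one node still
# leaves 2 votes and the cluster stays quorate.
cman_tool status | grep -E 'Expected votes|Total votes|Quorum'

# Confirm the quorum disk label is visible from this node.
mkqdisk -L

# Member status; the quorum disk should show up with ID 0,
# just as it does in clustat.
cman_tool nodes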

Fencing works very well. When something goes wrong, fencing kills the
faulty node without any problems.
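
For reference, the manual check you describe would be something like this,
using my node names:

# From node2, fence node1 and confirm it really powers off/reboots:
fence_node node1
# ...and the reverse from node1:
fence_node node2
# fenced logs the result ("fence ... success") in /var/log/messages.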

The first time I ran into problems I checked multicast traffic between the
nodes with iperf and everything appeared to be OK. What I don't know is how
the purge you mentioned works; I wasn't aware that any purging was happening
at all. How can I check whether it is happening? Moreover, when I did the
test only one cluster was running. Now there are three clusters running on
the same virtual switch.
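
So far the only thing I can think of checking on each node is something like
this (eth1 is just a placeholder for my heartbeat interface, and I believe
cman_tool status prints the multicast address cman is using):

# Which multicast groups the heartbeat interface has joined.
ip maddr show dev eth1

# The multicast address the cluster is using (also in cluster.conf,
# if it is set explicitly there).
cman_tool status | grep -i multicast

# Watch IGMP membership reports/queries; if the virtual switch
# purges the group, the nodes should be seen re-joining it.
tcpdump -ni eth1 igmp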


Software:

Red Hat Enterprise Linux Server release 5.7 (Tikanga)
cman-2.0.115-85.el5
rgmanager-2.0.52-21.el5
openais-0.80.6-30.el5


 Regards, Javi

2012/7/16 Digimer <lists at alteeve.ca>

> Why did you set 'two_node="0" expected_votes="3"' on a two node cluster?
> With this, losing a node will mean you lose quorum and all cluster
> activity will stop. Please change this to 'two_node="1"
> expected_votes="1"'.
>
> Did you confirm that your fencing actually works? Do 'fence_node
> node1' and 'fence_node node2' actually kill the target?
>
> Are you running into multicast issues? If your switch (virtual or real)
> purges multicast groups periodically, it will break the cluster.
>
> What version of the cluster software and what distro are you using?
>
> Digimer
>
>
> On 07/16/2012 12:03 PM, Javier Vela wrote:
> > Hi, two weeks ago I asked for some help building a two-node cluster with
> > HA-LVM. After some e-mails, I finally got my cluster working. The
> > problem now is that sometimes, and in some clusters (I have three
> > clusters with the same configuration), I get very strange behaviours.
> >
> > #1 Openais detects some problem and shuts itself down. The network is OK;
> > it is a virtual device in VMware, shared with the other clusters' heartbeat
> > networks, and this only happens in one cluster. The error messages:
> >
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] FAILED TO RECEIVE
> > Jul 16 08:50:32 node1 openais[3641]: [TOTEM] entering GATHER state from 6.
> > Jul 16 08:50:36 node1 openais[3641]: [TOTEM] entering GATHER state from 0
> >
> > Do you know what I can check in order to solve the problem? I don't know
> > where I should start. What makes Openais fail to receive messages?
> >
> >
> > #2 I'm getting a lot of rgmanager errors when rgmanager tries to change
> > the service status, e.g. clusvcadm -d service. It always happens when I
> > have the two nodes up. If I shut down one node, the command finishes
> > successfully. Prior to executing the command, I always check the status
> > with clustat, and everything is OK:
> >
> > clurgmgrd[5667]: <err> #52: Failed changing RG status
> >
> > Again, what can I check in order to detect problems with
> > rgmanager that clustat and cman_tool don't show?
> >
> > #3 Sometimes, not always, a node that has been fenced cannot join the
> > cluster after the reboot. With clustat I can see that there is quorum:
> >
> > clustat:
> > [root at node2 ~]# clustat
> > Cluster Status test_cluster @ Mon Jul 16 05:46:57 2012
> > Member Status: Quorate
> >
> >  Member Name                                   ID   Status
> >  ------ ----                                   ---- ------
> >  node1-hb                                         1 Offline
> >  node2-hb                                         2 Online, Local, rgmanager
> >  /dev/disk/by-path/pci-0000:02:01.0-scsi-         0 Online, Quorum Disk
> >
> >  Service Name                   Owner (Last)                   State
> >  ------- ----                   ----- ------                   -----
> >  service:test                   node2-hb                       started
> >
> > The log shows how node2 fenced node1:
> >
> > node2 messages
> > Jul 13 04:00:31 node2 fenced[4219]: node1 not a cluster member after 0 sec post_fail_delay
> > Jul 13 04:00:31 node2 fenced[4219]: fencing node "node1"
> > Jul 13 04:00:36 node2 clurgmgrd[4457]: <info> Waiting for node #1 to be fenced
> > Jul 13 04:01:04 node2 fenced[4219]: fence "node1" success
> > Jul 13 04:01:06 node2 clurgmgrd[4457]: <info> Node #1 fenced; continuing
> >
> > But the node that tries to join the cluster says that there isn't
> > quorum. In the end it comes up inquorate, without seeing node1 or the
> > quorum disk.
> >
> > node1 messages
> > Jul 16 05:48:19 node1 ccsd[4207]: Error while processing connect: Connection refused
> > Jul 16 05:48:19 node1 ccsd[4207]: Cluster is not quorate.  Refusing connection.
> >
> > Do the three errors have something in common? What should I check? I've
> > ruled out the cluster configuration, because the cluster is working and
> > the errors don't appear on all the nodes. The most annoying error
> > currently is #1: every 10-15 minutes Openais fails and the nodes
> > get fenced. I attach the cluster.conf.
> >
> > Thanks in advance.
> >
> > Regards, Javi
> >
> >
> >
> > --
> > Linux-cluster mailing list
> > Linux-cluster at redhat.com
> > https://www.redhat.com/mailman/listinfo/linux-cluster
> >
>
>
> --
> Digimer
> Papers and Projects: https://alteeve.com
>
>
>

