[Linux-cluster] RHCS 4-node cluster: Networking/Membership issues

Flavio Junior billpp at gmail.com
Wed May 6 22:44:15 UTC 2009


Hi again folks...

One update here:

- I removed the bonding for the cluster heartbeat (bond0) and set it up
directly on eth0 on all nodes. This solved the membership issue.
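
For reference, the heartbeat interface is now just a plain static setup,
roughly like this (the address is only a placeholder for our private
172.16.1.x range):

  # /etc/sysconfig/network-scripts/ifcfg-eth0
  DEVICE=eth0
  BOOTPROTO=static
  IPADDR=172.16.1.1
  NETMASK=255.255.255.0
  ONBOOT=yes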

Now I can boot all 4 nodes, have them join the fence domain, and start clvmd
on them. Everything is stable and I no longer see the random "openais
retransmit" messages.

Of course, I still have a problem :).

I have 1 GFS filesystem and 16 GFS2 filesystems. I can mount all of them on
node1 and node2 (same building/switch), but when I try to run "service gfs2
start" on node3 or node4 (the other building/switch) things become unstable
and the whole cluster fails with endless "cpg_mcast_retry RETRY_NUMBER"
messages.
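
For what it's worth, the gfs2 init script is just mounting whatever gfs2
entries are listed in /etc/fstab; each of the 16 entries looks roughly like
this (LV and mount point names are placeholders):

  /dev/vg_san/lv_data01   /mnt/data01   gfs2   defaults,noatime   0 0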

Log can be found here: http://pastebin.com/m2f26ab1d

Apparently, without the bonding setup the network layer becomes "simpler" and
can handle membership, but it still can't handle the GFS/GFS2 heartbeat.

I've set the nodes to use IGMPv2, as described at:
http://archives.free.net.ph/message/20081001.223026.9cf6d7bf.de.html
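
For the record, forcing IGMPv2 comes down to the force_igmp_version sysctl,
something like this (eth0 here is just our heartbeat NIC, adjust as needed):

  # /etc/sysctl.conf
  net.ipv4.conf.all.force_igmp_version = 2
  net.ipv4.conf.eth0.force_igmp_version = 2

applied with "sysctl -p".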

Well.. any hints?

Thanks again.

--

Flávio do Carmo Júnior aka waKKu


On Thu, Apr 30, 2009 at 1:41 PM, Flavio Junior <billpp at gmail.com> wrote:

> Hi Abraham, thanks for your answer.
>
> I added your suggestion to cluster.conf, but I still get the same
> problem.
>
> Here is what I did (commands sketched below):
> * Disabled the cman init script on boot for all nodes
> * Edited the config file and copied it to all nodes
> * Rebooted all nodes
> * Started cman on node1 (OK)
> * Started cman on node2 (OK)
> * Started cman on node3 (problems becoming a member; node2 got fenced)
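>
> Roughly, the commands for those steps were along these lines (the node
> names are just ours, adjust as needed):
>
>   chkconfig cman off                                  # on every node
>   scp /etc/cluster/cluster.conf node2:/etc/cluster/   # and node3, node4
>   reboot
>   service cman start                                  # one node at a time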
>
> Here is the log file with this process 'til the fence:
> http://pastebin.com/f477e7114
>
> PS: node1 and node2 are on the same switch at site1; node3 and node4 are
> on the same switch at site2.
>
> Thanks again. Any other suggestions?
>
> I don't know if it would help, but is corosync a feasible option for
> production use?
>
> --
>
> Flávio do Carmo Júnior aka waKKu
>
> On Wed, Apr 29, 2009 at 10:19 PM, Abraham Alawi <a.alawi at auckland.ac.nz>
> wrote:
> > If you haven't tried them already, the following settings in cluster.conf
> > might help, especially "clean_start":
> >
> > <fence_daemon clean_start="1" post_fail_delay="5" post_join_delay="15"/>
> > clean_start --> assume the cluster is in a healthy state upon startup
> > post_fail_delay --> seconds to wait before fencing a node it thinks should
> > be fenced (i.e. one it has lost contact with)
> > post_join_delay --> seconds to wait before fencing any node that should be
> > fenced upon startup (right after joining)
> >
> > On 30/04/2009, at 8:21 AM, Flavio Junior wrote:
> >
> >> Hi folks,
> >>
> >> I've been trying to set up a 4-node RHCS+GFS cluster for a while. I have
> >> another 2-node cluster using CentOS 5.3 without problems.
> >>
> >> Well... My scenario is as follows:
> >>
> >> * System configuration and info: http://pastebin.com/f41d63624
> >>
> >> * Network:
> >>
> >>   http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg
> >>  * The switches in the loop are 3Com 2924-SFP (or 2948-SFP)
> >>  * STP is enabled (RSTP auto)
> >>  * IGMP snooping is disabled, as described in comment 32 at:
> >>    http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/
> >>  * The yellow lines are a 990 ft (330 m) single-mode fiber link
> >>  * I'm using a dedicated tagged VLAN for the cluster heartbeat
> >>  * I'm using 2 NICs with bonding mode=1 (active/backup) for the heartbeat
> >> and 4 NICs for the "public" network (bond config sketched right after
> >> this list)
> >>  * Every node has its four public cables plugged into the same switch,
> >> with link aggregation configured on them
> >>  * Looking at the picture, the two switches joined by the lower fiber
> >> link are where the nodes are plugged in, 2 nodes in each building
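> >>
> >> For reference, the heartbeat bond is a plain active-backup setup, roughly
> >> like this (the address is just a placeholder for our private range):
> >>
> >>   # /etc/modprobe.conf
> >>   alias bond0 bonding
> >>   options bond0 mode=1 miimon=100
> >>
> >>   # /etc/sysconfig/network-scripts/ifcfg-bond0
> >>   DEVICE=bond0
> >>   BOOTPROTO=static
> >>   IPADDR=172.16.1.1
> >>   NETMASK=255.255.255.0
> >>   ONBOOT=yes
> >>
> >>   # ifcfg-eth0 (and likewise ifcfg-eth1, the two heartbeat NICs)
> >>   DEVICE=eth0
> >>   MASTER=bond0
> >>   SLAVE=yes
> >>   ONBOOT=yes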
> >>
> >> SAN: http://img139.imageshack.us/img139/642/clusters.jpg
> >>  * Switches: Brocade TotalStorage 16SAN-B
> >>  * Storage arrays: IBM DS4700 72A (using ERM for synchronous replication
> >> at the storage level)
> >>
> >> My problem is:
> >>
> >> I can't get all 4 nodes up. Every time the fourth (sometimes even the
> >> third) node comes online, one or two of them get fenced. I keep getting
> >> frequent openais/cman cpg_mcast_joined messages:
> >> --- snipped ---
> >> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
> >> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000
> >> --- snipped ---
> >>
> >> It is really seldom that I can get a node to boot up and join the
> >> fence domain; almost every time it hangs and I have to reboot and try
> >> again, or else reboot, enter single-user mode, disable cman, reboot,
> >> and keep trying "service cman start/stop". Sometimes the other nodes
> >> can see the node in the domain, but its boot stays hung at "Starting
> >> fenced..."
> >>
> >> ########
> >> [root at athos ~]# cman_tool services
> >> type             level name     id       state
> >> fence            0     default  00010001 none
> >> [1 3 4]
> >> dlm              1     clvmd    00020001 none
> >> [1 3 4]
> >> [root at athos ~]# cman_tool nodes -f
> >> Node  Sts   Inc   Joined               Name
> >>  0   M      0   2009-04-29 15:16:47
> >> /dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b
> >>  1   M   7556   2009-04-29 15:16:35  athos-priv
> >>      Last fenced:   2009-04-29 15:13:49 by athos-ipmi
> >>  2   X   7820                        porthos-priv
> >>      Last fenced:   2009-04-29 15:31:01 by porthos-ipmi
> >>      Node has not been fenced since it went down
> >>  3   M   7696   2009-04-29 15:27:15  aramis-priv
> >>      Last fenced:   2009-04-29 15:24:17 by aramis-ipmi
> >>  4   M   8232   2009-04-29 16:12:34  dartagnan-priv
> >>      Last fenced:   2009-04-29 16:09:53 by dartagnan-ipmi
> >> [root at athos ~]# ssh root at aramis-priv
> >> ssh: connect to host aramis-priv port 22: Connection refused
> >> [root at athos ~]# ssh root at dartagnan-priv
> >> ssh: connect to host dartagnan-priv port 22: Connection refused
> >> [root at athos ~]#
> >> #########
> >>
> >> (I know how unreliable ssh is as a check, but I can see that the console
> >> screen is hung... just trying to show it)
> >>
> >>
> >> The BIG log file: http://pastebin.com/f453c220
> >> Every entry in this log after 16:54 is from when node2 (porthos-priv,
> >> 172.16.1.2) was booting and hung on "Starting fenced..."
> >>
> >>
> >> I have no more ideas for solving this problem; any hints are
> >> appreciated. If you need any other info, just tell me how to get it
> >> and I'll post it as soon as I've read your reply.
> >>
> >>
> >> Many thanks in advance.
> >>
> >> --
> >>
> >> Flávio do Carmo Júnior aka waKKu
> >>
> >> --
> >> Linux-cluster mailing list
> >> Linux-cluster at redhat.com
> >> https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> > ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> > Abraham Alawi
> >
> > Unix/Linux Systems Administrator
> > Science IT
> > University of Auckland
> > e: a.alawi at auckland.ac.nz
> > p: +64-9-373 7599, ext#: 87572
> >
> > ''''''''''''''''''''''''''''''''''''''''''''''''''''''
> >
> >
>

