[Linux-cluster] RHCS 4-node cluster: Networking/Membership issues

Abraham Alawi a.alawi at auckland.ac.nz
Thu Apr 30 01:19:42 UTC 2009


If you haven't tried them already, the following fence_daemon settings
in cluster.conf might help, especially "clean_start":

<fence_daemon clean_start="1" post_fail_delay="5" post_join_delay="15"/>
clean_start --> fenced assumes all nodes are in a clean state at
startup and skips startup fencing
post_fail_delay --> seconds to wait before fencing a node after it has
failed (i.e. the cluster has lost contact with it)
post_join_delay --> seconds to wait before fencing any node that
should be fenced after a node joins the fence domain (e.g. at startup)
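
As a rough sketch of where that line goes: fence_daemon is a direct
child of the top-level cluster element, next to clusternodes and
fencedevices. The cluster name and config_version below are
placeholders, not values from the poster's configuration, and the
existing node and fence device entries are left out. Remember to
increment config_version whenever cluster.conf is changed.

```xml
<?xml version="1.0"?>
<!-- name and config_version are placeholders for illustration only -->
<cluster name="example" config_version="2">
  <!-- fence_daemon sits directly under cluster -->
  <fence_daemon clean_start="1" post_fail_delay="5" post_join_delay="15"/>
  <clusternodes>
    <!-- existing clusternode entries stay unchanged -->
  </clusternodes>
  <fencedevices>
    <!-- existing fencedevice entries stay unchanged -->
  </fencedevices>
</cluster>
```

Note that clean_start="1" disables startup fencing, so it is best
treated as a troubleshooting aid while the membership problem is being
tracked down, not as a permanent setting.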

On 30/04/2009, at 8:21 AM, Flavio Junior wrote:

> Hi folks,
>
> I've been trying to set up a 4-node RHCS+GFS cluster for a while. I
> have another 2-node cluster running CentOS 5.3 without problems.
>
> Well, my scenario is as follows:
>
> * System configuration and info: http://pastebin.com/f41d63624
>
> * Network: http://www.uploadimagens.com/upload/2ac9074fbb10c2479c59abe419880dc8.jpg
>  * Switches in the loop are 3Com 2924 (or 2948)-SFP
>  * Have STP enabled (RSTP auto)
>  * IGMP snooping disabled, as suggested in comment 32 of:
> http://magazine.redhat.com/2007/08/23/automated-failover-and-recovery-of-virtualized-guests-in-advanced-platform/
>  * Yellow lines are a single-mode fiber link, 990 ft (330 m)
>  * I'm using a dedicated tagged VLAN for cluster-heartbeat
>  * I'm using 2 NICs with bonding mode=1 (active/backup) for
> heartbeat and 4 NICs for "public"
>  * Every node has its four public cables plugged into the same
> switch, with link aggregation configured on it
>  * In the picture, the 2 switches joined by the fiber link at the
> bottom are where the nodes are plugged in, 2 nodes per building.
>
> SAN: http://img139.imageshack.us/img139/642/clusters.jpg
>  * Switches: Brocade TotalStorage 16SAN-B
>  * Storage: IBM DS4700 72A (using ERM for synchronous replication at
> the storage level)
>
> My problem is:
>
> I can't get all 4 nodes up. Every time the fourth (sometimes even the
> third) node comes online, one or two of the others get fenced. I keep
> getting openais/cman messages about cpg_mcast_joined very often:
> --- snipped ---
> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1098900
> Apr 29 16:08:23 athos groupd[5393]: cpg_mcast_joined retry 1099000
> --- snipped ---
>
> Only rarely can I get a node to boot and join the fence domain.
> Almost every time it hangs, and I either have to reboot and try again,
> or reboot, enter single-user mode, disable cman, reboot again, and keep
> trying "service cman start/stop". Sometimes the other nodes can see
> the node in the domain, but its boot still hangs at "Starting fenced..."
>
> ########
> [root@athos ~]# cman_tool services
> type             level name     id       state
> fence            0     default  00010001 none
> [1 3 4]
> dlm              1     clvmd    00020001 none
> [1 3 4]
> [root@athos ~]# cman_tool nodes -f
> Node  Sts   Inc   Joined               Name
>   0   M      0   2009-04-29 15:16:47  /dev/disk/by-id/scsi-3600a0b800048834e000014fb49dcc47b
>   1   M   7556   2009-04-29 15:16:35  athos-priv
>       Last fenced:   2009-04-29 15:13:49 by athos-ipmi
>   2   X   7820                        porthos-priv
>       Last fenced:   2009-04-29 15:31:01 by porthos-ipmi
>       Node has not been fenced since it went down
>   3   M   7696   2009-04-29 15:27:15  aramis-priv
>       Last fenced:   2009-04-29 15:24:17 by aramis-ipmi
>   4   M   8232   2009-04-29 16:12:34  dartagnan-priv
>       Last fenced:   2009-04-29 16:09:53 by dartagnan-ipmi
> [root@athos ~]# ssh root@aramis-priv
> ssh: connect to host aramis-priv port 22: Connection refused
> [root@athos ~]# ssh root@dartagnan-priv
> ssh: connect to host dartagnan-priv port 22: Connection refused
> [root@athos ~]#
> #########
>
> (I know ssh is not a reliable indicator, but I can also see the
> console screen hung. This is just to illustrate it.)
>
>
> The BIG log file: http://pastebin.com/f453c220
> Every entry in this log after 16:54 is from when node2 (porthos-priv
> 172.16.1.2) was booting and hung at "Starting fenced..."
>
>
> I have no more ideas for solving this problem, so any hints are
> appreciated. If you need any other info, just tell me how to get it
> and I'll post it as soon as I read your message.
>
>
> Many thanks in advance.
>
> --
>
> Flávio do Carmo Júnior aka waKKu
>

''''''''''''''''''''''''''''''''''''''''''''''''''''''
Abraham Alawi

Unix/Linux Systems Administrator
Science IT
University of Auckland
e: a.alawi at auckland.ac.nz
p: +64-9-373 7599, ext#: 87572

''''''''''''''''''''''''''''''''''''''''''''''''''''''




