[Linux-cluster] clvmd hangs when third node tries to connect to cluster

s.c.graham at gmail.com
Mon Oct 29 10:51:56 UTC 2007


Hi there,

I have a cluster with three nodes (all identical HP DL380 G4s) attached to
a Fibre Channel SAN (HP MSA1000) and serving a number of GFS filesystems.  My
OS is Ubuntu Dapper (6.06) and my kernel is 2.6.15-29-amd64-server.
These machines have been working nicely for a long time.

Over the weekend I upgraded (via apt-get) to the latest version of the Dapper
redhat-cluster-suite package (1.20060222-0ubuntu6.1).  Now, when the
cluster boots, the first two nodes to come up are able to see the GFS
filesystems.  However, the third node to come up hangs at the point of
starting the clvm service.  At the same time, I see the following messages
in /var/log/syslog on one of the other machines in the cluster:

Oct 28 14:42:18 machinea kernel: [ 1681.325152] CMAN: node machinec rejoining
Oct 28 14:42:20 machinea kernel: [ 1683.528299] Extra connection from node 2 attempted

It does not seem to matter which order the nodes come up in - it is
always the third node to boot that will hang when starting clvmd.  I
have included my cluster.conf file below for reference - I can include
any additional diagnostics as required.
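
For example, these are the commands I can run on each node to capture
membership and service state (this assumes the cman_tool and /proc/cluster
interface from the 1.x cluster series that Dapper ships, and clvmd's -d
debug switch):

  # membership and service state as seen by a running node
  cman_tool status
  cman_tool nodes
  cat /proc/cluster/nodes
  cat /proc/cluster/services

  # run clvmd in the foreground with debugging on the node that hangs
  clvmd -d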

Any help would be most appreciated!

Stephen

<?xml version="1.0"?>
<cluster config_version="14" name="alpha_cluster">
       <fence_daemon post_fail_delay="0" post_join_delay="3"/>
       <clusternodes>
               <clusternode name="machineaint" votes="1">
                       <fence>
                               <method name="1">
                                       <device name="machinea_ILO"/>
                               </method>
                       </fence>
               </clusternode>
               <clusternode name="machinebint" votes="1">
                       <fence>
                               <method name="1">
                                       <device name="machineb_ILO"/>
                               </method>
                       </fence>
               </clusternode>
               <clusternode name="machinecint" votes="1">
                       <fence>
                               <method name="1">
                                       <device name="machinec_ILO"/>
                               </method>
                       </fence>
               </clusternode>
       </clusternodes>
       <cman/>
       <fencedevices>
                <fencedevice agent="fence_ilo" hostname="192.168.81.200" login="Login" name="machinea_ILO" passwd="Passwd"/>
                <fencedevice agent="fence_ilo" hostname="192.168.81.199" login="Login" name="machineb_ILO" passwd="Passwd"/>
                <fencedevice agent="fence_ilo" hostname="192.168.81.197" login="Login" name="machinec_ILO" passwd="Passwd"/>
       </fencedevices>
       <rm>
               <failoverdomains>
                        <failoverdomain name="fileservers" ordered="0" restricted="0">
                                <failoverdomainnode name="machineaint" priority="1"/>
                                <failoverdomainnode name="machinebint" priority="1"/>
                                <failoverdomainnode name="machinecint" priority="1"/>
                        </failoverdomain>
                        <failoverdomain name="backupers" ordered="0" restricted="1">
                                <failoverdomainnode name="machineaint" priority="1"/>
                                <failoverdomainnode name="machinebint" priority="1"/>
                        </failoverdomain>
               </failoverdomains>
               <resources>
                       <ip address="192.168.81.98" monitor_link="1"/>
               </resources>
                <service autostart="1" domain="fileservers" exclusive="1" name="fileserver_ip">
                        <ip ref="192.168.81.98"/>
                </service>
                <service autostart="1" domain="backupers" name="backups">
                        <script file="/etc/init.d/dsmcad-init" name="TSM backup script"/>
                </service>
       </rm>
</cluster>
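
In case it is relevant, this is the quick check I am running to confirm
that all three nodes have an identical cluster.conf (config_version 14)
and picked up the same package after the upgrade (the clvm package name
is my guess at what provides clvmd on Dapper):

  for h in machineaint machinebint machinecint; do
          echo "== $h =="
          ssh $h "md5sum /etc/cluster/cluster.conf"
          ssh $h "dpkg -l redhat-cluster-suite clvm 2>/dev/null | grep '^ii'"
  done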



