[Linux-cluster] DLM problem

I have a simple test cluster (RHEL 6 HA) up and running on three VMware guests. Each guest has three vNICs.

After booting a node, I often get a dead rgmanager:

[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists

The cluster is otherwise OK:

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate

 Member Name                              ID   Status
 ------ ----                              ---- ------
 syseng1-vm                                  1 Online, Local
 syseng2-vm                                  2 Online
 syseng3-vm                                  3 Online

There is a service running on node 2 (syseng2-vm), but clustat shows no information about it.

[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses:
Node addresses:

The syslog has some info:

Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107
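On Linux, errno 107 is ENOTCONN, so the -107 presumably means the DLM communications layer had no usable local address, which would match the first kernel message. To check other nodes for the same failure, a minimal sketch follows. The function name is hypothetical, and it only looks for the exact message text shown above.

```shell
# Reads syslog text on stdin and succeeds if the dlm lowcomms
# failure from the log excerpt above is present.
# The function name is hypothetical, not part of any cluster tool.
has_dlm_lowcomms_error() {
    grep -q 'dlm: cannot start dlm lowcomms'
}
```

Assuming the default RHEL syslog location, it could be used as `has_dlm_lowcomms_error < /var/log/messages && echo "dlm failed to start"`.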

The fix is always the same:

[root@syseng1-vm ~]# service cman restart
Stopping cluster:
   Leaving fence domain...                                 [  OK  ]
   Stopping gfs_controld...                                [  OK  ]
   Stopping dlm_controld...                                [  OK  ]
   Stopping fenced...                                      [  OK  ]
   Stopping cman...                                        [  OK  ]
   Waiting for corosync to shutdown:                       [  OK  ]
   Unloading kernel modules...                             [  OK  ]
   Unmounting configfs...                                  [  OK  ]
Starting cluster:
   Checking Network Manager...                             [  OK  ]
   Global setup...                                         [  OK  ]
   Loading kernel modules...                               [  OK  ]
   Mounting configfs...                                    [  OK  ]
   Starting cman...                                        [  OK  ]
   Waiting for quorum...                                   [  OK  ]
   Starting fenced...                                      [  OK  ]
   Starting dlm_controld...                                [  OK  ]
   Starting gfs_controld...                                [  OK  ]
   Unfencing self...                                       [  OK  ]
   Joining fence domain...                                 [  OK  ]

[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager:                          [  OK  ]
Starting Cluster Service Manager:                          [  OK  ]

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate

 Member Name                              ID   Status
 ------ ----                              ---- ------
 syseng1-vm                                  1 Online, Local, rgmanager
 syseng2-vm                                  2 Online, rgmanager
 syseng3-vm                                  3 Online

 Service Name                   Owner (Last)                   State
 ------- ----                   ----- ------                   -----
 service:TestDB                 syseng2-vm                     started

Sometimes restarting rgmanager hangs and the node needs to be rebooted.
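The manual workaround above could be sketched as a small script. This is only an illustration of the steps I run by hand, not a fix for the underlying problem. The function names and the SERVICE override are hypothetical, and it does nothing about the case where the rgmanager restart hangs.

```shell
#!/bin/sh
# Sketch of the manual workaround; all function names are hypothetical.

# Succeeds when the given status text matches the failure seen after boot.
rgmanager_is_dead() {
    case "$1" in
        *"dead but pid file exists"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Restart cman first, then rgmanager, in the same order as done by hand.
# SERVICE can be overridden (for example SERVICE=echo) for a dry run.
recover_node() {
    "${SERVICE:-service}" cman restart &&
        "${SERVICE:-service}" rgmanager restart
}

# Example use on an affected node:
#   status=$(service rgmanager status)
#   rgmanager_is_dead "$status" && recover_node
```

Running it with SERVICE=echo prints the two commands instead of executing them, which is a safe way to confirm the order.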

My cluster.conf:

<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="syseng1-vm" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="syseng1-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng2-vm" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="syseng2-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng3-vm" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="syseng3-vm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng1-vm" passwd="[CENSORED]" port="syseng1-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng2-vm" passwd="[CENSORED]" port="syseng2-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng3-vm" passwd="[CENSORED]" port="syseng3-vm"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="syseng1-vm" priority="1"/>
        <failoverdomainnode name="syseng2-vm" priority="1"/>
        <failoverdomainnode name="syseng3-vm" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="" monitor_link="on" sleeptime="10"/>
      <fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg" name="SharedDisk"/>
      <script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
    </resources>
    <service autostart="1" domain="AllNodes" exclusive="0" name="TestDB" recovery="relocate">
      <ip ref=""/>
      <fs ref="SharedDisk"/>
      <script ref="postgresql"/>
    </service>
  </rm>
</cluster>

Does anyone have any idea what is going on?

