
Re: [Linux-cluster] DLM problem





On Thu, Mar 17, 2011 at 10:29 PM, Richard Allen <ra@ra.is> wrote:
I have a simple test cluster (RHEL 6 HA) up and running on three VMware guests. Each guest has 3 vNICs.

After booting a node, I often get a dead rgmanager:

[root@syseng1-vm ~]# service rgmanager status
rgmanager dead but pid file exists

The cluster is otherwise OK:

[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:10:38 2011
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 syseng1-vm                               1 Online, Local
 syseng2-vm                               2 Online
 syseng3-vm                               3 Online

There is a service running on node2, but clustat shows no info about it.
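Service state in clustat comes from rgmanager itself, so with rgmanager dead on this node only membership is listed. One way to confirm the service is still being tracked by the rest of the cluster is to ask a node where rgmanager is alive (hostname below is the one from this cluster):

ssh syseng2-vm clustat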



[root@syseng1-vm ~]# cman_tool status
Version: 6.2.0
Config Version: 9
Cluster Name: RHEL6Test
Cluster Id: 36258
Cluster Member: Yes
Cluster Generation: 88
Membership state: Cluster-Member
Nodes: 3
Expected votes: 3
Total votes: 3
Node votes: 1
Quorum: 2
Active subsystems: 1
Flags:
Ports Bound: 0
Node name: syseng1-[CENSORED]
Node ID: 1
Multicast addresses: 239.192.141.48
Node addresses: 10.10.16.11
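
Note the "Active subsystems: 1" above; on a node where fenced, dlm_controld and gfs_controld are all connected to cman, that count is normally higher, so it may mean the other cluster daemons are not (or no longer) talking to cman at that point. A few quick checks, assuming the stock RHEL 6 cluster tools:

fence_tool ls    # is this node joined to the fence domain?
dlm_tool ls      # is dlm_controld up, and which lockspaces exist?
group_tool ls    # summary of the fence/dlm/gfs groups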


The syslog has some info:

Mar 17 15:47:55 syseng1-vm rgmanager[2463]: Quorum formed
Mar 17 15:47:55 syseng1-vm kernel: dlm: no local IP address has been set
Mar 17 15:47:55 syseng1-vm kernel: dlm: cannot start dlm lowcomms -107
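
Those two kernel lines are the key symptom: error -107 is ENOTCONN, and "no local IP address has been set" suggests rgmanager tried to create its DLM lockspace before dlm_controld had written the node addresses into the DLM's configfs tree. Assuming the stock RHEL 6 configfs layout, you can check whether the addresses are in place after boot with:

ls /sys/kernel/config/dlm/cluster/comms/          # expect one directory per cluster node
cat /sys/kernel/config/dlm/cluster/comms/*/local  # exactly one entry should read 1

If that directory is still empty when rgmanager starts, these are the errors you would expect.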


The fix is always the same:

[root@syseng1-vm ~]# service cman restart
Stopping cluster:
  Leaving fence domain...                                 [  OK  ]
  Stopping gfs_controld...                                [  OK  ]
  Stopping dlm_controld...                                [  OK  ]
  Stopping fenced...                                      [  OK  ]
  Stopping cman...                                        [  OK  ]
  Waiting for corosync to shutdown:                       [  OK  ]
  Unloading kernel modules...                             [  OK  ]
  Unmounting configfs...                                  [  OK  ]
Starting cluster:
  Checking Network Manager...                             [  OK  ]
  Global setup...                                         [  OK  ]
  Loading kernel modules...                               [  OK  ]
  Mounting configfs...                                    [  OK  ]
  Starting cman...                                        [  OK  ]
  Waiting for quorum...                                   [  OK  ]
  Starting fenced...                                      [  OK  ]
  Starting dlm_controld...                                [  OK  ]
  Starting gfs_controld...                                [  OK  ]
  Unfencing self...                                       [  OK  ]
  Joining fence domain...                                 [  OK  ]

[root@syseng1-vm ~]# service rgmanager restart
Stopping Cluster Service Manager:                          [  OK  ]
Starting Cluster Service Manager:                          [  OK  ]



[root@syseng1-vm ~]# clustat
Cluster Status for RHEL6Test @ Thu Mar 17 16:22:01 2011
Member Status: Quorate

 Member Name                                               ID   Status
 ------ ----                                               ---- ------
 syseng1-vm                               1 Online, Local, rgmanager
 syseng2-vm                               2 Online, rgmanager
 syseng3-vm                               3 Online

 Service Name                                     Owner (Last)                                     State
 ------- ----                                     ----- ------                                     -----
 service:TestDB                                   syseng2-vm                  started
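
Since a manual cman + rgmanager restart always clears it, this looks like a boot-time ordering or timing problem rather than a configuration problem. It may be worth confirming that the init scripts start in the order network < cman < rgmanager in the runlevel you boot into (standard RHEL 6 commands; runlevel 3 is just an example):

chkconfig --list cman
chkconfig --list rgmanager
ls /etc/rc3.d/ | grep -E 'network|cman|rgmanager'   # S-numbers should increase in that order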



Sometimes restarting rgmanager hangs and the node needs to be rebooted.
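
When the stop step hangs, it can help to see what rgmanager is actually blocked on before resorting to a reboot. A minimal sketch, assuming a RHEL 6 kernel that exposes /proc/<pid>/stack:

for p in $(pidof rgmanager); do echo "== pid $p =="; cat /proc/$p/stack; done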


I'm running a libvirtd setup on top of KVM/Qemu and I have a similar experience to yours. I have to force power off the VMs to be able to reboot them. I also lose quorum from time to time, etc. I also noticed bad performance from gfs2 inside such a setup, and I'm starting to think it has something to do with virtualization and that there is something about the cluster manager we simply don't know, probably some tweaking that is not yet in the docs. I'm using SL6, by the way, which is very, very close to RHEL 6. Unfortunately I don't have the time to test with CentOS 5 on the VMs, or with the most recent Fedora. Perhaps it is something specific to RHEL 6?
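
One thing that may be worth ruling out in a virtualized setup is unreliable multicast between the guests, since corosync here is using 239.192.141.48 (see the cman_tool status output above). omping, which is available for RHEL 6 / SL6, tests exactly that; the hostnames below are the ones from this thread:

omping syseng1-vm syseng2-vm syseng3-vm   # run on all three nodes at roughly the same time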
 
my cluster.conf:


<?xml version="1.0"?>
<cluster config_version="9" name="RHEL6Test">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="syseng1-vm" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="syseng1-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng2-vm" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="syseng2-vm"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="syseng3-vm" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="syseng3-vm"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman/>
  <fencedevices>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng1-vm" passwd="[CENSORED]" port="syseng1-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng2-vm" passwd="[CENSORED]" port="syseng2-vm"/>
    <fencedevice agent="fence_vmware" ipaddr="vcenter-lab" login="Administrator" name="syseng3-vm" passwd="[CENSORED]" port="syseng3-vm"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="AllNodes" nofailback="0" ordered="0" restricted="0">
        <failoverdomainnode name="syseng1-vm" priority="1"/>
        <failoverdomainnode name="syseng2-vm" priority="1"/>
        <failoverdomainnode name="syseng3-vm" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="10.10.16.234" monitor_link="on" sleeptime="10"/>
      <fs device="/dev/vgpg/pgsql" fsid="62946" mountpoint="/opt/rg" name="SharedDisk"/>
      <script file="/etc/rc.d/init.d/postgresql" name="postgresql"/>
    </resources>
    <service autostart="1" domain="AllNodes" exclusive="0" name="TestDB" recovery="relocate">
      <ip ref="10.10.16.234"/>
      <fs ref="SharedDisk"/>
      <script ref="postgresql"/>
    </service>
  </rm>
</cluster>
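
Nothing in this cluster.conf looks obviously related to the DLM error, but after any edit it is worth validating the file and checking that all nodes agree on the config version (both commands ship with the RHEL 6 cman packages):

ccs_config_validate
cman_tool version   # the config version (9 here) should match on every node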



Anyone have any ideas about what is going on?


--
Linux-cluster mailing list
Linux-cluster@redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster

