Hi,

I've got a 3-node RHEL 5.3 cluster. I'm running the cluster nodes as Xen Dom0 domains so I can deploy DomU domains as vm services within the cluster.

Hardware is:

  3 x Dell PowerEdge 1855 blades
  2 x Dell PowerConnect 5316M Ethernet modules (for eth0 and eth1)

I have a 4th blade acting as an iSCSI target, exporting one 2GB and two 20GB targets. The 2GB target is used as /etc/xen/ on the cluster nodes, mounted as a _netdev mount in /etc/fstab on the cluster nodes (mounted on /xen, with symlinks from /etc/xen to /xen/xen).

All network traffic uses the same switch module, since I'm only using eth0 at this time.

To install the nodes, I kickstart from a Satellite and do a "yum update" followed by a reboot to get to RHEL 5.3. I also deploy the same cluster.conf to each node (appended to this email).

I then bring up cman, rgmanager, clvmd and gfs on all nodes (using the "Send input to all sessions" feature of Konsole to start the services at the same time on all nodes). This brings up the cluster, and allows me to mount the iSCSI target for /xen. Starting xend allows me to enable the vm service listed in cluster.conf (clusvcadm -e vm:node1).

Oh, I also log *.* to a syslog server so I can see all the logs in one place.

Nodes are:

  c1.eris.qinetiq.com
  c2.eris.qinetiq.com
  c3.eris.qinetiq.com

"So far so good", I think.

So, I enable cman, rgmanager, clvmd, gfs and xend to start on boot and reboot the cluster (all three nodes at the same time), at which point everything starts to fall apart.

As the nodes come up and try to form a cluster, nodes c1 and c2 appear to form a cluster, and then fence node c3 when it joins.

When node c3 comes back up and tries to join the cluster, node c1 decides the cluster is no longer quorate, and fences node c2.

When node c2 comes back up and tries to join the cluster, node c1 decides the cluster is no longer quorate, and fences node c3.

This continues for as long as I'm entertained watching the logs, until I switch off all three servers.

Does anyone have any insight into what the difference is between starting the cluster services manually and starting them at boot, and why that difference (because I can't think of any other difference between the two states) would stop me from ever getting a stable cluster?

I'm at a bit of a loss really - I moved from a 2-node cluster to a 3-node one to try and avoid exactly these problems. I've also had the same problem with a CentOS 5.2 cluster on the same hardware - in that case the nodes were still fencing each other the following morning, 18 hours later!

Regards,

Mark.

--
Mark Watts BSc RHCE MBCS
Senior Systems Engineer
QinetiQ Applied Technologies
GPG Key: http://www.linux-corner.info/mwatts.gpg

<?xml version="1.0"?>
<cluster alias="WebFarmTest" config_version="1" name="WebFarmTest">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <clusternodes>
    <clusternode name="c1.eris.qinetiq.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device name="DRACMC" modulename="Server-1" action="Off"/>
          <device name="DRACMC" modulename="Server-1" action="On"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="c2.eris.qinetiq.com" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device name="DRACMC" modulename="Server-2" action="Off"/>
          <device name="DRACMC" modulename="Server-2" action="On"/>
        </method>
      </fence>
    </clusternode>
    <clusternode name="c3.eris.qinetiq.com" nodeid="3" votes="1">
      <fence>
        <method name="1">
          <device name="DRACMC" modulename="Server-3" action="Off"/>
          <device name="DRACMC" modulename="Server-3" action="On"/>
        </method>
      </fence>
    </clusternode>
  </clusternodes>
  <cman expected_votes="2"/>
  <fencedevices>
    <fencedevice agent="fence_drac" ipaddr="XXX" login="XXX" name="DRACMC" passwd="XXX"/>
  </fencedevices>
  <rm>
    <failoverdomains>
      <failoverdomain name="webfarm-fd" nofailback="0" ordered="0" restricted="1">
        <failoverdomainnode name="c1.eris.qinetiq.com" priority="1"/>
        <failoverdomainnode name="c2.eris.qinetiq.com" priority="1"/>
        <failoverdomainnode name="c3.eris.qinetiq.com" priority="1"/>
      </failoverdomain>
    </failoverdomains>
    <resources/>
    <vm autostart="1" domain="webfarm-fd" exclusive="1" migrate="live" name="node1" path="/etc/xen/" recovery="relocate"/>
  </rm>
</cluster>
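
For anyone who wants to check the vote arithmetic in the config above, here is a minimal Python sketch. It is not part of the cluster setup; the function name vote_summary and the trimmed CLUSTER_CONF string are made up for illustration, and the quorum formula (expected_votes // 2 + 1) is an assumption based on the usual simple-majority rule rather than anything read from the cman sources. It only reports the numbers and does not claim that any mismatch explains the fencing loop.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above: only the elements that carry votes.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster alias="WebFarmTest" config_version="1" name="WebFarmTest">
  <clusternodes>
    <clusternode name="c1.eris.qinetiq.com" nodeid="1" votes="1"/>
    <clusternode name="c2.eris.qinetiq.com" nodeid="2" votes="1"/>
    <clusternode name="c3.eris.qinetiq.com" nodeid="3" votes="1"/>
  </clusternodes>
  <cman expected_votes="2"/>
</cluster>
"""


def vote_summary(conf_text):
    """Return node votes, expected_votes and the assumed majority quorum."""
    root = ET.fromstring(conf_text)

    # Sum the votes attribute of every clusternode (defaulting to 1 if absent).
    node_votes = sum(int(n.get("votes", "1")) for n in root.iter("clusternode"))

    # expected_votes as written in the <cman> element.
    expected_votes = int(root.find("cman").get("expected_votes"))

    # ASSUMPTION: simple majority of expected_votes.
    quorum = expected_votes // 2 + 1

    return {
        "node_votes": node_votes,
        "expected_votes": expected_votes,
        "quorum": quorum,
    }


if __name__ == "__main__":
    for key, value in vote_summary(CLUSTER_CONF).items():
        print(f"{key}: {value}")
```

With the config as posted, the three nodes carry 3 votes in total while expected_votes is set to 2, so the two figures can be compared at a glance.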