
Re: [Linux-cluster] Cluster of XEN guests unstable when rebooting a node under CS5.1

Good! That seems to be the right solution. My answers and comments are below.

Thanks, Paolo

On Wed, 2007-12-12 at 19:23 +0100, Paolo Marini wrote:
I am repeating my request for help, hoping someone has run into (and hopefully solved) the same issues.

I am building a cluster of XEN guests whose root file systems each reside in a file on a GFS filesystem (actually on iscsi).

Each cluster node mounts a GFS file system residing on an iscsi device.

For performance reasons, both the iscsi device and the physical nodes (which also form a cluster) use two Gigabit Ethernet interfaces with bonding and LACP. On the physical machines, I had to insert a sleep 30 in the /etc/init.d/iscsi script before the iscsi login to wait for the bond interface to come up; otherwise the iscsi devices are not seen and no gfs mount is possible.
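A fixed sleep 30 works, but it either waits too long or not long enough. As a rough sketch only, the wait could instead poll the link state of the bond. The interface name bond0, the use of /sys/class/net/<iface>/operstate, and the helper name wait_for_link are assumptions, not taken from the setup described above; some drivers report "unknown" rather than "up", in which case the fixed sleep remains the safer fallback.

```shell
#!/bin/sh
# Sketch: wait for a bonded interface to report "up" instead of sleeping blindly.
# Assumptions: the bond is called bond0 and the kernel exposes
# /sys/class/net/<iface>/operstate. Neither is confirmed by the original post.

# wait_for_link STATE_FILE TIMEOUT_SECONDS
# Returns 0 as soon as STATE_FILE contains "up", 1 if the timeout expires first.
wait_for_link() {
    state_file=$1
    timeout=$2
    elapsed=0
    while [ "$elapsed" -lt "$timeout" ]; do
        if [ -r "$state_file" ] && [ "$(cat "$state_file")" = "up" ]; then
            return 0
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
    return 1
}

# Hypothetical call, placed before the iscsi login step of the init script:
# wait_for_link /sys/class/net/bond0/operstate 30 || echo "bond0 not up after 30s" >&2
```

The timeout of 30 seconds mirrors the original sleep, so the worst case is no slower than before.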

As for the cluster of XEN guests, they work fine: I am able to migrate each one to a different physical node without problems on the guest.

When I reboot or fence one of the guests, the guest cluster breaks, i.e. quorum is dissolved, and I have to fence ALL the nodes and reboot them for the cluster to restart.

How many guests, and what are you using for fencing?

I am using 5 guests: 4 are in a cluster and the remaining one is a management node (nagios etc.). I am using fence_xvm for fencing, and it is correctly configured and working. Each physical node is a DELL PE860 with 4 GB of RAM, one quad-core Xeon and 3 network interfaces; two are used for bonding and the third is reserved for IPMI (which I use for fencing the physical nodes).
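As a side note on the numbers, here is a small sketch of the usual majority-quorum arithmetic for the four clustered guests. It assumes one vote per node and no quorum disk, which the post does not state; quorum_votes is a hypothetical helper name.

```shell
#!/bin/sh
# quorum_votes EXPECTED_VOTES
# Prints the minimum number of votes needed for quorum: floor(expected / 2) + 1.
quorum_votes() {
    echo $(( $1 / 2 + 1 ))
}

nodes=4                        # clustered guests, one vote each (assumption)
needed=$(quorum_votes "$nodes")
remaining=$((nodes - 1))       # one guest rebooted or fenced
echo "quorum needs $needed votes; $remaining remain after losing one guest"
```

Under these assumptions the loss of a single guest should not dissolve quorum on its own, which suggests that something else (such as membership timeouts) is involved.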

Each guest configures two network interfaces (eth0 and eth0:0): one for private communication between the nodes and with the iscsi device, the other for public access to the nodes. I am not using VLANs.

Could this be caused by the xen bridge going up and down for longer than the heartbeat timeout?

Not sure - it shouldn't be that big of a deal.  If you think that's the
problem try adding:

   <totem token="30000"/>

It seems much more stable, though more tests are needed to confirm this. For now, after an xm destroy on a guest, the rest of the guest cluster stays up, detects the missing guest, and fences it successfully. The machine then restarts and rejoins the cluster.

to the vm cluster's cluster.conf
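For orientation, a minimal sketch of where the totem line would sit and what the value means. The token attribute is in milliseconds, so 30000 is a 30 second timeout. The cluster name, the config_version value and the helper totem_token_ms are hypothetical, and the file shown is trimmed to the bare minimum rather than a working configuration.

```shell
#!/bin/sh
# totem_token_ms FILE
# Prints the token value (milliseconds) from the first <totem .../> element
# in FILE, or nothing if no token attribute is set.
totem_token_ms() {
    sed -n 's/.*<totem[^>]*token="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Hypothetical, heavily trimmed cluster.conf for the guest cluster.
conf=$(mktemp)
cat > "$conf" <<'EOF'
<?xml version="1.0"?>
<cluster name="guestcluster" config_version="2">
   <totem token="30000"/>
   <clusternodes/>
</cluster>
EOF

ms=$(totem_token_ms "$conf")
echo "token timeout: ${ms} ms ($((ms / 1000)) s)"
rm -f "$conf"
```

When editing a real cluster.conf, config_version would normally need to be incremented so that the change is picked up by the other nodes.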

-- Lon


