[Linux-cluster] Cluster of XEN guests unstable when rebooting a node under CS5.1

Thu Dec 13 18:43:23 UTC 2007

Good! That seems to be the right solution. My answers and comments are below.

Thanks, Paolo

> On Wed, 2007-12-12 at 19:23 +0100, Paolo Marini wrote:
>   
>> I am repeating my request for help, hoping someone has run into (and 
>> ideally solved) the same issues.
>>
>> I am building a cluster of Xen guests, each with its root file system 
>> residing in a file on a GFS filesystem (on iSCSI, actually).
>>
>> Each cluster node mounts a GFS filesystem residing on an iSCSI device.
>>
>> For performance reasons, both the iSCSI device and the physical nodes 
>> (which also form a cluster) use two gigabit Ethernet links bonded with 
>> LACP. On the physical machines, I had to insert a "sleep 30" in the 
>> /etc/init.d/iscsi script before the iSCSI login to wait for the bond 
>> interface to come up; otherwise the iSCSI devices are not seen and no 
>> GFS mount is possible.
>>
>> As for the cluster of Xen guests, they work fine, and I am able to 
>> migrate each one to a different physical node without problems on the guest.
>>
>> When I reboot or fence one of the guests, the guest cluster breaks: 
>> quorum is dissolved, and I have to fence ALL the nodes and reboot 
>> them in order for the cluster to restart.
>>     
>
> How many guests, and what are you using for fencing?
>
>   
I am using 5 guests: 4 are in a cluster and the remaining one is a 
management node (nagios etc.). I am fencing with fence_xvm, and it 
is correctly configured and working. Each physical node is a DELL PE860 
with 4 GB of RAM, one quad-core Xeon and 3 network interfaces; two are 
used for bonding and the third is reserved for IPMI (which I use for 
fencing the physical nodes).

Each guest has two network interfaces configured (eth0 and the alias 
eth0:0): one is for private communication between the nodes and with 
the iSCSI device, the other for public access to the nodes. I am not 
using VLANs.

>> Does it have to do with the Xen bridge going up and down for a time 
>> longer than the heartbeat timeout?
>>     
>
> Not sure - it shouldn't be that big of a deal.  If you think that's the
> problem try adding:
>
>    <totem token="30000"/>
>
>   
It seems much more stable; more tests will confirm this. With this 
change, after an xm destroy on a guest the rest of the guest cluster 
stays up, detects the missing guest and fences it successfully. The 
machine restarts and rejoins the cluster.

> to the vm cluster's cluster.conf
>
> -- Lon
