[Linux-cluster] Problem with service migration with xen domU on different dom0 with redhat 5.4

Sun May 16 00:50:13 UTC 2010

Problem solved: 

In this case I had several network-related problems; now I have both migration and failover working. Two things were very useful: getting multicast communication working over virbr0, and choosing a multicast address explicitly. Another important factor was that this version did not recognize the Xen domain names as usual. I had to add -U xen:///, and only with this option did the fence get executed.

In rc.local of dom0 I put: 

/sbin/fence_xvmd -LX -I virbr0 -U xen:/// -a 224.0.0.1 
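Since the choice of multicast address mattered here, a minimal Python sketch for checking which range an address falls in may help. The function name `describe_multicast` is hypothetical; the ranges are the standard IANA IPv4 multicast blocks. It only classifies the address and does not test whether fence_xvmd can actually receive traffic on it.

```python
import ipaddress

# Standard IANA IPv4 multicast blocks used for the classification below.
LOCAL_CONTROL = ipaddress.ip_network("224.0.0.0/24")   # link-local control block
ADMIN_SCOPED = ipaddress.ip_network("239.0.0.0/8")     # administratively scoped

def describe_multicast(addr: str) -> str:
    """Return a short description of the multicast range an address is in."""
    ip = ipaddress.ip_address(addr)
    if not ip.is_multicast:
        return "not multicast"
    if ip in LOCAL_CONTROL:
        return "local network control block (not forwarded by routers)"
    if ip in ADMIN_SCOPED:
        return "administratively scoped"
    return "other multicast"

if __name__ == "__main__":
    # 224.0.0.1 is the address passed with -a above; 225.0.0.1 is the
    # address used in the earlier, non-working configuration.
    for candidate in ("224.0.0.1", "225.0.0.1"):
        print(candidate, "->", describe_multicast(candidate))
```

224.0.0.1 is the all-hosts group on the local subnet, so every multicast-capable host on the bridge is already a member of it.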

In cluster.conf I specified the address 224.0.0.1 and also generated two different fence_xvm.key files, one for each host. My cluster.conf is:

<?xml version="1.0"?>
<cluster alias="clusterapache01" config_version="87" name="clusterapache01">
  <clusternodes>
    <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device domain="vmapache1" name="xenfence1"/>
        </method>
      </fence>
      <multicast addr="224.0.0.1"/>
    </clusternode>
    <clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device domain="vmapache2" name="xenfence2"/>
        </method>
      </fence>
      <multicast addr="224.0.0.1"/>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3">
    <multicast addr="224.0.0.1"/>
  </cman>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="vmapache1.foo.com" priority="1"/>
        <failoverdomainnode name="vmapache2.foo.com" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="172.19.52.120" monitor_link="1"/>
      <apache config_file="conf/httpd.conf" name="web1" server_root="/etc/httpd" shutdown_wait="0"/>
      <script file="/etc/init.d/httpd" name="httpd"/>
    </resources>
    <service autostart="1" domain="prefer_node1" exclusive="1" name="web-scs" recovery="relocate">
      <ip ref="172.19.52.120"/>
      <script ref="httpd"/>
    </service>
  </rm>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
  <fencedevices>
    <fencedevice agent="fence_xvm" key_file="/etc/cluster/host-1.key" name="xenfence1"/>
    <fencedevice agent="fence_xvm" key_file="/etc/cluster/host-2.key" name="xenfence2"/>
  </fencedevices>
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
  <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1">
    <heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/>
  </quorumd>
  <fence_xvmd/>
</cluster>
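The multicast address appears in three places in this configuration, and each node refers to a fence device by name, so a quick consistency check can catch mismatches. The following is only a sketch in Python: the helper names are hypothetical, and `SAMPLE_CONF` is a trimmed copy of the configuration above rather than the real /etc/cluster/cluster.conf.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above, keeping only the parts being checked.
SAMPLE_CONF = """<cluster alias="clusterapache01" config_version="87" name="clusterapache01">
<clusternodes>
<clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
<fence><method name="1"><device domain="vmapache1" name="xenfence1"/></method></fence>
<multicast addr="224.0.0.1"/>
</clusternode>
<clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
<fence><method name="1"><device domain="vmapache2" name="xenfence2"/></method></fence>
<multicast addr="224.0.0.1"/>
</clusternode>
</clusternodes>
<cman expected_votes="3"><multicast addr="224.0.0.1"/></cman>
<fencedevices>
<fencedevice agent="fence_xvm" key_file="/etc/cluster/host-1.key" name="xenfence1"/>
<fencedevice agent="fence_xvm" key_file="/etc/cluster/host-2.key" name="xenfence2"/>
</fencedevices>
</cluster>"""

def multicast_addresses(root):
    """Collect every distinct addr used by a <multicast> element."""
    return {m.get("addr") for m in root.iter("multicast")}

def undefined_fence_devices(root):
    """Return device names used by nodes but missing from <fencedevices>."""
    defined = {d.get("name") for d in root.iter("fencedevice")}
    used = {d.get("name") for d in root.iter("device")}
    return used - defined

root = ET.fromstring(SAMPLE_CONF)

if __name__ == "__main__":
    addrs = multicast_addresses(root)
    print("multicast addresses:", sorted(addrs))
    if len(addrs) > 1:
        print("warning: more than one multicast address is configured")
    print("undefined fence devices:", sorted(undefined_fence_devices(root)))
```

To run it against a real file, `ET.parse(path).getroot()` can replace the `ET.fromstring` call.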

Now the only thing I would still like to do is add a fabric fence as a backup fence for when dom0 goes down. Does anyone have experience using 3Com or D-Link switches for fabric fencing?

Best Regards, 

Carlos Vermejo Ruiz 

----- Original Message -----
From: "Carlos VERMEJO RUIZ" <cvermejo at softwarelibreandino.com>
To: linux-cluster at redhat.com
Sent: Monday, 10 May 2010 22:42:42
Subject: Re: Problem with service migration with xen domU on different dom0 with redhat 5.4

I just came back from a trip and made some changes to my cluster.conf, but now I am getting a clearer error:

May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor 
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed 

I also got more information telling me that the cluster services on node 1 are down; when I restart rgmanager, it starts working.

More details: 

[root@vmapache2 ~]# service rgmanager status
clurgmgrd (pid 1866) is running...
[root@vmapache2 ~]# cman_tool status
Version: 6.2.0 
Config Version: 60 
Cluster Name: clusterapache01 
Cluster Id: 38965 
Cluster Member: Yes 
Cluster Generation: 300 
Membership state: Cluster-Member 
Nodes: 2 
Expected votes: 3 
Quorum device votes: 1 
Total votes: 3 
Quorum: 2 
Active subsystems: 10 
Flags: Dirty 
Ports Bound: 0 11 177 
Node name: vmapache2.foo.com 
Node ID: 2 
Multicast addresses: 225.0.0.1 
Node addresses: 172.19.168.122 
[root@vmapache2 ~]#
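The vote counts in this output can be checked by hand. The sketch below assumes the usual majority rule (quorum is more than half of the expected votes), which matches the "Expected votes: 3" and "Quorum: 2" lines above; the function names are hypothetical.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Smallest number of votes that is a strict majority of expected_votes."""
    return expected_votes // 2 + 1

def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True when the votes currently present reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)

if __name__ == "__main__":
    expected = 3  # two nodes with one vote each, plus one quorum device vote
    print("quorum threshold:", quorum_threshold(expected))
    # One surviving node that still sees the quorum device: 1 + 1 votes.
    print("node + quorum device quorate:", is_quorate(2, expected))
    # One surviving node that has lost the quorum device: 1 vote.
    print("node alone quorate:", is_quorate(1, expected))
```

Under this rule a single node stays quorate only while it can still reach the quorum device.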

/var/log/messages:

May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.121 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.122 
May 10 20:27:07 vmapache2 openais[1562]: [CPG ] got joinlist message from node 2 
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response 
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940). 
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor 
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed 
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1 
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1 
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP 

[root@vmapache2 ~]# tail -n 100 /var/log/messages
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). 
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes). 
May 10 20:24:25 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 2. 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 0. 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Creating commit token because I am the rep. 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Saving state aru 49 high seq received 49 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 128 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering COMMIT state. 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state. 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.122: 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] previous ring seq 292 rep 172.19.168.121 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] aru 49 high delivered 49 received flag 1 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery. 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] Sending initial ORF token 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] New Configuration: 
May 10 20:24:30 vmapache2 fenced[1620]: vmapache1.foo.com not a cluster member after 0 sec post_fail_delay 
May 10 20:24:30 vmapache2 kernel: dlm: closing connection to node 1 
May 10 20:24:30 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com DOWN 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122) 
May 10 20:24:30 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com" 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Left: 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.121) 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Joined: 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] New Configuration: 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122) 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Left: 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] Members Joined: 
May 10 20:24:30 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service. 
May 10 20:24:30 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state. 
May 10 20:24:30 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.122 
May 10 20:24:30 vmapache2 openais[1562]: [CPG ] got joinlist message from node 2 
May 10 20:24:35 vmapache2 clurgmgrd[1867]: <info> Waiting for node #1 to be fenced 
May 10 20:24:47 vmapache2 qdiskd[1604]: <info> Assuming master role 
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] lost contact with quorum device 
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum lost, blocking activity 
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <emerg> #1: Quorum Dissolved 
May 10 20:24:49 vmapache2 qdiskd[1604]: <notice> Writing eviction notice for node 1 
May 10 20:24:49 vmapache2 openais[1562]: [CMAN ] quorum regained, resuming activity 
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1 
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <err> Checking Existence Of File /var/run/cluster/apache/apache:web1.pid [apache:web1] > Failed - File Doesn't Exist 
May 10 20:24:49 vmapache2 clurgmgrd: [1867]: <info> Stopping Service apache:web1 > Succeed 
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <notice> Quorum Regained 
May 10 20:24:49 vmapache2 clurgmgrd[1867]: <info> State change: Local UP 
May 10 20:24:51 vmapache2 qdiskd[1604]: <notice> Node 1 evicted 
May 10 20:25:00 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response 
May 10 20:25:00 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (32130). 
May 10 20:25:00 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor 
May 10 20:25:00 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed 
May 10 20:25:05 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com" 
May 10 20:25:36 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response 
May 10 20:25:36 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (33270). 
May 10 20:25:36 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor 
May 10 20:25:36 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed 
May 10 20:25:41 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com" 
May 10 20:26:11 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response 
May 10 20:26:11 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed 
May 10 20:26:16 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com" 
May 10 20:26:47 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response 
May 10 20:26:47 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35010). 
May 10 20:26:47 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor 
May 10 20:26:47 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed 
May 10 20:26:52 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com" 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering GATHER state from 11. 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Saving state aru 10 high seq received 10 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Storing new sequence id for ring 12c 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering COMMIT state. 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering RECOVERY state. 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [0] member 172.19.168.121: 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.121 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru a high delivered a received flag 1 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] position [1] member 172.19.168.122: 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] previous ring seq 296 rep 172.19.168.122 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] aru 10 high delivered 10 received flag 1 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] Did not need to originate any messages in recovery. 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] New Configuration: 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122) 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Left: 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Joined: 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] CLM CONFIGURATION CHANGE 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] New Configuration: 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.121) 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.122) 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Left: 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] Members Joined: 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] r(0) ip(172.19.168.121) 
May 10 20:27:07 vmapache2 openais[1562]: [SYNC ] This node is within the primary component and will provide service. 
May 10 20:27:07 vmapache2 openais[1562]: [TOTEM] entering OPERATIONAL state. 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.121 
May 10 20:27:07 vmapache2 openais[1562]: [CLM ] got nodejoin message 172.19.168.122 
May 10 20:27:07 vmapache2 openais[1562]: [CPG ] got joinlist message from node 2 
May 10 20:27:23 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response 
May 10 20:27:23 vmapache2 ccsd[1550]: Attempt to close an unopened CCS descriptor (35940). 
May 10 20:27:23 vmapache2 ccsd[1550]: Error while processing disconnect: Invalid request descriptor 
May 10 20:27:23 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed 
May 10 20:27:29 vmapache2 kernel: dlm: connecting to 1 
May 10 20:27:29 vmapache2 kernel: dlm: got connection from 1 
May 10 20:27:41 vmapache2 clurgmgrd[1867]: <info> State change: vmapache1.foo.com UP 
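To see how long each fence attempt ran before failing, the fenced lines can be paired up. This is a rough Python sketch: the function name `fence_attempts` is hypothetical, the year is assumed because syslog timestamps do not carry one, and `LOG` holds only a few fenced lines copied from the output above.

```python
import re
from datetime import datetime

# A few fenced lines copied from the log above.
LOG = """May 10 20:24:30 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:25:00 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:00 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed
May 10 20:25:05 vmapache2 fenced[1620]: fencing node "vmapache1.foo.com"
May 10 20:25:36 vmapache2 fenced[1620]: agent "fence_xvm" reports: Timed out waiting for response
May 10 20:25:36 vmapache2 fenced[1620]: fence "vmapache1.foo.com" failed"""

# Timestamp, host, then the fenced message text.
LINE_RE = re.compile(r"^(\w{3} +\d+ \d\d:\d\d:\d\d) \S+ fenced\[\d+\]: (.*)$")

def fence_attempts(log_text, year=2010):
    """Return the duration in seconds of each failed fence attempt."""
    durations = []
    started = None
    for line in log_text.splitlines():
        match = LINE_RE.match(line.strip())
        if not match:
            continue
        stamp = datetime.strptime(f"{year} {match.group(1)}", "%Y %b %d %H:%M:%S")
        message = match.group(2)
        if message.startswith("fencing node"):
            started = stamp  # an attempt begins
        elif message.startswith("fence ") and message.endswith("failed"):
            if started is not None:
                durations.append(int((stamp - started).total_seconds()))
                started = None
    return durations

if __name__ == "__main__":
    print("seconds per failed attempt:", fence_attempts(LOG))
```

In this excerpt each attempt runs for about half a minute before fence_xvm reports the timeout, and fenced retries five seconds later.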

Here is my cluster.conf file: 

<?xml version="1.0"?>
<cluster alias="clusterapache01" config_version="60" name="clusterapache01">
  <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="60"/>
  <clusternodes>
    <clusternode name="vmapache1.foo.com" nodeid="1" votes="1">
      <fence>
        <method name="1">
          <device domain="vmapache1" name="xenfence1"/>
        </method>
      </fence>
      <multicast addr="225.0.0.1" interface="eth1"/>
    </clusternode>
    <clusternode name="vmapache2.foo.com" nodeid="2" votes="1">
      <fence>
        <method name="1">
          <device domain="vmapache2" name="xenfence2"/>
        </method>
      </fence>
      <multicast addr="225.0.0.1" interface="eth1"/>
    </clusternode>
  </clusternodes>
  <cman expected_votes="3">
    <multicast addr="225.0.0.1"/>
  </cman>
  <fencedevices>
    <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host1.key" name="xenfence1"/>
    <fencedevice agent="fence_xvm" key_file="/etc/cluster/fence_xvm-host2.key" name="xenfence2"/>
  </fencedevices>
  <rm log_level="7">
    <failoverdomains>
      <failoverdomain name="prefer_node1" nofailback="1" ordered="1" restricted="1">
        <failoverdomainnode name="vmapache1.foo.com" priority="1"/>
        <failoverdomainnode name="vmapache2.foo.com" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <resources>
      <ip address="172.19.52.120" monitor_link="1"/>
      <netfs export="/data" force_unmount="0" fstype="nfs4" host="172.19.50.114" mountpoint="/var/www/html" name="htdoc" options="rw,no_root_squash"/>
      <apache config_file="conf/httpd.conf" name="web1" server_root="/etc/httpd" shutdown_wait="0"/>
    </resources>
    <service autostart="1" domain="prefer_node1" exclusive="0" name="web-scs" recovery="relocate">
      <ip ref="172.19.52.120"/>
      <apache ref="web1"/>
    </service>
  </rm>
  <fence_xvmd/>
  <totem consensus="4800" join="60" token="10000" token_retransmits_before_loss_const="20"/>
  <quorumd device="/dev/sda1" interval="2" min_score="1" tko="10" votes="1">
    <heuristic interval="2" program="ping -c1 -t1 172.19.52.119" score="1"/>
  </quorumd>
</cluster>

Best Regards, 

Carlos Vermejo Ruiz 