[Linux-cluster] Power fencing, Xen, Virtual Services, RHCS

Agnieszka Kukałowicz qqlka at nask.pl
Fri Feb 1 07:34:18 UTC 2008


Hi,

I have a problem with a two-node cluster running Xen virtual machines.
The configuration is very simple: node 1 (d1) runs vm_service1 and node
2 (d2) runs vm_service2, and both have an APC Master Switch configured
as the fence device.

Everything works well: starting, stopping and migrating virtual services
between nodes. The problem occurs when I test a crash of one of the
nodes, for example by shutting down node d2. In that case node d1
detects that d2 has failed and fences it through the APC device. Once d2
is back up, it rejoins the cluster and tries to relocate vm_service2.
While that is happening I get strange logs on node d2:

Jan 31 21:18:11 d2 openais[5485]: [TOTEM] entering OPERATIONAL state.
Jan 31 21:18:11 d2 openais[5485]: [CLM  ] got nodejoin message 10.0.200.101
Jan 31 21:18:11 d2 openais[5485]: [CLM  ] got nodejoin message 10.0.200.102
Jan 31 21:18:11 d2 openais[5485]: [CPG  ] got joinlist message from node 2
Jan 31 21:18:45 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:18:46 d2 openais[5485]: [TOTEM] Retransmit List: 31
....
Jan 31 21:19:10 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:11 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:11 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:11 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:12 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:15 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:15 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:16 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:17 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:17 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:18 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:18 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:18 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:19 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:19 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:19 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:20 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:21 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:23 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:23 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:23 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:24 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:24 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:24 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:26 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:26 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:26 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:27 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:27 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:27 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:29 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:29 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:29 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:30 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:30 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:30 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:32 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:32 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:32 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:33 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:33 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:33 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:35 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:35 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:35 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:36 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:36 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:36 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:38 d2 openais[5485]: [TOTEM] Retransmit List: 31
Jan 31 21:19:38 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
Jan 31 21:19:38 d2 openais[5485]: [TOTEM] entering GATHER state from 6.
Jan 31 21:19:43 d2 openais[5485]: [TOTEM] entering GATHER state from 0.
Jan 31 21:20:18 d2 openais[5485]: [TOTEM] The consensus timeout expired.
Jan 31 21:20:18 d2 openais[5485]: [TOTEM] entering GATHER state from 3.
Jan 31 21:20:52 d2 openais[5485]: [TOTEM] The consensus timeout expired.
Jan 31 21:20:52 d2 openais[5485]: [TOTEM] entering GATHER state from 3.

And on node d1:
Jan 31 21:18:08 d1 openais[5467]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 31 21:18:08 d1 openais[5467]: [CLM  ] New Configuration:
Jan 31 21:18:08 d1 openais[5467]: [CLM  ]       r(0) ip(10.0.200.101)
Jan 31 21:18:08 d1 openais[5467]: [CLM  ] Members Left:
Jan 31 21:18:08 d1 openais[5467]: [CLM  ] Members Joined:
Jan 31 21:18:08 d1 openais[5467]: [CLM  ] CLM CONFIGURATION CHANGE
Jan 31 21:18:08 d1 openais[5467]: [CLM  ] New Configuration:
Jan 31 21:18:09 d1 openais[5467]: [CLM  ]       r(0) ip(10.0.200.101)
Jan 31 21:18:09 d1 openais[5467]: [CLM  ]       r(0) ip(10.0.200.102)
Jan 31 21:18:09 d1 openais[5467]: [CLM  ] Members Left:
Jan 31 21:18:09 d1 openais[5467]: [CLM  ] Members Joined:
Jan 31 21:18:09 d1 openais[5467]: [CLM  ]       r(0) ip(10.0.200.102)
Jan 31 21:18:09 d1 openais[5467]: [SYNC ] This node is within the primary component and will provide service.
Jan 31 21:18:09 d1 openais[5467]: [TOTEM] entering OPERATIONAL state.
Jan 31 21:18:09 d1 openais[5467]: [CLM  ] got nodejoin message 10.0.200.101
Jan 31 21:18:10 d1 openais[5467]: [CLM  ] got nodejoin message 10.0.200.102
Jan 31 21:18:10 d1 openais[5467]: [CPG  ] got joinlist message from node 2
Jan 31 21:18:15 d1 kernel: dlm: connecting to 1
Jan 31 21:18:15 d1 kernel: dlm: got connection from 1
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] The token was lost in the OPERATIONAL state.
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] Transmit multicast socket send buffer size (288000 bytes).
Jan 31 21:19:47 d1 openais[5467]: [TOTEM] entering GATHER state from 2.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] entering GATHER state from 0.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Creating commit token because I am the rep.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Saving state aru 30 high seq received 31
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Storing new sequence id for ring 4bc
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] entering COMMIT state.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] entering RECOVERY state.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] position [0] member 10.0.200.101:
Jan 31 21:19:52 d1 kernel: dlm: closing connection to node 1
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] previous ring seq 1208 rep 10.0.200.101
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] aru 30 high delivered 30 received flag 0
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] copying all old ring messages from 31-31.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Originated 0 messages in RECOVERY.
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Originated for recovery:
Jan 31 21:19:52 d1 fenced[5484]: d2.local.polska.pl not a cluster member after 0 sec post_fail_delay
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Not Originated for recovery: 31
Jan 31 21:19:52 d1 fenced[5484]: fencing node "d2"
Jan 31 21:19:52 d1 openais[5467]: [TOTEM] Sending initial ORF token
Jan 31 21:19:53 d1 fenced[5484]: fence "d2" success
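
Because the d2 log repeats the same three TOTEM messages for about
half a minute, a condensed view may be easier to read. The following is
only a minimal sketch of how such excerpts could be summarized; it is
not part of openais or RHCS, the function name summarize and the regular
expression are my own assumptions about the syslog layout shown above,
and any line that does not start with a syslog timestamp followed by
openais[pid] is simply skipped.

```python
import re
from collections import Counter

# Assumed layout, taken from the excerpts above, e.g.:
#   Jan 31 21:19:10 d2 openais[5485]: [TOTEM] FAILED TO RECEIVE
LINE_RE = re.compile(
    r"^(?P<ts>\w{3}\s+\d+\s+\d{2}:\d{2}:\d{2})\s+(?P<host>\S+)\s+"
    r"openais\[\d+\]:\s+\[(?P<tag>[A-Z]+)\s*\]\s+(?P<msg>.*)$"
)


def summarize(lines):
    """Return {(host, tag, message): (count, first_ts, last_ts)}."""
    counts = Counter()
    first, last = {}, {}
    for line in lines:
        m = LINE_RE.match(line.strip())
        if not m:
            continue  # wrapped continuation lines and non-openais lines
        key = (m["host"], m["tag"], m["msg"])
        counts[key] += 1
        first.setdefault(key, m["ts"])  # keep the earliest timestamp seen
        last[key] = m["ts"]             # overwrite with the latest one
    return {k: (counts[k], first[k], last[k]) for k in counts}


if __name__ == "__main__":
    import sys

    # Usage (hypothetical): python summarize.py < /var/log/messages
    for (host, tag, msg), (n, t0, t1) in summarize(sys.stdin).items():
        print(f"{host} [{tag}] {msg}: {n}x, {t0} .. {t1}")
```

Run against the d2 excerpt, this would collapse the Retransmit List /
FAILED TO RECEIVE / GATHER loop into one line per message with its
first and last timestamp, which makes it easier to compare against the
moment d1 reports the lost token.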

As a consequence, I cannot start the cluster, because node d1 constantly
fences node d2.

After some research I found that the problem might be in Xen networking.
While a virtual service is starting, the Xen bridges are being
reconfigured (am I wrong?), and that interrupts communication between
the nodes.
But I don't know what to change in the Xen configuration so that the
cluster starts working.

Cheers
Agnieszka Kukalowicz





