
[Cluster-devel] fence daemon problems



I observe strange problems with fencing when a cluster loses quorum for a short time.

 

After the cluster regains quorum, fenced reports "wait state    messages", and the whole cluster

is blocked waiting for fenced.

 

I can reproduce that bug here easily. It always happens with the following test:

 

Software: RHEL 6.3-based kernel, corosync 1.4.4, cluster-3.1.93

 

I have 4 nodes; node hp4 is turned off for this test:

 

hp2:~# cman_tool nodes

Node  Sts   Inc   Joined               Name

   1   X      0                        hp4

   2   M   1232   2012-10-03 08:59:08  hp1

   3   M   1228   2012-10-03 08:58:58  hp3

   4   M   1220   2012-10-03 08:58:58  hp2

 

hp2:~# fence_tool ls

fence domain

member count  3

victim count  0

victim now    0

master nodeid 3

wait state    none

members       2 3 4

 

Everything runs fine so far (the fence_tool ls output matches on all nodes).

 

Now I unplug the network cable on hp1:

 

hp2:~# cman_tool nodes

Node  Sts   Inc   Joined               Name

   1   X      0                        hp4

   2   X   1232                        hp1

   3   M   1228   2012-10-03 08:58:58  hp3

   4   M   1220   2012-10-03 08:58:58  hp2

 

hp2:~# fence_tool ls

fence domain

member count  2

victim count  1

victim now    0

master nodeid 3

wait state    quorum

members       2 3 4

 

Same output on hp3, so far so good.

In the fenced log I can find the following entries:

 

hp2:~# cat /var/log/cluster/fenced.log

Oct 03 08:59:08 fenced fenced 1349169030 started

Oct 03 08:59:09 fenced fencing deferred to hp3

 

on hp3:

 

hp3:~# cat /var/log/cluster/fenced.log

Oct 03 08:57:12 fenced fencing node hp4

Oct 03 08:57:21 fenced fence hp4 success

 

hp2:~# dlm_tool ls

dlm lockspaces

name          rgmanager

id            0x5231f3eb

flags         0x00000004 kern_stop

change        member 3 joined 1 remove 0 failed 0 seq 2,2

members       2 3 4

new change    member 2 joined 0 remove 1 failed 1 seq 3,3

new status    wait_messages 0 wait_condition 1 fencing

new members   3 4

 

Same output on hp3.

 

Now I reconnect the network on hp1:

 

# cman_tool nodes

Node  Sts   Inc   Joined               Name

   1   X      0                        hp4

   2   M   1240   2012-10-03 09:07:41  hp1

   3   M   1228   2012-10-03 08:58:58  hp3

   4   M   1220   2012-10-03 08:58:58  hp2

 

So we have quorum again.

 

hp2:~# fence_tool ls

fence domain

member count  3

victim count  1

victim now    0

master nodeid 3

wait state    messages

members       2 3 4

 

Same output on hp3; hp1 is different:

 

hp1:~# fence_tool ls

fence domain

member count  3

victim count  2

victim now    0

master nodeid 3

wait state    messages

members       2 3 4
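
As a side note, the mismatch is easiest to see when the two fence_tool ls outputs are diffed field by field. A quick sketch (the parsing helper below is my own illustration, not part of the cluster tools; the sample strings are the hp2 and hp1 outputs from above):

```python
import re

def parse_fence_tool_ls(text):
    """Parse `fence_tool ls` output into a {field: value} dict.
    Field names and values are separated by runs of 2+ spaces."""
    info = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line == "fence domain":
            continue  # skip the header line and blanks
        parts = re.split(r"\s{2,}", line, maxsplit=1)
        if len(parts) == 2:
            info[parts[0]] = parts[1]
    return info

hp2_out = """\
fence domain
member count  3
victim count  1
victim now    0
master nodeid 3
wait state    messages
members       2 3 4
"""

# hp1 reports the same state except for the victim count
hp1_out = hp2_out.replace("victim count  1", "victim count  2")

a, b = parse_fence_tool_ls(hp2_out), parse_fence_tool_ls(hp1_out)
diff = {k: (a[k], b[k]) for k in a if a[k] != b[k]}
print(diff)  # -> {'victim count': ('1', '2')}
```

So the only field where hp1 disagrees with hp2/hp3 is the victim count (2 instead of 1).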

 

Here are the fenced dumps; maybe someone can see what is wrong here?

 

hp2:~# fence_tool dump

1349247553 receive_complete 3:3 len 232

1349247751 cluster node 2 removed seq 1236

1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2

1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2

1349247751 add_change cg 3 remove nodeid 2 reason 3

1349247751 add_change cg 3 m 2 j 0 r 1 f 1

1349247751 add_victims node 2

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default ring 4:1236 2 memb 4 3

1349247751 check_ringid done cluster 1236 cpg 4:1236

1349247751 check_quorum not quorate

1349247751 fenced:daemon ring 4:1236 2 memb 4 3

1349248061 cluster node 2 added seq 1240

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left

1349248061 cpg_mcast_joined retried 5 protocol

1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3

1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left

1349248061 add_change cg 4 joined nodeid 2

1349248061 add_change cg 4 m 3 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:default ring 2:1240 3 memb 2 4 3

1349248061 check_ringid done cluster 1240 cpg 2:1240

1349248061 check_quorum done

1349248061 send_start 4:4 flags 2 started 2 m 3 j 1 r 0 f 0

1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061

1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 join 1349247548 left 0 local quorum 1349248061

1349248061 receive_start 4:4 len 232

1349248061 match_change 4:4 skip cg 3 expect counts 2 0 1 1

1349248061 match_change 4:4 matches cg 4

1349248061 wait_messages cg 4 need 2 of 3

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_start 3:5 len 232

1349248061 match_change 3:5 skip cg 3 expect counts 2 0 1 1

1349248061 match_change 3:5 matches cg 4

1349248061 wait_messages cg 4 need 1 of 3

1349248061 receive_start 2:5 len 232

1349248061 match_change 2:5 skip cg 3 sender not member

1349248061 match_change 2:5 matches cg 4

1349248061 receive_start 2:5 add node with started_count 1

1349248061 wait_messages cg 4 need 1 of 3

 

hp3:~# fence_tool dump

1349247553 receive_complete 3:3 len 232

1349247751 cluster node 2 removed seq 1236

1349247751 fenced:daemon conf 2 0 1 memb 3 4 join left 2

1349247751 fenced:default conf 2 0 1 memb 3 4 join left 2

1349247751 add_change cg 4 remove nodeid 2 reason 3

1349247751 add_change cg 4 m 2 j 0 r 1 f 1

1349247751 add_victims node 2

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default ring 4:1236 2 memb 4 3

1349247751 check_ringid done cluster 1236 cpg 4:1236

1349247751 check_quorum not quorate

1349247751 fenced:daemon ring 4:1236 2 memb 4 3

1349248061 cluster node 2 added seq 1240

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left

1349248061 cpg_mcast_joined retried 5 protocol

1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3

1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 join 1349247548 left 0 local quorum 1349248061

1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 2 left

1349248061 add_change cg 5 joined nodeid 2

1349248061 add_change cg 5 m 3 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 4:1236

1349248061 fenced:default ring 2:1240 3 memb 2 4 3

1349248061 check_ringid done cluster 1240 cpg 2:1240

1349248061 check_quorum done

1349248061 send_start 3:5 flags 2 started 3 m 3 j 1 r 0 f 0

1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 join 1349247425 left 0 local quorum 1349248061

1349248061 receive_start 4:4 len 232

1349248061 match_change 4:4 skip cg 4 expect counts 2 0 1 1

1349248061 match_change 4:4 matches cg 5

1349248061 wait_messages cg 5 need 2 of 3

1349248061 receive_start 3:5 len 232

1349248061 match_change 3:5 skip cg 4 expect counts 2 0 1 1

1349248061 match_change 3:5 matches cg 5

1349248061 wait_messages cg 5 need 1 of 3

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 2 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 2 stateful merge

1349248061 receive_start 2:5 len 232

1349248061 match_change 2:5 skip cg 4 sender not member

1349248061 match_change 2:5 matches cg 5

1349248061 receive_start 2:5 add node with started_count 1

1349248061 wait_messages cg 5 need 1 of 3

 

hp1:~# fence_tool dump

1349247551 our_nodeid 2 our_name hp1

1349247552 logging mode 3 syslog f 160 p 6 logfile p 6 /var/log/cluster/fenced.log

1349247552 logfile cur mode 100644

1349247552 cpg_join fenced:daemon ...

1349247552 setup_cpg_daemon 10

1349247552 group_mode 3 compat 0

1349247552 fenced:daemon conf 3 1 0 memb 2 3 4 join 2 left

1349247552 fenced:daemon ring 2:1232 3 memb 2 4 3

1349247552 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349247552 daemon node 4 max 0.0.0.0 run 0.0.0.0

1349247552 daemon node 4 join 1349247552 left 0 local quorum 1349247551

1349247552 run protocol from nodeid 4

1349247552 daemon run 1.1.1 max 1.1.1

1349247552 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349247552 daemon node 3 max 0.0.0.0 run 0.0.0.0

1349247552 daemon node 3 join 1349247552 left 0 local quorum 1349247551

1349247552 receive_protocol from 2 max 1.1.1.0 run 0.0.0.0

1349247552 daemon node 2 max 0.0.0.0 run 0.0.0.0

1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551

1349247552 receive_protocol from 2 max 1.1.1.0 run 1.1.1.0

1349247552 daemon node 2 max 1.1.1.0 run 0.0.0.0

1349247552 daemon node 2 join 1349247552 left 0 local quorum 1349247551

1349247553 client connection 3 fd 13

1349247553 added 4 nodes from ccs

1349247553 cpg_join fenced:default ...

1349247553 fenced:default conf 3 1 0 memb 2 3 4 join 2 left

1349247553 add_change cg 1 joined nodeid 2

1349247553 add_change cg 1 m 3 j 1 r 0 f 0

1349247553 add_victims_init nodeid 1

1349247553 check_ringid cluster 1232 cpg 0:0

1349247553 fenced:default ring 2:1232 3 memb 2 4 3

1349247553 check_ringid done cluster 1232 cpg 2:1232

1349247553 check_quorum done

1349247553 send_start 2:1 flags 1 started 0 m 3 j 1 r 0 f 0

1349247553 receive_start 3:3 len 232

1349247553 match_change 3:3 matches cg 1

1349247553 save_history 1 master 3 time 1349247441 how 1

1349247553 wait_messages cg 1 need 2 of 3

1349247553 receive_start 2:1 len 232

1349247553 match_change 2:1 matches cg 1

1349247553 wait_messages cg 1 need 1 of 3

1349247553 receive_start 4:2 len 232

1349247553 match_change 4:2 matches cg 1

1349247553 wait_messages cg 1 got all 3

1349247553 set_master from 0 to complete node 3

1349247553 fencing deferred to hp3

1349247553 receive_complete 3:3 len 232

1349247553 receive_complete clear victim nodeid 1 init 1

1349247750 cluster node 3 removed seq 1236

1349247750 cluster node 4 removed seq 1236

1349247751 fenced:daemon conf 2 0 1 memb 2 4 join left 3

1349247751 fenced:daemon conf 1 0 1 memb 2 join left 4

1349247751 fenced:daemon ring 2:1236 1 memb 2

1349247751 fenced:default conf 2 0 1 memb 2 4 join left 3

1349247751 add_change cg 2 remove nodeid 3 reason 3

1349247751 add_change cg 2 m 2 j 0 r 1 f 1

1349247751 add_victims node 3

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default conf 1 0 1 memb 2 join left 4

1349247751 add_change cg 3 remove nodeid 4 reason 3

1349247751 add_change cg 3 m 1 j 0 r 1 f 1

1349247751 add_victims node 4

1349247751 check_ringid cluster 1236 cpg 2:1232

1349247751 fenced:default ring 2:1236 1 memb 2

1349247751 check_ringid done cluster 1236 cpg 2:1236

1349247751 check_quorum not quorate

1349248061 cluster node 3 added seq 1240

1349248061 cluster node 4 added seq 1240

1349248061 check_ringid cluster 1240 cpg 2:1236

1349248061 fenced:daemon conf 2 1 0 memb 2 3 join 3 left

1349248061 cpg_mcast_joined retried 6 protocol

1349248061 fenced:daemon conf 3 1 0 memb 2 3 4 join 4 left

1349248061 fenced:daemon ring 2:1240 3 memb 2 4 3

1349248061 receive_protocol from 4 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 4 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 4 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 4 stateful merge

1349248061 receive_protocol from 3 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 3 max 0.0.0.0 run 0.0.0.0

1349248061 daemon node 3 join 1349248061 left 1349247751 local quorum 1349248061

1349248061 daemon node 3 stateful merge

1349248061 fenced:default conf 2 1 0 memb 2 3 join 3 left

1349248061 add_change cg 4 joined nodeid 3

1349248061 add_change cg 4 m 2 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 2:1236

1349248061 fenced:default conf 3 1 0 memb 2 3 4 join 4 left

1349248061 add_change cg 5 joined nodeid 4

1349248061 add_change cg 5 m 3 j 1 r 0 f 0

1349248061 check_ringid cluster 1240 cpg 2:1236

1349248061 fenced:default ring 2:1240 3 memb 2 4 3

1349248061 check_ringid done cluster 1240 cpg 2:1240

1349248061 check_quorum done

1349248061 send_start 2:5 flags 2 started 1 m 3 j 1 r 0 f 0

1349248061 receive_start 4:4 len 232

1349248061 match_change 4:4 skip cg 2 created 1349247751 cluster add 1349248061

1349248061 match_change 4:4 skip cg 3 sender not member

1349248061 match_change 4:4 skip cg 4 sender not member

1349248061 match_change 4:4 matches cg 5

1349248061 receive_start 4:4 add node with started_count 2

1349248061 wait_messages cg 5 need 3 of 3

1349248061 receive_start 3:5 len 232

1349248061 match_change 3:5 skip cg 2 sender not member

1349248061 match_change 3:5 skip cg 3 sender not member

1349248061 match_change 3:5 skip cg 4 expect counts 2 1 0 0

1349248061 match_change 3:5 matches cg 5

1349248061 receive_start 3:5 add node with started_count 3

1349248061 wait_messages cg 5 need 3 of 3

1349248061 receive_start 2:5 len 232

1349248061 match_change 2:5 skip cg 2 expect counts 2 0 1 1

1349248061 match_change 2:5 skip cg 3 expect counts 1 0 1 1

1349248061 match_change 2:5 skip cg 4 expect counts 2 1 0 0

1349248061 match_change 2:5 matches cg 5

1349248061 wait_messages cg 5 need 2 of 3

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.0

1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061

1349248061 receive_protocol from 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 max 1.1.1.0 run 1.1.1.1

1349248061 daemon node 2 join 1349247552 left 0 local quorum 1349248061

 
