[Linux-cluster] Some nodes won't join after being fenced

Mailing List ml at adamdein.com
Thu Jul 31 18:42:13 UTC 2008


Hello,

I currently have a 9-node CentOS 5.1 cman/gfs cluster which I've managed
to break.

It is broken in almost exactly the same way as described in these two
previous threads:

http://www.spinics.net/lists/cluster/msg10304.html
http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html

However, I can find no resolution in the archives. The only fix I know
will work at this point is a cold restart of all nodes, which seems
ridiculous to me (i.e., I'm probably missing something).

To add a little detail: I have nodes cluster1 through cluster9. Nodes 7
and 8 are broken. When I fence/reboot them, cman starts but times out
while starting fencing. cman_tool nodes shows them as joined, but the
fence domain looks broken.

Any ideas?

I have included some information from a good node and a bad node, plus
/var/log/messages from a good node that did the fencing.

Good Node:

[root@cluster1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
    1   M    768   2008-07-31 12:47:19  cluster1-rhc
    2   M    776   2008-07-31 12:47:37  cluster2-rhc
    3   M    772   2008-07-31 12:47:19  cluster3-rhc
    4   M    788   2008-07-31 12:56:20  cluster4-rhc
    5   M    772   2008-07-31 12:47:19  cluster5-rhc
    6   M    784   2008-07-31 12:52:50  cluster6-rhc
    7   M    808   2008-07-31 13:24:24  cluster7-rhc
    8   X    800                        cluster8-rhc
    9   M    772   2008-07-31 12:47:19  cluster9-rhc
[root@cluster1 ~]# cman_tool services
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root@cluster1 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster1-rhc
Node ID: 1
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.81
[root@cluster1 ~]# group_tool
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root@cluster1 ~]#
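
For anyone comparing this kind of output across several nodes, here is a minimal Python sketch that parses group_tool-style text and lists the groups that are not idle. The sample text is copied from the cluster1 output above. The function names (parse_groups, stuck_groups) are made up for illustration, and treating the state "none" as "no change in progress" is my assumption about how to read this output, not something taken from the tool's documentation.

```python
# Sample copied from the cluster1 group_tool output above.
SAMPLE = """\
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
"""


def parse_groups(text):
    """Return one dict per group from group_tool-style output (hypothetical helper)."""
    groups = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("type"):
            continue  # skip blank lines and the header row
        if line.startswith("["):
            # A bracketed member list belongs to the most recently seen group.
            if groups:
                groups[-1]["members"] = [int(n) for n in line.strip("[]").split()]
            continue
        fields = line.split()
        if len(fields) == 5:
            gtype, level, name, gid, state = fields
            groups.append({"type": gtype, "level": int(level), "name": name,
                           "id": gid, "state": state, "members": []})
    return groups


def stuck_groups(groups):
    """Groups whose state is anything other than 'none' (assumed to mean idle)."""
    return [g for g in groups if g["state"] != "none"]


if __name__ == "__main__":
    for g in stuck_groups(parse_groups(SAMPLE)):
        print(g["type"], g["name"], g["state"], g["members"])
```

On the sample above, only the fence group is reported, in state FAIL_START_WAIT, with a member list that does not include node 7.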


Bad/broken Node:

[root@cluster7 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
    1   M    808   2008-07-31 13:24:24  cluster1-rhc
    2   M    808   2008-07-31 13:24:24  cluster2-rhc
    3   M    808   2008-07-31 13:24:24  cluster3-rhc
    4   M    808   2008-07-31 13:24:24  cluster4-rhc
    5   M    808   2008-07-31 13:24:24  cluster5-rhc
    6   M    808   2008-07-31 13:24:24  cluster6-rhc
    7   M    804   2008-07-31 13:24:24  cluster7-rhc
    8   X      0                        cluster8-rhc
    9   M    808   2008-07-31 13:24:24  cluster9-rhc
[root@cluster7 ~]# cman_tool services
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root@cluster7 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster7-rhc
Node ID: 7
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.87
[root@cluster7 ~]# group_tool
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root@cluster7 ~]#
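
To make the mismatch between the two nodes explicit, the sketch below compares the member lists copied from the outputs above and rechecks the quorum figure. The variable and function names are hypothetical. The quorum formula (integer half of the expected votes, plus one) is an assumption on my part; it does reproduce the "Quorum: 5" shown for 9 expected votes.

```python
# Values copied by hand from the cluster1 and cluster7 outputs above.
cman_members = {1, 2, 3, 4, 5, 6, 7, 9}            # nodes with Sts "M" in cman_tool nodes
fence_seen_by_cluster1 = {1, 2, 3, 4, 5, 6, 9}     # fence group members on the good node
fence_seen_by_cluster7 = {1, 2, 3, 4, 5, 6, 7, 9}  # fence group members on the bad node


def quorum_threshold(expected_votes):
    """Assumed simple-majority rule: more than half of the expected votes."""
    return expected_votes // 2 + 1


def disagreement(view_a, view_b):
    """Node IDs present in exactly one of the two views."""
    return sorted(view_a ^ view_b)


if __name__ == "__main__":
    print("quorum for 9 expected votes:", quorum_threshold(9))
    print("cluster is quorate with 8 votes:", 8 >= quorum_threshold(9))
    print("fence views disagree on nodes:",
          disagreement(fence_seen_by_cluster1, fence_seen_by_cluster7))
    print("cman members missing from cluster1's fence group:",
          sorted(cman_members - fence_seen_by_cluster1))
```

With these values the cluster is quorate, and the only node the two fence views disagree on is node 7: it is a cman member and thinks it is joining the fence domain, but cluster1 does not list it there.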


/var/log/messages:

Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was 
successful
Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was 
successful
Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state 
from 12.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state 
from 11.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high 
seq received 89
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id 
for ring 324
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member 
10.128.161.81:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 
rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 
received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member 
10.128.161.82:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 
rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 
received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member 
10.128.161.83:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 
rep 10.128.161.81
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 
received flag 1
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member 
10.128.161.84:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 
rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 
received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member 
10.128.161.85:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 
rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 
received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member 
10.128.161.86:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 
rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 
received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member 
10.128.161.89:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 
rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 
received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to 
originate any messages in recovery.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.87)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.88)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the 
primary component and will provide service.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.82
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.83
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.84
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.85
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.86
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.89
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 2
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 3
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 4
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 5
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 6
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 9
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state 
from 11.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high 
seq received 68
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id 
for ring 328
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member 
10.128.161.81:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member 
10.128.161.82:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member 
10.128.161.83:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member 
10.128.161.84:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member 
10.128.161.85:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member 
10.128.161.86:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member 
10.128.161.87:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member 
10.128.161.89:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 
rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 
received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to 
originate any messages in recovery.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) 
ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the 
primary component and will provide service.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.82
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.83
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.84
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.85
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.86
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 
10.128.161.89
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 6
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 9
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 2
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 3
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 4
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message 
from node 5

Thanks!

Adam



