
Re: [Linux-cluster] Some nodes won't join after being fenced



I have occasionally run into this problem, too. Sometimes I can work around it by using chkconfig to turn clvmd, cman, and rgmanager off, rebooting, then manually starting cman, rgmanager, and clvmd (in that order). Usually, after that, I am able to fence the node(s) and they rejoin automatically (after re-enabling automatic startup with chkconfig, of course). I know this workaround doesn't explain *why* it happens, but it has more than once helped me get my cluster nodes back online without having to reboot all the nodes.
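
As a rough sketch of those steps, assuming the stock cman, rgmanager, and clvmd init scripts on CentOS 5: the function names and the DRY_RUN switch are made up for illustration, and the reboot between the first and second step is left to you. With DRY_RUN=1 (the default in this sketch) the commands are only printed, not executed.

```shell
#!/bin/sh
# Sketch of the workaround described above. Function names and DRY_RUN
# are illustrative assumptions, not part of any cluster tooling.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# Step 1 (before the reboot): keep the cluster daemons from starting at boot.
disable_autostart() {
    for svc in clvmd cman rgmanager; do
        run chkconfig "$svc" off
    done
}

# Step 2 (after the reboot): start the daemons by hand, in the order given above.
manual_start() {
    for svc in cman rgmanager clvmd; do
        run service "$svc" start || return 1
    done
}

# Step 3 (once the node has rejoined): restore automatic startup.
enable_autostart() {
    for svc in cman rgmanager clvmd; do
        run chkconfig "$svc" on
    done
}
```

To use it, source the file on the affected node, call disable_autostart, reboot, call manual_start, and call enable_autostart once the node is back in the cluster. Set DRY_RUN=0 to actually run the commands.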

On Thu, Jul 31, 2008 at 1:42 PM, Mailing List <ml adamdein com> wrote:
Hello,

I currently have a 9-node CentOS 5.1 cman/GFS cluster which I've managed to break.

It is broken in almost exactly the same way as stated in these two previous threads:

http://www.spinics.net/lists/cluster/msg10304.html
http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html

However, I can find no resolution in the archives. My only guaranteed fix at this point is a cold restart of all nodes, which seems ridiculous to me (i.e., I'm missing something).

To add a little detail: I have nodes cluster1 through cluster9. Nodes 7 and 8 are broken. When I fence/reboot them, cman starts but times out on starting fencing. cman_tool nodes shows them as joined, but the fence domain looks broken.

Any ideas?

I have included some information for a good node, bad node, and /var/log/messages from a good node that did the fencing.

Good Node:

[root cluster1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
  1   M    768   2008-07-31 12:47:19  cluster1-rhc
  2   M    776   2008-07-31 12:47:37  cluster2-rhc
  3   M    772   2008-07-31 12:47:19  cluster3-rhc
  4   M    788   2008-07-31 12:56:20  cluster4-rhc
  5   M    772   2008-07-31 12:47:19  cluster5-rhc
  6   M    784   2008-07-31 12:52:50  cluster6-rhc
  7   M    808   2008-07-31 13:24:24  cluster7-rhc
  8   X    800                        cluster8-rhc
  9   M    772   2008-07-31 12:47:19  cluster9-rhc
[root cluster1 ~]# cman_tool services
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root cluster1 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster1-rhc
Node ID: 1
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.81
[root cluster1 ~]# group_tool
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root cluster1 ~]#
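
The mismatch in the output above (node 7 is a cman member but is not listed in the fence domain) can also be checked mechanically. This is only a sketch: it assumes the column layout shown above, and the function names are invented for illustration. It reads saved copies of the `cman_tool nodes` and `group_tool` output and prints the IDs of nodes that are members ("M") but missing from the fence group.

```shell
#!/bin/sh
# Print node IDs with status "M" from `cman_tool nodes` output on stdin.
cman_members() {
    awk '$2 == "M" { print $1 }'
}

# Print the node IDs from the bracketed member list that follows the
# "fence" line in `group_tool` (or `cman_tool services`) output on stdin.
fence_members() {
    awk '$1 == "fence" {
        getline                      # the next line holds "[1 2 3 ...]"
        gsub(/\[|\]/, "")            # drop the brackets
        for (i = 1; i <= NF; i++) print $i
    }'
}

# $1: file with `cman_tool nodes` output
# $2: file with `group_tool` output
# Prints each cluster member that is absent from the fence domain.
missing_from_fence() {
    fence_ids="$(fence_members < "$2")"
    for id in $(cman_members < "$1"); do
        found=0
        for f in $fence_ids; do
            if [ "$id" = "$f" ]; then
                found=1
            fi
        done
        if [ "$found" -eq 0 ]; then
            echo "$id"
        fi
    done
    return 0
}
```

For example, after `cman_tool nodes > nodes.txt` and `group_tool > groups.txt` on cluster1, `missing_from_fence nodes.txt groups.txt` would print 7 for the output shown above. Node 8 is not reported because its status is X rather than M.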


Bad/broken Node:

[root cluster7 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
  1   M    808   2008-07-31 13:24:24  cluster1-rhc
  2   M    808   2008-07-31 13:24:24  cluster2-rhc
  3   M    808   2008-07-31 13:24:24  cluster3-rhc
  4   M    808   2008-07-31 13:24:24  cluster4-rhc
  5   M    808   2008-07-31 13:24:24  cluster5-rhc
  6   M    808   2008-07-31 13:24:24  cluster6-rhc
  7   M    804   2008-07-31 13:24:24  cluster7-rhc
  8   X      0                        cluster8-rhc
  9   M    808   2008-07-31 13:24:24  cluster9-rhc
[root cluster7 ~]# cman_tool services
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root cluster7 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster7-rhc
Node ID: 7
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.87
[root cluster7 ~]# group_tool
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root cluster7 ~]#


/var/log/messages:

Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was successful
Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was successful
Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from 12.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high seq received 89
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 324
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.89:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.87)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.88)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.82
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.83
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.84
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.85
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.86
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.89
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 2
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 3
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 4
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 5
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 6
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 9
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high seq received 68
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 328
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.87:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member 10.128.161.89:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.82
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.83
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.84
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.85
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.86
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.89
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 6
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 9
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 2
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 3
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 4
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 5

Thanks!

Adam

--
Linux-cluster mailing list
Linux-cluster redhat com
https://www.redhat.com/mailman/listinfo/linux-cluster

