
Re: [Linux-cluster] Some nodes won't join after being fenced



We seem to have found part of the culprit.
 
We're using an Extreme switch that handles all of our traffic in separate VLANs, and the IGMP bits of the ExtremeOS seem to be interfering with the cluster's ability to recover itself from such an episode.
 
At the moment we're leaning towards the Juniper switch: we moved to an identically configured (as far as ports and VLANs go) Juniper EX-4200, and the cluster was able to recover itself after a single node (of nine) was fenced.
 
On the Extreme, by contrast, each node needed to be fenced in turn before the cluster could recover fully.  By "recover fully" I mean each node being able to mount the GFS filesystem r/w and actually write and delete test files on the mount point.
 
Our testing continues, and we're trying to come up with "real" evidence, such as proof that some parts of the multicast traffic are or aren't being handled properly.  So far the empirical evidence supports the conclusions above.
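To make the multicast question concrete, a minimal probe along these lines can help: run the receiver on one node and the sender on another, and see whether datagrams to the cluster's multicast group actually cross the switch. This is only a sketch, not part of the cluster stack; the group address mirrors the one cman reports, but the port and function names are made up for the test and are NOT the ones openais/cman uses.

```python
import socket
import struct

GROUP = "239.192.6.148"  # same group cman_tool status reports for this cluster
PORT = 15405             # hypothetical probe port, NOT cman's own port

def make_receiver(group, port, timeout=5.0):
    """Open a UDP socket, join the multicast group, and listen on `port`."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", port))
    # IP_ADD_MEMBERSHIP takes the group address plus the local interface
    # (0.0.0.0 = let the kernel pick); this is what triggers the IGMP join
    # the switch's snooping logic is supposed to honor.
    mreq = struct.pack("4s4s", socket.inet_aton(group),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(timeout)
    return sock

def send_probe(group, port, payload=b"mcast-probe"):
    """Fire a single datagram at the multicast group."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    # TTL > 1 in case the probe has to cross a routed hop
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, 2)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_LOOP, 1)
    sock.sendto(payload, (group, port))
    sock.close()
```

If the receiver times out on a remote node while a same-host run succeeds, that points at the switch (IGMP snooping pruning the group) rather than at the hosts.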
 
-ted

 
On 7/31/08, Brandon Young <bkyoung gmail com> wrote:
I have occasionally run into this problem, too.  I have found that sometimes I can work around it by chkconfig'ing clvmd, cman, and rgmanager off, rebooting, then manually starting cman, rgmanager, and clvmd (in that order).  Usually, after that, I am able to fence the node(s) and they will rejoin automatically (after re-enabling automatic startup with chkconfig, of course).  I know this workaround doesn't explain *why* it happens, but it has more than once helped me get my cluster nodes back online without having to reboot all the nodes.


On Thu, Jul 31, 2008 at 1:42 PM, Mailing List <ml adamdein com> wrote:
Hello,

I currently have a 9 node centos 5.1 cman/gfs cluster which I've managed to break.

It is broken in almost exactly the same way as stated in these two previous threads:

http://www.spinics.net/lists/cluster/msg10304.html
http://www.redhat.com/archives/linux-cluster/2008-May/msg00060.html

However, I can find no resolution in the archives. My only guaranteed resolution at this point is a cold restart of all nodes, which to me seems ridiculous (i.e., I'm missing something).

To add a little detail, I have nodes cluster1...9. Nodes 7 & 8 are broken. When I fence/reboot them, cman starts but times out on starting fencing. cman_tool nodes shows them as joined, but the fence domain looks broken.

Any ideas?

I have included some information for a good node, bad node, and /var/log/messages from a good node that did the fencing.

Good Node:

[root cluster1 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
  1   M    768   2008-07-31 12:47:19  cluster1-rhc
  2   M    776   2008-07-31 12:47:37  cluster2-rhc
  3   M    772   2008-07-31 12:47:19  cluster3-rhc
  4   M    788   2008-07-31 12:56:20  cluster4-rhc
  5   M    772   2008-07-31 12:47:19  cluster5-rhc
  6   M    784   2008-07-31 12:52:50  cluster6-rhc
  7   M    808   2008-07-31 13:24:24  cluster7-rhc
  8   X    800                        cluster8-rhc
  9   M    772   2008-07-31 12:47:19  cluster9-rhc
[root cluster1 ~]# cman_tool services
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root cluster1 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster1-rhc
Node ID: 1
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.81
[root cluster1 ~]# group_tool
type             level name      id       state
fence            0     default   00010003 FAIL_START_WAIT
[1 2 3 4 5 6 9]
dlm              1     testgfs1  00020005 none
[1 2 3 4 5 6]
gfs              2     testgfs1  00010005 none
[1 2 3 4 5 6]
[root cluster1 ~]#


Bad/broken Node:

[root cluster7 ~]# cman_tool nodes
Node  Sts   Inc   Joined               Name
  1   M    808   2008-07-31 13:24:24  cluster1-rhc
  2   M    808   2008-07-31 13:24:24  cluster2-rhc
  3   M    808   2008-07-31 13:24:24  cluster3-rhc
  4   M    808   2008-07-31 13:24:24  cluster4-rhc
  5   M    808   2008-07-31 13:24:24  cluster5-rhc
  6   M    808   2008-07-31 13:24:24  cluster6-rhc
  7   M    804   2008-07-31 13:24:24  cluster7-rhc
  8   X      0                        cluster8-rhc
  9   M    808   2008-07-31 13:24:24  cluster9-rhc
[root cluster7 ~]# cman_tool services
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root cluster7 ~]# cman_tool status
Version: 6.1.0
Config Version: 13
Cluster Name: test
Cluster Id: 1678
Cluster Member: Yes
Cluster Generation: 808
Membership state: Cluster-Member
Nodes: 8
Expected votes: 9
Total votes: 8
Quorum: 5
Active subsystems: 7
Flags: Dirty
Ports Bound: 0
Node name: cluster7-rhc
Node ID: 7
Multicast addresses: 239.192.6.148
Node addresses: 10.128.161.87
[root cluster7 ~]# group_tool
type             level name     id       state
fence            0     default  00000000 JOIN_STOP_WAIT
[1 2 3 4 5 6 7 9]
[root cluster7 ~]#


/var/log/messages:

Jul 31 13:20:54 cluster3 fence_node[3813]: Fence of "cluster7-rhc" was successful
Jul 31 13:21:03 cluster3 fence_node[3815]: Fence of "cluster8-rhc" was successful
Jul 31 13:21:11 cluster3 openais[3084]: [TOTEM] entering GATHER state from 12.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Saving state aru 89 high seq received 89
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 324
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 7
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 kernel: dlm: closing connection to node 8
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.89:
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] previous ring seq 800 rep 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] aru 89 high delivered 89 received flag 1
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.87)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.88)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:21:16 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:21:16 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.81
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.82
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.83
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.84
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.85
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.86
Jul 31 13:21:16 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.89
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 2
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 3
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 4
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 5
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 6
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 9
Jul 31 13:21:16 cluster3 openais[3084]: [CPG  ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering GATHER state from 11.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Saving state aru 68 high seq received 68
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Storing new sequence id for ring 328
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering COMMIT state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering RECOVERY state.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [0] member 10.128.161.81:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [1] member 10.128.161.82:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [2] member 10.128.161.83:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [3] member 10.128.161.84:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [4] member 10.128.161.85:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [5] member 10.128.161.86:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [6] member 10.128.161.87:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 9 high delivered 9 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] position [7] member 10.128.161.89:
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] previous ring seq 804 rep 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] aru 68 high delivered 68 received flag 1
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] Did not need to originate any messages in recovery.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] CLM CONFIGURATION CHANGE
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] New Configuration:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.81)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.82)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.83)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.84)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.85)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.86)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.89)
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Left:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] Members Joined:
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ]         r(0) ip(10.128.161.87)
Jul 31 13:24:24 cluster3 openais[3084]: [SYNC ] This node is within the primary component and will provide service.
Jul 31 13:24:24 cluster3 openais[3084]: [TOTEM] entering OPERATIONAL state.
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.81
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.82
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.83
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.84
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.85
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.86
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.87
Jul 31 13:24:24 cluster3 openais[3084]: [CLM  ] got nodejoin message 10.128.161.89
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 6
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 9
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 1
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 2
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 3
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 4
Jul 31 13:24:24 cluster3 openais[3084]: [CPG  ] got joinlist message from node 5

Thanks!

Adam

--
Linux-cluster mailing list
Linux-cluster redhat com
https://www.redhat.com/mailman/listinfo/linux-cluster



