[Linux-cluster] node fails to join cluster after it was fenced
Luis Godoy Gonzalez
lgodoy at atichile.com
Wed Mar 7 17:19:27 UTC 2007
We are scheduling a maintenance window to reboot node 1. Below you can
find more configuration info.
So far we have had two problems that we would describe as "big
problems". One of them was solved with an rgmanager update, and the other
(stranger) one was solved by replacing a 10/100/1000 switch with a 10/100
switch (the kind used on our production platforms).
Below I also attach a generic diagram of this installation. This
particular installation has only one switch (normally we have two
switches for redundancy).
Thanks & Regards
Luis G.
================================================================================
[root at lvs-gt1 ~]# clustat
Member Status: Quorate
Member Name Status
------ ---- ------
lvs-gt2 Offline
lvs-gt1 Online, Local, rgmanager
Service Name Owner (Last) State
------- ---- ----- ------ -----
XXX1 lvs-gt1 started
XXX2 lvs-gt1 started
[root at lvs-gt1 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: lb_cluster
Cluster ID: 40372
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
Active subsystems: 4
Node name: lvs-gt1
Node addresses: 192.168.150.21
[root at lvs-gt1 ~]# cman_tool nodes
Node Votes Exp Sts Name
1 1 1 M lvs-gt1
2 1 1 X lvs-gt2
[root at lvs-gt1 ~]# cman_tool services
Service Name GID LID State Code
Fence Domain: "default" 1 2 run -
[1]
DLM Lock Space: "Magma" 3 4 run -
[1]
User: "usrm::manager" 2 3 run -
[1]
[root at lvs-gt1 ~]# uname -a
Linux lvs-gt1 2.6.9-22.EL #1 Mon Sep 19 18:20:28 EDT 2005 i686 athlon i386 GNU/Linux
[root at lvs-gt1 ~]# rpm -qa | grep cman
cman-kernel-2.6.9-39.5
cman-1.0.2-0
cman-kernel-hugemem-2.6.9-39.5
cman-kernheaders-2.6.9-39.5
cman-kernel-smp-2.6.9-39.5
[root at lvs-gt1 ~]# rpm -qa | grep -i ccs
ccs-1.0.2-0
[root at lvs-gt1 ~]# rpm -qa | grep -i fence
fence-1.32.6-0
[root at lvs-gt1 ~]# rpm -qa | grep -i rgma
rgmanager-1.9.53-0
OTHER NODE
==========
[root at lvs-gt1 log]# ssh lvs-gt2
Last login: Tue Mar 6 17:57:07 2007 from 172.22.22.52
[root at lvs-gt2 ~]# tail /var/log/messages
Mar 7 09:47:06 lvs-gt2 kernel: CMAN: sending membership request
Mar 7 09:47:41 lvs-gt2 last message repeated 7 times
Mar 7 09:47:56 lvs-gt2 last message repeated 3 times
Mar 7 09:47:57 lvs-gt2 sshd(pam_unix)[13006]: session opened for user
root by root(uid=0)
Mar 7 09:48:01 lvs-gt2 kernel: CMAN: sending membership request
Mar 7 09:48:01 lvs-gt2 crond(pam_unix)[12936]: session closed for user root
Mar 7 09:48:01 lvs-gt2 crond(pam_unix)[13039]: session opened for user
root by (uid=0)
Mar 7 09:48:01 lvs-gt2 su(pam_unix)[13044]: session opened for user
admin by (uid=0)
Mar 7 09:48:01 lvs-gt2 su(pam_unix)[13044]: session closed for user admin
Mar 7 09:48:06 lvs-gt2 kernel: CMAN: sending membership request
[root at lvs-gt2 ~]#
[root at lvs-gt2 ~]# clustat
Segmentation fault
[root at lvs-gt2 ~]# cman_tool status
Protocol version: 5.0.1
Config version: 10
Cluster name: lb_cluster
Cluster ID: 40372
Cluster Member: No
Membership state: Joining
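
The two `cman_tool status` outputs above differ in the fields that matter: node 1 reports "Cluster Member: Yes" while node 2 stays in "Joining". As a minimal sketch, assuming the output keeps the "Key: value" layout shown in this post, the comparison could be scripted as follows. The function names are hypothetical helpers, not part of cman, and the sample text is copied from the outputs above.

```python
def parse_cman_status(text):
    """Parse 'Key: value' lines (as printed by `cman_tool status`) into a dict."""
    fields = {}
    for line in text.splitlines():
        # Split on the first colon only, so values containing ':' stay intact.
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


def is_cluster_member(fields):
    """True only if the node reports itself as a full cluster member."""
    return (fields.get("Cluster Member") == "Yes"
            and fields.get("Membership state") == "Cluster-Member")


# Excerpts copied from the outputs in this post.
NODE1_STATUS = """\
Cluster Member: Yes
Membership state: Cluster-Member
Nodes: 1
Expected_votes: 1
Total_votes: 1
Quorum: 1
"""

NODE2_STATUS = """\
Cluster Member: No
Membership state: Joining
"""

for name, status in (("lvs-gt1", NODE1_STATUS), ("lvs-gt2", NODE2_STATUS)):
    fields = parse_cman_status(status)
    state = fields.get("Membership state", "unknown")
    verdict = "member" if is_cluster_member(fields) else "NOT a member"
    print(f"{name}: {verdict} (state: {state})")
```

Run periodically on each node, a check like this would flag a node that stays in "Joining", which is the symptom seen on lvs-gt2.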
Patrick Caulfield wrote:
> Luis Godoy Gonzalez wrote:
>
>> Hi
>>
>> The "iptables" service is not running on either node.
>> We are thinking of updating the platform (RHEL4 U4, RHCS 4U4) but this is
>> not easy right now because we have several servers in production.
>> Another reason not to do the version update is that we are waiting
>> for Update 5 of RHEL4 or the production release of RHEL5.
>> At the moment we have only updated "rgmanager" at some sites (we had several
>> issues with the rgmanager of RHCS4 Update 2).
>>
>>
>
> It is really rather odd. Node 1 can obviously see the joinreq messages - at
> least tcpdump can, but cman is either not seeing them or ignoring them.
>
> What really bothers me is that this seems to be affecting U2 and U4 - if both of
> you were using U3 I would think no more of it :)
>
> Annoyingly it's hard to debug at this level (you can't strace a kernel thread!).
> I'm pretty sure that a reboot of node1 would fix the problem but that's hardly
> helpful.
>
>
-------------- next part --------------
A non-text attachment was scrubbed...
Name: generico.png
Type: image/png
Size: 31248 bytes
Desc: not available
URL: <http://listman.redhat.com/archives/linux-cluster/attachments/20070307/234bbb10/attachment.png>