[Linux-cluster] two node cluster with IP tiebreaker failed.

Wed Feb 25 08:17:49 UTC 2009

I think there is a problem. The output of "cman_tool status" shows:

Nodes: 2
Expected votes: 3
Total votes: 2

According to your cluster.conf, if both nodes and the qdisk are online,
"Total votes" should be 3. Probably qdiskd is not running; you can
use "cman_tool nodes" to check whether the qdisk is working.
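
To make the vote arithmetic concrete, here is a minimal sketch. It assumes
the usual majority rule (quorum = floor(expected_votes / 2) + 1), which is
consistent with the "Quorum: 2" line in your output. The function and
variable names are made up for illustration; they are not part of cman.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum under a simple majority rule (assumed)."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the votes currently present reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)


# Values taken from the posted cluster.conf.
EXPECTED = 3     # <cman expected_votes="3" .../>
NODE_VOTES = 1   # votes="1" on each <clusternode>
QDISK_VOTES = 1  # votes="1" on <quorumd>

# Both nodes up, qdisk vote missing: this is the "Total votes: 2" case.
both_nodes_no_qdisk = 2 * NODE_VOTES
# One node lost and the qdisk vote is still missing.
one_node_no_qdisk = NODE_VOTES
# One node lost, but the qdisk contributes its vote.
one_node_with_qdisk = NODE_VOTES + QDISK_VOTES

print("quorum threshold:", quorum_threshold(EXPECTED))
print("both nodes, no qdisk:", is_quorate(both_nodes_no_qdisk, EXPECTED))
print("one node, no qdisk:", is_quorate(one_node_no_qdisk, EXPECTED))
print("one node, with qdisk:", is_quorate(one_node_with_qdisk, EXPECTED))
```

Under that assumption, a single surviving node holds only 1 vote without
the qdisk and falls below the threshold of 2, which would explain the
"quorum lost, blocking activity" message on as-2.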

Mockey Chen wrote:
> Mockey Chen wrote:
>   
>> Kein He wrote:
>>   
>>     
>>> Hi Mockey,
>>>
>>> Could you please attach the output from "cman_tool status" and
>>> "cman_tool nodes -f"?
>>>
>>>     
>>>       
>> Thanks for your response.
>>
>> I tried to run cman_tool status on as-2, but it hung with no output, and
>> even Ctrl+C had no effect.
>>   
>>     
> I manually rebooted as-1, and the problem was solved.
>
> Here is the output of cman_tool:
>
> [root@as-1 ~]# cman_tool status
> Version: 6.1.0
> Config Version: 19
> Cluster Name: azerothcluster
> Cluster Id: 20148
> Cluster Member: Yes
> Cluster Generation: 76
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 3
> Total votes: 2
> Quorum: 2 
> Active subsystems: 8
> Flags: Dirty
> Ports Bound: 0 177 
> Node name: as-1.localdomain
> Node ID: 1
> Multicast addresses: 239.192.78.3
> Node addresses: 10.56.150.3
> [root@as-1 ~]# cman_tool status -f
> Version: 6.1.0
> Config Version: 19
> Cluster Name: azerothcluster
> Cluster Id: 20148
> Cluster Member: Yes
> Cluster Generation: 76
> Membership state: Cluster-Member
> Nodes: 2
> Expected votes: 3
> Total votes: 2
> Quorum: 2 
> Active subsystems: 8
> Flags: Dirty
> Ports Bound: 0 177 
> Node name: as-1.localdomain
> Node ID: 1
> Multicast addresses: 239.192.78.3
> Node addresses: 10.56.150.3
>
>
> It seems the cluster cannot fence one of the nodes. How can I solve this?
>
>   
>> I opened a new window and could ssh to as-2, but after logging in I could
>> not do anything; even a simple 'ls' command hung.
>>
>> It seems the system stays alive but does not provide any service. Really bad.
>>
>> Is there any way to debug this issue?
>>   
>>     
>>> Mockey Chen wrote:
>>>     
>>>       
>>>> Hi,
>>>>
>>>> I have a two-node cluster. To avoid split-brain, I use iLO as the fence
>>>> device and an IP tiebreaker. Here is my /etc/cluster/cluster.conf:
>>>> <?xml version="1.0"?>
>>>> <cluster alias="azerothcluster" config_version="19" name="azerothcluster">
>>>>     <cman expected_votes="3" two_node="0"/>
>>>>     <clusternodes>
>>>>         <clusternode name="as-1.localdomain" nodeid="1" votes="1">
>>>>             <fence>
>>>>                 <method name="1">
>>>>                     <device name="ilo1"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>         <clusternode name="as-2.localdomain" nodeid="2" votes="1">
>>>>             <fence>
>>>>                 <method name="1">
>>>>                     <device name="ilo2"/>
>>>>                 </method>
>>>>             </fence>
>>>>         </clusternode>
>>>>     </clusternodes>
>>>>     <quorumd interval="1" tko="10" votes="1" label="pingtest">
>>>>         <heuristic program="ping 10.56.150.1 -c1 -t1" score="1" interval="2" tko="3"/>
>>>>     </quorumd>
>>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>>     <fencedevices>
>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18" login="power" name="ilo1" passwd="pass"/>
>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19" login="power" name="ilo2" passwd="pass"/>
>>>>     </fencedevices>
>>>> ...
>>>> ...
>>>>
>>>> To test the case where one node loses its heartbeat, I disabled the
>>>> Ethernet card (eth0) on as-1. I expected as-2 to take over the services
>>>> from as-1 and as-1 to reboot. What actually happened is that as-1 lost its
>>>> connection to as-2; as-2 detected this and tried to re-form the cluster,
>>>> but failed. Here is the syslog from as-2:
>>>>
>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost in the
>>>> OPERATIONAL state.
>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast socket
>>>> recv buffer size (288000 bytes).
>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast socket
>>>> send buffer size (262142 bytes).
>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>> from 2.
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>> from 0.
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>>>> because I am the rep.
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru 1f4 high
>>>> seq received 1f4
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence id for
>>>> ring 2c
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT state.
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY state.
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>>>> 10.56.150.4:
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq 40 rep
>>>> 10.56.150.3
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high delivered 1f4
>>>> received flag 1
>>>>
>>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved Feb 24 21:25:40 as-2
>>>> openais[4139]: [TOTEM] Did not need to originate any messages in
>>>> recovery.
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial ORF token
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.3)
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>>>> activity
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION CHANGE
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0) ip(10.56.150.4)
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is within the
>>>> primary component and will provide service.
>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate.  Refusing
>>>> connection.
>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL state.
>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>>>> Connection refused
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
>>>> 10.56.150.4
>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist message from
>>>> node 2
>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting something
>>>> evil.
>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get: Invalid
>>>> request descriptor
>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified (-111).
>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>> evil.
>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get: Invalid
>>>> request descriptor
>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified (-21).
>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting something
>>>> evil.
>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing disconnect:
>>>> Invalid request descriptor
>>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address record for
>>>> 10.56.150.144 on eth0.
>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt (IP_ADD_MEMBERSHIP):
>>>> Address already in use
>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>>
>>>>
>>>>
>>>>
>>>> I also found some errors in as-1's syslog:
>>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed changing RG
>>>> status
>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for eth0: Not
>>>> detected
>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on eth0...
>>>> ...
>>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>>>> infrastructure after 30 seconds.
>>>> ...
>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>> infrastructure after 60 seconds.
>>>> ...
>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>> infrastructure after 90 seconds.
>>>>
>>>>
>>>> Any comments are appreciated!
>>>>
>>>> -- 
>>>> Linux-cluster mailing list
>>>> Linux-cluster at redhat.com
>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>   
>>>>       
>>>>         