
Re: [Linux-cluster] two node cluster with IP tiebreaker failed.



ext Kein He wrote:
>
> Unfortunately, you need a shared disk to run qdisk; it cannot work
> in "diskless" mode right now.
>
Is there a way to avoid it?

Unfortunately, I do not have a shared disk.
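For what it's worth, the usual way to run a two-node cluster without qdisk is CMAN's two-node mode, which lets either node remain quorate with a single vote, at the cost of relying entirely on fencing to resolve split-brain. A minimal sketch, replacing the expected_votes="3" cman line in this cluster.conf:

```xml
<!-- Two-node mode: either node keeps quorum alone; a fencing race
     resolves split-brain instead of a tiebreaker vote -->
<cman two_node="1" expected_votes="1"/>
```

Note that two_node="1" requires expected_votes="1", and you lose the IP-tiebreaker behaviour that qdisk's ping heuristic provides.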

>
>> ext Brett Cave wrote:
>>  
>>> On Wed, Feb 25, 2009 at 11:45 AM, Mockey Chen <mockey chen nsn com>
>>> wrote:
>>>      
>>>> ext Kein He wrote:
>>>>          
>>>>> I think there is a problem; your "cman_tool status" shows:
>>>>>
>>>>> Nodes: 2
>>>>> Expected votes: 3
>>>>> Total votes: 2
>>>>>
>>>>>
>>>>> According to your cluster.conf, if all nodes and the qdisk are online,
>>>>> "Total votes" must be "3". Probably qdiskd is not running; you
>>>>> can use "cman_tool nodes" to check whether qdisk is working.
>>>>>
>>>>>               
>>>> Yes, here is "cman_tool nodes" output:
>>>> Node  Sts   Inc   Joined               Name
>>>>   1   M    112   2009-02-25 03:05:19  as-1.localdomain
>>>>   2   M    104   2009-02-25 03:05:19  as-2.localdomain
>>>>
>>>> A question: how can I check whether qdisk is running, and how do I
>>>> run it?
>>>>           
>>> [root blade3 ~]# service qdiskd status
>>> qdiskd (pid 2832) is running...
>>> [root blade3 ~]# pgrep qdisk -l
>>> 2832 qdiskd
>>> [root blade3 ~]# cman_tool nodes
>>> Node  Sts   Inc   Joined               Name
>>>    0   M      0   2009-02-19 16:11:55  /dev/sda5     ## This is qdisk.
>>>    1   M   1524   2009-02-20 22:27:32  blade1
>>>    2   M   1552   2009-02-24 04:39:24  blade2
>>>    3   M   1500   2009-02-19 16:11:03  blade3
>>>    4   M   1516   2009-02-19 16:11:22  blade4
>>>
>>> You can use "service qdiskd start" to start it, or run
>>> /usr/sbin/qdiskd -Q if you don't have the init script. If you
>>> installed from an rpm on a Red Hat-type distro, the script should
>>> be there.
>>>
>>> Regards,
>>> brett
>>>       
>> I tried "service qdiskd start", but it failed:
>> [root as-2 ~]# service qdiskd start
>> Starting the Quorum Disk Daemon:                           [FAILED]
>> [root as-2 ~]# tail /var/log/messages
>> ...
>> Feb 26 09:19:40 as-2 qdiskd[14707]: <crit> Unable to match label
>> 'testing' to any device
>> Feb 26 09:19:46 as-2 clurgmgrd[4032]: <notice> Reconfiguring
>>
>> Here is my qdisk configuration, copied from "man qdisk":
>>         <quorumd interval="1" tko="10" votes="1" label="testing">
>>                 <heuristic program="ping 10.56.150.1 -c1 -t1" score="1"
>> interval="2" tko="3"/>
>>         </quorumd>
>>
>> How do I map the label to a device? Note: I do not have any shared storage.
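The label-to-device mapping is done by qdiskd itself: it scans block devices for a qdisk header written by mkqdisk and matches the embedded label. With no shared storage there is nothing for it to find, which is exactly the "Unable to match label 'testing' to any device" error above. For reference, on a cluster that does have a small shared partition, the setup would look roughly like this (the device name is hypothetical):

```shell
# Write a qdisk header carrying the label qdiskd will search for.
# Run once, from one node, against a shared partition (/dev/sdb1 is
# a placeholder -- use whatever shared device both nodes can see):
mkqdisk -c /dev/sdb1 -l testing

# Verify: list all qdisk-labeled devices visible to this node
mkqdisk -L
```

After that, label="testing" in the quorumd stanza resolves to that partition on every node.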
>>
>>  
>>>> Thanks.
>>>>          
>>>>> Mockey Chen wrote:
>>>>>              
>>>>>> ext Mockey Chen wrote:
>>>>>>
>>>>>>                  
>>>>>>> ext Kein He wrote:
>>>>>>>
>>>>>>>                      
>>>>>>>> Hi Mockey,
>>>>>>>>
>>>>>>>> Could you please attach the output from " cman_tool status " and "
>>>>>>>> cman_tool nodes -f" ?
>>>>>>>>
>>>>>>>>
>>>>>>>>                           
>>>>>>> Thanks your response.
>>>>>>>
>>>>>>> I tried to run cman_tool status on as-2, but it hung with no
>>>>>>> output; even Ctrl+C had no effect.
>>>>>>>
>>>>>>>                       
>>>>>> I manually rebooted as-1, and the problem was solved.
>>>>>>
>>>>>> Here is the output of cman_tool:
>>>>>>
>>>>>> [root as-1 ~]# cman_tool status
>>>>>> Version: 6.1.0
>>>>>> Config Version: 19
>>>>>> Cluster Name: azerothcluster
>>>>>> Cluster Id: 20148
>>>>>> Cluster Member: Yes
>>>>>> Cluster Generation: 76
>>>>>> Membership state: Cluster-Member
>>>>>> Nodes: 2
>>>>>> Expected votes: 3
>>>>>> Total votes: 2
>>>>>> Quorum: 2
>>>>>> Active subsystems: 8
>>>>>> Flags: Dirty
>>>>>> Ports Bound: 0 177
>>>>>> Node name: as-1.localdomain
>>>>>> Node ID: 1
>>>>>> Multicast addresses: 239.192.78.3
>>>>>> Node addresses: 10.56.150.3
>>>>>> [root as-1 ~]# cman_tool status -f
>>>>>> Version: 6.1.0
>>>>>> Config Version: 19
>>>>>> Cluster Name: azerothcluster
>>>>>> Cluster Id: 20148
>>>>>> Cluster Member: Yes
>>>>>> Cluster Generation: 76
>>>>>> Membership state: Cluster-Member
>>>>>> Nodes: 2
>>>>>> Expected votes: 3
>>>>>> Total votes: 2
>>>>>> Quorum: 2
>>>>>> Active subsystems: 8
>>>>>> Flags: Dirty
>>>>>> Ports Bound: 0 177
>>>>>> Node name: as-1.localdomain
>>>>>> Node ID: 1
>>>>>> Multicast addresses: 239.192.78.3
>>>>>> Node addresses: 10.56.150.3
>>>>>>
>>>>>>
>>>>>> It seems the cluster cannot fence one of the nodes. How can I solve this?
>>>>>>
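A side note on the numbers in the status output above: with expected_votes="3" and qdiskd not contributing its vote, the surviving node cannot stay quorate on its own. A quick sketch of the majority-quorum arithmetic (my reading of the standard CMAN rule, not something stated in this thread):

```python
# Majority quorum: strictly more than half of the expected votes.
expected_votes = 3            # two node votes + one qdisk vote, per cluster.conf
quorum = expected_votes // 2 + 1
print(quorum)                 # 2

# With qdiskd down, only the two node votes exist. Lose one node and
# the survivor holds 1 vote, which is below the quorum of 2 -- hence
# "quorum lost, blocking activity" / "Quorum Dissolved" in the logs.
surviving_votes = 1
print(surviving_votes >= quorum)   # False
```

So until qdiskd actually runs and contributes its vote, this configuration behaves worse than plain two-node mode: any single failure dissolves quorum.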
>>>>>>
>>>>>>                  
>>>>>>> I opened a new window and could ssh to as-2, but after logging
>>>>>>> in I could not do anything; even a simple 'ls' command hung.
>>>>>>>
>>>>>>> It seems the system stays alive but does not provide any
>>>>>>> service. Really bad.
>>>>>>>
>>>>>>> Is there any way to debug this issue?
>>>>>>>
>>>>>>>                      
>>>>>>>> Mockey Chen wrote:
>>>>>>>>
>>>>>>>>                          
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have a two-node cluster. To avoid split-brain, I use iLO as
>>>>>>>>> the fence device with an IP tiebreaker. Here is my
>>>>>>>>> /etc/cluster/cluster.conf:
>>>>>>>>> <?xml version="1.0"?>
>>>>>>>>> <cluster alias="azerothcluster" config_version="19"
>>>>>>>>> name="azerothcluster">
>>>>>>>>>     <cman expected_votes="3" two_node="0"/>
>>>>>>>>>     <clusternodes>
>>>>>>>>>         <clusternode name="as-1.localdomain" nodeid="1"
>>>>>>>>> votes="1">
>>>>>>>>>             <fence>
>>>>>>>>>                 <method name="1">
>>>>>>>>>                     <device name="ilo1"/>
>>>>>>>>>                 </method>
>>>>>>>>>             </fence>
>>>>>>>>>         </clusternode>
>>>>>>>>>         <clusternode name="as-2.localdomain" nodeid="2"
>>>>>>>>> votes="1">
>>>>>>>>>             <fence>
>>>>>>>>>                 <method name="1">
>>>>>>>>>                     <device name="ilo2"/>
>>>>>>>>>                 </method>
>>>>>>>>>             </fence>
>>>>>>>>>         </clusternode>
>>>>>>>>>     </clusternodes>
>>>>>>>>>         <quorumd interval="1" tko="10" votes="1"
>>>>>>>>> label="pingtest">
>>>>>>>>>                 <heuristic program="ping 10.56.150.1 -c1 -t1"
>>>>>>>>> score="1"
>>>>>>>>> interval="2" tko="3"/>
>>>>>>>>>         </quorumd>
>>>>>>>>>     <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>>>>>>>>     <fencedevices>
>>>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.18"
>>>>>>>>> login="power" name="ilo1" passwd="pass"/>
>>>>>>>>>         <fencedevice agent="fence_ilo" hostname="10.56.154.19"
>>>>>>>>> login="power" name="ilo2" passwd="pass"/>
>>>>>>>>>     </fencedevices>
>>>>>>>>> ...
>>>>>>>>> ...
>>>>>>>>>
>>>>>>>>> To test the case where one node loses heartbeat, I disabled
>>>>>>>>> the ethernet card (eth0) on as-1, expecting as-2 to take over
>>>>>>>>> as-1's services and as-1 to reboot.
>>>>>>>>> What actually happened: as-1 lost its connection to as-2; as-2
>>>>>>>>> detected this and tried to re-form the cluster, but failed.
>>>>>>>>> Here is the syslog from as-2:
>>>>>>>>>
>>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] The token was lost
>>>>>>>>> in the
>>>>>>>>> OPERATIONAL state.
>>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Receive multicast
>>>>>>>>> socket
>>>>>>>>> recv buffer size (288000 bytes).
>>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] Transmit multicast
>>>>>>>>> socket
>>>>>>>>> send buffer size (262142 bytes).
>>>>>>>>> Feb 24 21:25:35 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>>>> from 2.
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering GATHER state
>>>>>>>>> from 0.
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Creating commit token
>>>>>>>>> because I am the rep.
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Saving state aru
>>>>>>>>> 1f4 high
>>>>>>>>> seq received 1f4
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Storing new sequence
>>>>>>>>> id for
>>>>>>>>> ring 2c
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering COMMIT
>>>>>>>>> state.
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering RECOVERY
>>>>>>>>> state.
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] position [0] member
>>>>>>>>> 10.56.150.4:
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] previous ring seq
>>>>>>>>> 40 rep
>>>>>>>>> 10.56.150.3
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] aru 1f4 high
>>>>>>>>> delivered
>>>>>>>>> 1f4
>>>>>>>>> received flag 1
>>>>>>>>>
>>>>>>>>> Message from syslogd@ at Tue Feb 24 21:25:40 2009 ...
>>>>>>>>> as-2 clurgmgrd[4194]: <emerg> #1: Quorum Dissolved
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Did not need to
>>>>>>>>> originate any messages in recovery.
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] Sending initial
>>>>>>>>> ORF token
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION
>>>>>>>>> CHANGE
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>>>> Feb 24 21:25:40 as-2 clurgmgrd[4194]: <emerg> #1: Quorum
>>>>>>>>> Dissolved
>>>>>>>>> Feb 24 21:25:40 as-2 kernel: dlm: closing connection to node 1
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0)
>>>>>>>>> ip(10.56.150.4)
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0)
>>>>>>>>> ip(10.56.150.3)
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CMAN ] quorum lost, blocking
>>>>>>>>> activity
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] CLM CONFIGURATION
>>>>>>>>> CHANGE
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] New Configuration:
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ]     r(0)
>>>>>>>>> ip(10.56.150.4)
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Left:
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] Members Joined:
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [SYNC ] This node is
>>>>>>>>> within the
>>>>>>>>> primary component and will provide service.
>>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Cluster is not quorate. 
>>>>>>>>> Refusing
>>>>>>>>> connection.
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [TOTEM] entering OPERATIONAL
>>>>>>>>> state.
>>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing connect:
>>>>>>>>> Connection refused
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CLM  ] got nodejoin message
>>>>>>>>> 10.56.150.4
>>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified
>>>>>>>>> (-111).
>>>>>>>>> Feb 24 21:25:40 as-2 openais[4139]: [CPG  ] got joinlist
>>>>>>>>> message from
>>>>>>>>> node 2
>>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Someone may be attempting
>>>>>>>>> something
>>>>>>>>> evil.
>>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Error while processing get:
>>>>>>>>> Invalid
>>>>>>>>> request descriptor
>>>>>>>>> Feb 24 21:25:40 as-2 ccsd[4130]: Invalid descriptor specified
>>>>>>>>> (-111).
>>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting
>>>>>>>>> something
>>>>>>>>> evil.
>>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing get:
>>>>>>>>> Invalid
>>>>>>>>> request descriptor
>>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Invalid descriptor specified
>>>>>>>>> (-21).
>>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Someone may be attempting
>>>>>>>>> something
>>>>>>>>> evil.
>>>>>>>>> Feb 24 21:25:41 as-2 ccsd[4130]: Error while processing
>>>>>>>>> disconnect:
>>>>>>>>> Invalid request descriptor
>>>>>>>>> Feb 24 21:25:41 as-2 avahi-daemon[3756]: Withdrawing address
>>>>>>>>> record for
>>>>>>>>> 10.56.150.144 on eth0.
>>>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: setsockopt
>>>>>>>>> (IP_ADD_MEMBERSHIP):
>>>>>>>>> Address already in use
>>>>>>>>> Feb 24 21:25:41 as-2 in.rdiscd[8641]: Failed joining addresse
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> I also found some errors in as-1's syslog:
>>>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd[4332]: <err> #52: Failed
>>>>>>>>> changing RG
>>>>>>>>> status
>>>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> Link for
>>>>>>>>> eth0: Not
>>>>>>>>> detected
>>>>>>>>> Feb 25 11:27:09 as-1 clurgmgrd: [4332]: <warning> No link on
>>>>>>>>> eth0...
>>>>>>>>> ...
>>>>>>>>> Feb 25 11:27:36 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>>>> infrastructure after 30 seconds.
>>>>>>>>> ...
>>>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>>>> infrastructure after 60 seconds.
>>>>>>>>> ...
>>>>>>>>> Feb 25 11:28:06 as-1 ccsd[4268]: Unable to connect to cluster
>>>>>>>>> infrastructure after 90 seconds.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Any comments are appreciated!
>>>>>>>>>
>>>>>>>>> -- 
>>>>>>>>> Linux-cluster mailing list
>>>>>>>>> Linux-cluster redhat com
>>>>>>>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>>>>>>>

