[Linux-cluster] Help needed

Fri Jun 1 21:04:49 UTC 2012

shr289.cup.hp.com resolves to 16.89.116.32
shr295.cup.hp.com resolves to 16.89.112.182
I would assume that our switches support multicast, since we have another RH6.2 cluster that runs OK on the same switch.
I'll also put the fencing back into cluster.conf and try again.
Thanks
Ming
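
A quick way to sanity-check the two addresses above is to confirm they sit in the same subnet. The sketch below assumes the 255.255.248.0 netmask from the eth1 output quoted further down applies to both nodes; the names NODE_ADDRESSES and network_of are made up for illustration. Being on the same subnet only rules out a routed hop between the nodes. It does not prove the switch handles multicast groups correctly.

```python
import ipaddress

# Addresses as reported in this thread. The netmask comes from the eth1
# output of shr295 and is ASSUMED to be the same on shr289.
NODE_ADDRESSES = {
    "shr289.cup.hp.com": "16.89.116.32",
    "shr295.cup.hp.com": "16.89.112.182",
}
NETMASK = "255.255.248.0"


def network_of(address, netmask):
    """Return the IPv4 network that contains `address` under `netmask`."""
    return ipaddress.ip_interface(f"{address}/{netmask}").network


# Map each node name to the network its address belongs to.
networks = {name: network_of(addr, NETMASK) for name, addr in NODE_ADDRESSES.items()}

# Both nodes share a subnet when only one distinct network shows up.
same_subnet = len(set(networks.values())) == 1

for name, net in networks.items():
    print(f"{name}: {net}")
print("same subnet:", same_subnet)
```

If the two networks differed, cluster traffic would have to cross a router, which is a separate problem from switch multicast support.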

-----Original Message-----
From: Digimer [mailto:lists at alteeve.ca]
Sent: Friday, June 01, 2012 11:44 AM
To: Chen, Ming Ming
Cc: linux clustering
Subject: Re: [Linux-cluster] Help needed

What do 'shr289.cup.hp.com' and 'shr295.cup.hp.com' resolve to? Does
your switch support multicast properly? If the switch periodically tears
down a multicast group, your cluster will partition.

You *must* have fencing configured. Fencing using iLO works fine, please
use it. See
https://alteeve.com/w/2-Node_Red_Hat_KVM_Cluster_Tutorial#Example_.3Cfencedevice....3E_Tag_For_HP_iLO
Without fencing, your cluster will be unstable.

Digimer
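
To make the fencing point concrete, here is a minimal sketch that reports which nodes in a cluster.conf have no fence device assigned. It embeds the configuration posted in this thread so it runs on its own; on a cman node the file normally lives at /etc/cluster/cluster.conf. The function names are hypothetical, and this only checks the file's structure. It does not test that a fence agent actually works.

```python
import xml.etree.ElementTree as ET
from typing import List

# The cluster.conf from this thread, inlined so the sketch is self-contained.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster config_version="2" name="vmcluster">
      <logging debug="on"/>
      <cman expected_votes="1" two_node="1"/>
      <clusternodes>
            <clusternode name="shr289.cup.hp.com" nodeid="1">
                  <fence>
                  </fence>
            </clusternode>
            <clusternode name="shr295.cup.hp.com" nodeid="2">
                  <fence>
                  </fence>
            </clusternode>
      </clusternodes>
      <fencedevices>
      </fencedevices>
      <rm>
      </rm>
</cluster>
"""


def nodes_without_fencing(conf_xml: str) -> List[str]:
    """Names of cluster nodes with no <device> under <fence>/<method>."""
    root = ET.fromstring(conf_xml)
    unfenced = []
    for node in root.findall("./clusternodes/clusternode"):
        if not node.findall("./fence/method/device"):
            unfenced.append(node.get("name"))
    return unfenced


def defined_fence_devices(conf_xml: str) -> List[str]:
    """Names of all <fencedevice> entries declared in the file."""
    root = ET.fromstring(conf_xml)
    return [dev.get("name") for dev in root.findall("./fencedevices/fencedevice")]


unfenced_nodes = nodes_without_fencing(CLUSTER_CONF)
fence_devices = defined_fence_devices(CLUSTER_CONF)
print("nodes without fencing:", unfenced_nodes)
print("fence devices defined:", fence_devices)
```

For the configuration in this thread, both nodes are reported as unfenced and no fence devices are defined, which matches the empty <fence> and <fencedevices> blocks.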

On 06/01/2012 01:53 PM, Chen, Ming Ming wrote:
> Thanks for returning my email. The cluster configuration file and network configuration are below. One piece of bad news: the original issue has come back.
> So I've seen two problems, and both occur sporadically:
> Thanks again for your help.
> Regards
> Ming
>
> 1. The original one. I increased the version number and it went away for a while, but it has come back. Do you know why?
>
>    May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>
> 2. [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>
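
The "Node 1 conflict" line in the first log excerpt above already carries the two version numbers that matter: the running cluster is at version 4 while the local file is at version 2. A minimal sketch for pulling those numbers out of such a line follows. The regular expression and the function name parse_conflict are illustrative assumptions based only on the single log line quoted in this thread, not on any documented log format.

```python
import re

# The conflict line from the log in this thread, with the wrapped text rejoined.
LOG_LINE = (
    "May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] "
    "Node 1 conflict, remote config version id=4, local=2"
)

# ASSUMPTION: pattern derived from the one example line above.
CONFLICT_RE = re.compile(
    r"Node (?P<node>\d+) conflict, remote config\s+version "
    r"id=(?P<remote>\d+), local=(?P<local>\d+)"
)


def parse_conflict(line):
    """Return node id, remote version and local version as ints, or None."""
    match = CONFLICT_RE.search(line)
    if match is None:
        return None
    return {key: int(value) for key, value in match.groupdict().items()}


conflict = parse_conflict(LOG_LINE)

# The new config_version has to be newer than whatever is already running,
# so take the larger of the two numbers and add one.
next_version = max(conflict["remote"], conflict["local"]) + 1

print(conflict)
print("next config_version should be at least", next_version)
```

Read this way, raising the local file from 2 to 3 would still be rejected, because the running configuration is already at 4.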
> Cluster configuration File:
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
>
> I had a fencing configuration in there, but I'd like to bring up a simple cluster first and then add the fencing back.
>
> The network configuration:
> eth1      Link encap:Ethernet  HWaddr 00:23:7D:36:05:20
>           inet addr:16.89.112.182  Bcast:16.89.119.255  Mask:255.255.248.0
>           inet6 addr: fe80::223:7dff:fe36:520/64 Scope:Link
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:1210316 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:73158 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:1000
>           RX bytes:150775766 (143.7 MiB)  TX bytes:11749950 (11.2 MiB)
>           Interrupt:16 Memory:f6000000-f6012800
>
> lo        Link encap:Local Loopback
>           inet addr:127.0.0.1  Mask:255.0.0.0
>           inet6 addr: ::1/128 Scope:Host
>           UP LOOPBACK RUNNING  MTU:16436  Metric:1
>           RX packets:291 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:291 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:38225 (37.3 KiB)  TX bytes:38225 (37.3 KiB)
>
> virbr0    Link encap:Ethernet  HWaddr 52:54:00:30:33:BD
>           inet addr:192.168.122.1  Bcast:192.168.122.255  Mask:255.255.255.0
>           UP BROADCAST RUNNING MULTICAST  MTU:1500  Metric:1
>           RX packets:0 errors:0 dropped:0 overruns:0 frame:0
>           TX packets:488 errors:0 dropped:0 overruns:0 carrier:0
>           collisions:0 txqueuelen:0
>           RX bytes:0 (0.0 b)  TX bytes:25273 (24.6 KiB)
>
>
> -----Original Message-----
> From: Digimer [mailto:lists at alteeve.ca]
> Sent: Thursday, May 31, 2012 7:05 PM
> To: Chen, Ming Ming
> Cc: linux clustering
> Subject: Re: [Linux-cluster] Help needed
>
> Please send your cluster.conf, editing out only the passwords. Please also
> include your network configs.
>
> On 05/31/2012 08:12 PM, Chen, Ming Ming wrote:
>> Hi Digimer,
>> Thanks for your comment. I've got rid of the first problem, and now I have the following messages. Any idea?
>> Thanks in advance.
>> Ming
>>
>> [root@shr295 ~]# tail -f /var/log/messages
>> May 31 16:56:01 shr295 dlm_controld[3375]: dlm_controld 3.0.12.1 started
>> May 31 16:56:11 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:12 shr295 gfs_controld[3447]: gfs_controld 3.0.12.1 started
>> May 31 16:56:12 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:21 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:22 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:31 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:32 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>> May 31 16:56:41 shr295 fenced[3353]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 dlm_controld[3375]: daemon cpg_join error retrying
>> May 31 16:56:42 shr295 gfs_controld[3447]: daemon cpg_join error retrying
>>
>> -----Original Message-----
>> From: Digimer [mailto:lists at alteeve.ca]
>> Sent: Thursday, May 31, 2012 10:13 AM
>> To: Chen, Ming Ming
>> Cc: linux clustering
>> Subject: Re: [Linux-cluster] Help needed
>>
>> On 05/31/2012 12:27 PM, Chen, Ming Ming wrote:
>>>  Hi, I have the following simple cluster config just to try out on CentOS 6.2
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="2" name="vmcluster">
>>>       <logging debug="on"/>
>>>       <cman expected_votes="1" two_node="1"/>
>>>       <clusternodes>
>>>             <clusternode name="shr289.cup.hp.com" nodeid="1">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>             <clusternode name="shr295.cup.hp.com" nodeid="2">
>>>                   <fence>
>>>                   </fence>
>>>             </clusternode>
>>>       </clusternodes>
>>>       <fencedevices>
>>>       </fencedevices>
>>>       <rm>
>>>       </rm>
>>> </cluster>
>>>
>>>
>>> I got the following error messages when I ran "service cman start". I got the same messages on both nodes.
>>> Any help will be appreciated.
>>>
>>> May 31 09:08:04 corosync [TOTEM ] RRP multicast threshold (100 problem count)
>>> May 31 09:08:05 shr295 corosync[3542]:   [MAIN  ] Completed service synchronization, ready to provide service.
>>> May 31 09:08:05 shr295 corosync[3542]:   [TOTEM ] A processor joined or left the membership and a new membership was formed.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Error reloading the configuration, will retry every second
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Node 1 conflict, remote config version id=4, local=2
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Unable to load new config in corosync: New configuration version has to be newer than current running configuration
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Can't get updated config version 4: New configuration version has to be newer than current running configuration#012.
>>> May 31 09:08:05 shr295 corosync[3542]:   [CMAN  ] Activity suspended on this node
>>>
>>
>> Run 'cman_tool version' to get the current version of the configuration,
>> then increase the config_version="x" to be one higher.
>>
>> Also, configure fencing! If you don't, your cluster will hang the first
>> time anything goes wrong.
>>
>> --
>> Digimer
>> Papers and Projects: https://alteeve.com
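
The advice above (find the running version, then set config_version one higher) can be sketched as a small helper. This is only an illustration: bump_config_version is a made-up name, the running version is passed in by hand after reading it from 'cman_tool version', and the inline XML is a trimmed stand-in for the real cluster.conf.

```python
import xml.etree.ElementTree as ET

# Trimmed stand-in for the cluster.conf posted in this thread.
CONF = (
    '<cluster config_version="2" name="vmcluster">'
    '<cman expected_votes="1" two_node="1"/>'
    "</cluster>"
)


def bump_config_version(conf_xml, running_version):
    """Return conf_xml with config_version set above both known versions.

    running_version is the version the live cluster reports; it is supplied
    by the caller and is not queried by this sketch.
    """
    root = ET.fromstring(conf_xml)
    file_version = int(root.get("config_version"))
    new_version = max(file_version, running_version) + 1
    root.set("config_version", str(new_version))
    return ET.tostring(root, encoding="unicode")


# The log in this thread shows the running cluster at version 4.
updated = bump_config_version(CONF, running_version=4)
new_version = ET.fromstring(updated).get("config_version")
print("new config_version:", new_version)
```

The updated file still has to end up identical on both nodes, otherwise the same version conflict will show up again.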