
Re: [Linux-cluster] Nodes leaving and re-joining intermittently



The switch was our first thought, but it has since been swapped. We are no longer having nodes fenced (we were, daily), but this anomaly remains.

I will ask for those logs and conf on Monday.

I think it might be worth reinstalling corosync on this box anyway? It can't be healthy if it is exiting uncleanly. I have also had reports of rgmanager dying on this box (pid file present, but the process not running). Could that be related?

Thanks :)
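
P.S. For reference, the length of the gap can be checked directly from the two traps quoted below. This is only a minimal Python sketch: it assumes the trap timestamps and sysUpTimeInstance values are copied by hand from the quoted output (with the "GMT" token dropped), and the helper names are made up for illustration.

```python
from datetime import datetime

# Values copied by hand from the two traps quoted in this thread
# (timezone token dropped; both traps are reported in GMT).
LEFT_TIME = "Sat Dec 10 15:22:00 2011"
JOINED_TIME = "Sat Dec 10 15:30:25 2011"
LEFT_UPTIME = "3:2:52:23.35"
JOINED_UPTIME = "3:3:00:48.75"


def wall_clock_gap(left: str, joined: str) -> float:
    """Seconds between two trap timestamps (hypothetical helper)."""
    fmt = "%a %b %d %H:%M:%S %Y"
    delta = datetime.strptime(joined, fmt) - datetime.strptime(left, fmt)
    return delta.total_seconds()


def uptime_to_seconds(value: str) -> float:
    """Convert a sysUpTimeInstance string 'D:H:MM:SS.ss' to seconds."""
    days, hours, minutes, seconds = value.split(":")
    return int(days) * 86400 + int(hours) * 3600 + int(minutes) * 60 + float(seconds)


wall_gap = wall_clock_gap(LEFT_TIME, JOINED_TIME)
uptime_gap = uptime_to_seconds(JOINED_UPTIME) - uptime_to_seconds(LEFT_UPTIME)

print(wall_gap, round(uptime_gap, 2))  # prints: 505.0 505.4
```

Both figures come out at roughly 505 seconds (8 minutes 25 seconds), so the wall-clock timestamps and the reporting node's uptime counter agree on the length of the gap.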

On Saturday, December 10, 2011, Digimer <linux alteeve com> wrote:
> On 12/10/2011 03:32 PM, Matthew Painter wrote:
>> Hi all,
>>
>> We are trying to get to the bottom of some odd intermittent behavior on
>> a cluster. We are intermittently seeing nodes leave and rejoin the cluster,
>> without being fenced. Further, the gap between leaving and re-joining is 8
>> minutes. We are monitoring the latency between boxes, and it is
>> acceptable (<5ms).
>>
>> How can nodes exhibit this behavior? There seems to be no impact on the
>> services running on the box, just this leaving and re-joining. The SNMP
>> messages are below.
>>
>> All help decoding this gratefully received! :)
>>
>> Thanks,
>>
>> Matt
>>
>>
>> Sat Dec 10 15:22:00 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:2:52:23.35,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"
>>
>> Sat Dec 10 15:30:25 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:3:00:48.75,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "joined"
>
> My first instinct is to point to multicast issues in your switch, but
> then, I'd expect the node to get fenced. That said, any unexpected
> disconnect should fire a fence, so it would seem like the node is
> cleanly stopping/restarting corosync.
>
> Can you share your configuration and, ideally, anything in syslog from
> all involved nodes starting from just before the disconnect and
> continuing through to after the node rejoins?
>
> --
> Digimer
> E-Mail:              digimer alteeve com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron
>