[Linux-cluster] Nodes leaving and re-joining intermittently

Sat Dec 10 22:00:12 UTC 2011

The switch was our first thought, but it has been swapped, and while nodes
are no longer being fenced (they were, daily), this anomaly remains.

I will ask for those logs and conf on Monday.

I think it might be worth reinstalling corosync on this box anyway? Can't
be healthy if it is exiting unclearly. I have had reports of rgmanager
dying on this box (pid file present, but process not running). Could that be related?
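
For the rgmanager symptom (pid file left behind, no process), here is a
minimal sketch of a stale-pid-file check. The helper names are hypothetical
and the pid file path in the usage note is an assumption, not something
confirmed on these nodes.

```python
import os


def pid_is_running(pid: int) -> bool:
    """Return True if a process with this PID currently exists."""
    try:
        os.kill(pid, 0)  # signal 0 delivers nothing; it only checks the PID
    except ProcessLookupError:
        return False
    except PermissionError:
        return True  # the process exists but belongs to another user
    return True


def check_pid_file(path: str) -> str:
    """Classify a pid file as 'missing', 'invalid', 'running' or 'stale'."""
    try:
        with open(path) as handle:
            text = handle.read().strip()
    except FileNotFoundError:
        return "missing"
    try:
        pid = int(text)
    except ValueError:
        return "invalid"
    if pid <= 0:
        return "invalid"
    return "running" if pid_is_running(pid) else "stale"
```

Calling check_pid_file("/var/run/rgmanager.pid") (assumed path) from cron and
logging every "stale" result would give a timestamp to line up against the
corosync leave/join events.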

Thanks :)

On Saturday, December 10, 2011, Digimer <linux at alteeve.com> wrote:
> On 12/10/2011 03:32 PM, Matthew Painter wrote:
>> Hi all,
>>
>> We are trying to get to the bottom of some odd intermittent behavior on
>> a cluster. We are intermittently seeing nodes leave and rejoin clusters,
>> without being fenced. Further, the gap between leaving and re-joining is 8
>> minutes. We are monitoring the latency between boxes, and it is
>> acceptable (<5ms).
>>
>> How can nodes exhibit this behavior? There seems to be no impact on the
>> services running on the box, just this leaving and re-joining. The SNMP
>> messages are below.
>>
>> All help decoding this gratefully received! :)
>>
>> Thanks,
>>
>> Matt
>>
>>
>> Sat Dec 10 15:22:00 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:2:52:23.35,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "left"
>>
>> Sat Dec 10 15:30:25 GMT 2011: cluster3.localdomain
>> DISMAN-EVENT-MIB::sysUpTimeInstance = 3:3:00:48.75,
>> SNMPv2-MIB::snmpTrapOID.0 = COROSYNC-MIB::corosyncNoticesNodeStatus,
>> COROSYNC-MIB::corosyncObjectsNodeName.0 = "cluster1.localdomain",
>> COROSYNC-MIB::corosyncObjectsNodeID.0 = 1,
>> COROSYNC-MIB::corosyncObjectsNodeAddress.0 = "10.79.202.1",
>> COROSYNC-MIB::corosyncObjectsNodeStatus.0 = "joined"
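
The gap between the two traps above can be checked with a short sketch. It
assumes the leading date is in the format shown and that sysUpTimeInstance
reads as days:hours:minutes:seconds; the function names are hypothetical.

```python
from datetime import datetime, timedelta

# Layout of the date that prefixes each trap, e.g. "Sat Dec 10 15:22:00 GMT 2011"
TRAP_TIME_FORMAT = "%a %b %d %H:%M:%S %Z %Y"


def wall_clock_gap(left_stamp: str, joined_stamp: str) -> timedelta:
    """Time between the 'left' and 'joined' traps, from their date prefixes."""
    left = datetime.strptime(left_stamp, TRAP_TIME_FORMAT)
    joined = datetime.strptime(joined_stamp, TRAP_TIME_FORMAT)
    return joined - left


def parse_uptime(value: str) -> timedelta:
    """Parse a sysUpTimeInstance value, assumed to be days:hours:minutes:seconds."""
    days, hours, minutes, seconds = value.split(":")
    return timedelta(days=int(days), hours=int(hours),
                     minutes=int(minutes), seconds=float(seconds))


def uptime_gap(left_uptime: str, joined_uptime: str) -> timedelta:
    """Same gap, measured on the sending node's own uptime counter."""
    return parse_uptime(joined_uptime) - parse_uptime(left_uptime)


print(wall_clock_gap("Sat Dec 10 15:22:00 GMT 2011",
                     "Sat Dec 10 15:30:25 GMT 2011"))   # 0:08:25
print(uptime_gap("3:2:52:23.35", "3:3:00:48.75"))        # 0:08:25.400000
```

Both measurements come out at roughly 8 minutes 25 seconds. Running the same
calculation over further leave/join pairs would show whether the interval is
constant.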
>
> My first instinct is to point to multicast issues in your switch, but
> then, I'd expect the node to get fenced. That said, any unexpected
> disconnect should fire a fence, so it would seem like the node is
> cleanly stopping/restarting corosync.
>
> Can you share your configuration and, ideally, anything in syslog from
> all involved nodes starting from just before the disconnect and
> continuing through to after the node rejoins?
>
> --
> Digimer
> E-Mail:              digimer at alteeve.com
> Freenode handle:     digimer
> Papers and Projects: http://alteeve.com
> Node Assassin:       http://nodeassassin.org
> "omg my singularity battery is dead again.
> stupid hawking radiation." - epitron
>