[Linux-cluster] Nodes leaving and re-joining intermittently (Matthew Painter)

Geovanis, Nicholas Nicholas.Geovanis at uscellular.com
Sun Dec 11 17:24:08 UTC 2011


>> We are trying to get to the bottom of some odd intermittent behavior
>> on a cluster. We are intermittently seeing nodes leave and rejoin
>> clusters, without being fenced. Further, the gap between leaving and
>> re-joining is 8 minutes. We are monitoring the latency between boxes,
>> and it is acceptable (<5ms).

From my recent experience, the first thing I would check is the
multicast config and behavior. I've deployed a couple dozen 2-3-node
clusters (with GFS2) in three different data centers with three
seriously different network configurations. Multicast is always an
issue. RH Knowledgebase article
https://access.redhat.com/kb/docs/DOC-39175 has a Python script,
multicast.py, which exercises multicast from both the client and server
ends. It has come in very handy. Yours sounds like an intermittent
problem, in which case I might alter the script to reduce traffic a
little but run it longer-term as a diagnostic. If you're on RHEL 6.1,
there is an "omping" package in the channel/distro which serves the same
purpose; there's some info in the article on its use too. HTH......Nick G
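
To illustrate the kind of low-traffic, long-running multicast check
described above, here is a minimal sketch. It is not the multicast.py
script from the Knowledgebase article; the group address, port, and all
function names are assumptions made up for this example, so substitute
the multicast address your cluster actually uses and pick a port that
the cluster software is not already bound to.

```python
import socket
import struct
import time

# Assumed values for illustration only; not taken from the original post.
GROUP = "239.192.0.1"   # hypothetical test group
PORT = 5007             # hypothetical test port, not the cluster's own

PROBE = struct.Struct("!Id")  # sequence number, send timestamp


def make_probe(seq):
    """Pack a sequence number and the current time into a probe packet."""
    return PROBE.pack(seq, time.time())


def parse_probe(data):
    """Return (seq, sent_at) from a probe packet."""
    return PROBE.unpack(data[:PROBE.size])


def missing_seqs(seen):
    """Return sequence numbers absent between min(seen) and max(seen)."""
    if not seen:
        return []
    present = set(seen)
    return [s for s in range(min(present), max(present) + 1)
            if s not in present]


def send_probes(count, interval=1.0, ttl=1):
    """Send `count` probes to the group, one every `interval` seconds."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_MULTICAST_TTL, ttl)
    try:
        for seq in range(count):
            sock.sendto(make_probe(seq), (GROUP, PORT))
            time.sleep(interval)
    finally:
        sock.close()


def receive_probes(duration, quiet_after=5.0):
    """Listen for `duration` seconds, log silent periods, return lost seqs."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM, socket.IPPROTO_UDP)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    sock.bind(("", PORT))
    # Join the group on the default interface.
    mreq = struct.pack("4s4s", socket.inet_aton(GROUP),
                       socket.inet_aton("0.0.0.0"))
    sock.setsockopt(socket.IPPROTO_IP, socket.IP_ADD_MEMBERSHIP, mreq)
    sock.settimeout(1.0)
    seen = []
    last = time.time()
    deadline = last + duration
    try:
        while time.time() < deadline:
            try:
                data, _ = sock.recvfrom(64)
            except socket.timeout:
                continue
            if len(data) < PROBE.size:
                continue  # not one of our probes
            now = time.time()
            if now - last > quiet_after:
                print("no multicast for %.1f s before %s"
                      % (now - last, time.ctime(now)))
            last = now
            seen.append(parse_probe(data)[0])
    finally:
        sock.close()
    return missing_seqs(seen)
```

One way to use it would be to run receive_probes(3600) on one node and
send_probes(3600) on another, then compare the timestamps of any logged
silent periods with the times the nodes left and rejoined. One probe per
second keeps the traffic low enough to leave running for hours.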


Nick Geovanis
US Cellular/Kforce Inc
e. Nicholas.Geovanis at uscellular.com


Message: 1
Date: Sat, 10 Dec 2011 20:32:05 +0000
From: Matthew Painter <matthew.painter at kusiri.com>
To: linux clustering <linux-cluster at redhat.com>
Subject: [Linux-cluster] Nodes leaving and re-joining intermittently
Message-ID: <CALj8VcxxvOV_PTT9QZKJYnPuvhjBgoxNETBWxB4uCWCRhkzhSA at mail.gmail.com>
Content-Type: text/plain; charset="utf-8"

Hi all,

We are trying to get to the bottom of some odd intermittent behavior on
a cluster. We are intermittently seeing nodes leave and rejoin clusters,
without being fenced. Further, the gap between leaving and re-joining is
8 minutes. We are monitoring the latency between boxes, and it is
acceptable (<5ms).

How can nodes exhibit this behavior? There seems to be no impact on the
services running on the box, just this leaving and re-joining. The SNMP
messages are below.

All help decoding this gratefully received! :)

Thanks,

Matt