
RE: [Linux-cluster] error messages explained

When you say the node needs to rejoin with no services running, which
services do I need to start in order to do this manual join? Just cman? If
a node crashes and can't rejoin, do I have to hurry (before it's fenced
again) and disable the auto start (chkconfig) of the following services:
rgmanager, gfs, clvmd, and cman? Then reboot that node again? Then start
cman and try to rejoin with just cman_tool?

The question is, if a server isn't part of a cluster anymore (i.e., it was
rebooted), the cluster obviously recognizes that disconnect, and since the
node was rebooted, it shouldn't even think it's part of a cluster. So why
in the world does anything think it is?

All these manual changes after a simple node reboot or fencing just don't
seem like a good design plan. I don't consider myself even moderately
knowledgeable in this arena; I am just looking at this from a design
perspective.

-----Original Message-----
From: linux-cluster-bounces redhat com
[mailto:linux-cluster-bounces redhat com] On Behalf Of Christine Caulfield
Sent: Friday, October 03, 2008 2:39 AM
To: linux clustering
Subject: Re: [Linux-cluster] error messages explained

Mark Chaney wrote:
> Can someone explain these errors to me and tell me how I should attempt to
> resolve them? They aren't both happening at exactly the same time; they are
> two errors that I don't truly understand.
> ####################
> ccsd[3192]: Attempt to close an unopened CCS descriptor (13590).
> ccsd[3192]: Error while processing disconnect:
> Invalid request descriptor
> ##################
> openais[5453]: [MAIN ] Killing node ratchet.local because it has rejoined
> the cluster with existing state

I need to add this to the FAQ!

What this message means is that a node was a valid member of the cluster
once; it then left the cluster (without being fenced) and rejoined
automatically. This can sometimes happen if the ethernet is disconnected
for a time, usually a few seconds.

If a node leaves the cluster, it MUST rejoin using the cman_tool join
command with no services running. The usual way to make this happen is
to reboot the node, and if fencing is configured correctly that is what
normally happens. It could be that fencing is too slow to manage this or
that the cluster is made up of two nodes without a quorum disk so that
the 'other' node doesn't have quorum and cannot initiate fencing.
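
To illustrate the two-node case, here is a minimal sketch of the quorum
arithmetic. It assumes one vote per node, that quorum needs a strict
majority of the expected votes (expected_votes // 2 + 1), and that no
special two-node mode is configured. The function names and the one-vote
quorum disk are hypothetical, chosen for illustration; they are not part
of cman.

```python
def quorum_threshold(expected_votes: int) -> int:
    """Votes needed for quorum: a strict majority of the expected votes."""
    return expected_votes // 2 + 1


def is_quorate(votes_present: int, expected_votes: int) -> bool:
    """True if the votes still present reach the quorum threshold."""
    return votes_present >= quorum_threshold(expected_votes)


# Two nodes, one vote each, no quorum disk (assumed setup):
# the surviving node holds 1 of 2 expected votes.
print(is_quorate(votes_present=1, expected_votes=2))  # False

# Same two nodes plus a hypothetical one-vote quorum disk:
# the surviving node and the disk together hold 2 of 3 expected votes.
print(is_quorate(votes_present=2, expected_votes=3))  # True
```

Under these assumptions the surviving node in a plain two-node cluster
is not quorate, which is why it cannot fence the node that dropped out.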

Another (more common) cause of this is slow response from some Cisco
switches, as documented here:



