[Linux-cluster] error messages explained

Mark Chaney macscr at macscr.com
Sat Oct 4 17:38:13 UTC 2008


Unfortunately, simply rebooting has never resolved those errors. =/ I am
getting these errors after a server is fenced and rebooted. Then it's
fenced again, with the same errors. I basically have to shut down the
entire cluster manually, reboot with all init scripts off, then manually
start all cluster services and add the services back to chkconfig. This is
basically the process I have to go through 95% of the time when a single
server is fenced. =/
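
For anyone following along, the manual recovery described above can be
sketched roughly as the shell script below. It assumes a RHEL 5-style node
where the four services named in this thread (cman, clvmd, gfs, rgmanager)
are managed with chkconfig and service. The start order, the DRY_RUN switch,
and the helper function names are assumptions made for illustration, not
something confirmed in this thread. By default the script only prints the
commands it would run.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set.
# Service names come from the thread; the ordering is an assumption.
DRY_RUN="${DRY_RUN:-1}"
SERVICES="cman clvmd gfs rgmanager"   # assumed start order

# Hypothetical helper: echo the command in dry-run mode, otherwise run it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Step 1: keep the cluster services from starting at the next boot.
disable_autostart() {
    for svc in $SERVICES; do
        run chkconfig "$svc" off
    done
}

# Step 2 (after the reboot): start each service by hand, then re-enable it.
start_and_reenable() {
    for svc in $SERVICES; do
        run service "$svc" start
        run chkconfig "$svc" on
    done
}

disable_autostart
# ... reboot the node here ...
start_and_reenable
```

Setting DRY_RUN=0 would execute the commands for real, so it is worth
reading the printed output first.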

-----Original Message-----
From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield
Sent: Saturday, October 04, 2008 9:29 AM
To: linux clustering
Subject: Re: [Linux-cluster] error messages explained

Mark Chaney wrote:
> When you say I need to join with the services running, what services do
> I need to start in order to do this manual join? Just cman? If a node
> crashes and can't rejoin, I have to hurry up (before it's fenced again)
> and disable the auto start (chkconfig) of the following services:
> rgmanager, gfs, clvmd, and cman. Then reboot that node again? Then start
> cman and try to rejoin with just the cman_tool?
> 
> The question is, if a server isn't part of a cluster anymore (i.e., it
> was rebooted), the cluster obviously recognizes that disconnect, and
> since the node was rebooted, it shouldn't even think it's part of a
> cluster. So why in the world does anything think it is?
> 
> All these manual changes after a simple node reboot or fencing just
> don't seem like a good design plan. I don't consider myself even
> moderately knowledgeable in this arena, I am just looking at this from a
> design perspective.
> 

I think you have misunderstood me. The point is that if a node leaves
the cluster it really should be rebooted and join the cluster cleanly
that way. There is no manual involvement at all. That's what the init
scripts are for and why they are run at startup.

> -----Original Message-----
> From: linux-cluster-bounces at redhat.com
> [mailto:linux-cluster-bounces at redhat.com] On Behalf Of Christine Caulfield
> Sent: Friday, October 03, 2008 2:39 AM
> To: linux clustering
> Subject: Re: [Linux-cluster] error messages explained
> 
> Mark Chaney wrote:
>> Can someone explain these errors to me and tell me how I should attempt
>> to resolve them? They aren't both happening at exactly the same time;
>> it's just two errors that I don't truly understand.
>>
>> ####################
>>
>> ccsd[3192]: Attempt to close an unopened CCS descriptor (13590).
>> ccsd[3192]: Error while processing disconnect: Invalid request descriptor
>>
>> ##################
>>
>> openais[5453]: [MAIN ] Killing node ratchet.local because it has rejoined the cluster with existing state
>>
> 
> I need to add this to the FAQ!
> 
> What this message means is that a node was a valid member of the cluster
> once; it then left the cluster (without being fenced) and rejoined
> automatically. This can sometimes happen if the ethernet is disconnected
> for a time, usually a few seconds.
> 
> If a node leaves the cluster, it MUST rejoin using the cman_tool join
> command with no services running. The usual way to make this happen is
> to reboot the node, and if fencing is configured correctly that is what
> normally happens. It could be that fencing is too slow to manage this or
> that the cluster is made up of two nodes without a quorum disk so that
> the 'other' node doesn't have quorum and cannot initiate fencing.
> 
> Another (more common) cause of this is the slow response of some Cisco
> switches, as documented here:
> 
> http://www.openais.org/doku.php?id=faq:cisco_switches
> 
> 


-- 

Chrissie

--
Linux-cluster mailing list
Linux-cluster at redhat.com
https://www.redhat.com/mailman/listinfo/linux-cluster



