
Re: [Linux-cluster] Cluster failing after rebooting a standby node



On Wed, 2008-04-23 at 11:33 +0800, Ben J wrote:

> What we have seen happening, is that we have the cluster operational for 
> several days and when initiating a reboot of one of the standby nodes 
> (that isn't running any clustered services at the time), the other 
> cluster nodes start filling the logs with:
> 
> Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64
> Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65
> 
> With the generation number increasing until CMAN dies with:
> 
> Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts - will die
> Apr 14 15:48:24 server01 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view

^^^^ This is the problem.

vvvv These are all caused by that problem, and will 
     go away when the above is resolved.

> Apr 14 15:48:24 server01 kernel: SM: 01000004 sm_stop: SG still joined
> Apr 14 15:48:24 server01 kernel: SM: 03000003 sm_stop: SG still joined
> Apr 14 15:48:24 server01 clurgmgrd[22461]: <warning> #67: Shutting down uncleanly
> Apr 14 15:48:24 server01 ccsd[7135]: Cluster manager shutdown.  Attemping to reconnect...
> <snip>...
> Apr 14 15:48:25 server01 ccsd[7135]: Error while processing disconnect: Invalid request descriptor
> Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate.  Refusing connection.


> The interesting thing is that immediately after rebooting all of the 
> nodes within the cluster and restarting the cluster services, the 
> problem cannot be replicated.  Typically the cluster system has to have 
> been running for 3-4 days untouched before we can then replicate the 
> problem again (i.e. I reboot one of the standby nodes and it fails again).
> 
> I made a change yesterday to cluster.conf to increase the logging 
> facility and logging level (set it to debug level - 7) and after using 
> ccs_tool to apply the changes to the cluster online, once again I can't 
> replicate the problem (even though immediately before this I could 
> replicate the problem).

On RHEL4, there's an ugly, arcane extra step you need to do after
applying the change with ccs_tool, which is to tell CMAN the new
config version:

  cman_tool version -r <new_config_version>

I'm not sure this is the cause of the 'too many transitions' problem you
hit.  (Unfortunately, I'm not one of the people who fully understands
what causes 'too many transitions'...)
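
For what it's worth, here is a rough sketch of how to look up the number
to pass to the cman_tool step. It assumes the config lives at
/etc/cluster/cluster.conf and that the top-level <cluster> element
carries a config_version="N" attribute. The helper name
get_config_version and the CLUSTER_CONF variable are made up for
illustration. It only prints the command; it does not run it.

```shell
#!/bin/sh
# Hypothetical helper: print the config_version attribute of the
# <cluster> element in a cluster.conf-style file.
get_config_version() {
    sed -n 's/.*<cluster[^>]*config_version="\([0-9][0-9]*\)".*/\1/p' "$1" | head -n 1
}

# Assumed default location; override with CLUSTER_CONF if yours differs.
conf="${CLUSTER_CONF:-/etc/cluster/cluster.conf}"

if [ -r "$conf" ]; then
    ver=$(get_config_version "$conf")
    # Print the command so it can be reviewed before being run by hand.
    echo "cman_tool version -r $ver"
fi
```

Review the printed command, then run it once the updated cluster.conf
has been propagated to the other nodes.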

-- Lon

