
[Linux-cluster] Cluster failing after rebooting a standby node



Hello all,

We've been encountering an issue with RHCS4 U6 (using the U5 version of system-config-cluster, as the U6 version is broken) that results in the cluster failing after one of the standby nodes is rebooted: CMAN dies after too many transition restarts.

We have a 7-node cluster, with 5 active nodes and 2 standby nodes. We are running the cluster in broadcast mode for cluster communication (the default for CS4); changing to multicast isn't an option at the moment because of our Cisco switching infrastructure. The cluster hardware is IBM HS21 blades spread across 2 IBM H series BladeCenter chassis (3 blades in one chassis, 4 in the other). Each chassis network switch module has dual gigabit uplinks to a Cisco switch.

We have done extensive analysis of our network to confirm that the underlying network is not preventing the cluster nodes from talking to one another, so we have ruled that out as a cause of the problem.

The cluster is currently a pre-production system that we are testing before going live, so the nodes are basically sitting idle at the moment whilst we exercise the cluster itself.

What we have seen is that after the cluster has been operational for several days, initiating a reboot of one of the standby nodes (which isn't running any clustered services at the time) causes the other cluster nodes to start filling their logs with:

Apr 14 15:44:57 server01 kernel: CMAN: Initiating transition, generation 64
Apr 14 15:45:12 server01 kernel: CMAN: Initiating transition, generation 65

With the generation number increasing until CMAN dies with:

Apr 14 15:48:24 server01 kernel: CMAN: too many transition restarts - will die
Apr 14 15:48:24 server01 kernel: CMAN: we are leaving the cluster. Inconsistent cluster view
Apr 14 15:48:24 server01 kernel: SM: 01000004 sm_stop: SG still joined
Apr 14 15:48:24 server01 kernel: SM: 03000003 sm_stop: SG still joined
Apr 14 15:48:24 server01 clurgmgrd[22461]: <warning> #67: Shutting down uncleanly
Apr 14 15:48:24 server01 ccsd[7135]: Cluster manager shutdown. Attempting to reconnect...
Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate. Refusing connection.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing connect: Connection refused
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-111).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing get: Invalid request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Invalid descriptor specified (-21).
Apr 14 15:48:25 server01 ccsd[7135]: Someone may be attempting something evil.
Apr 14 15:48:25 server01 ccsd[7135]: Error while processing disconnect: Invalid request descriptor
Apr 14 15:48:25 server01 ccsd[7135]: Cluster is not quorate. Refusing connection.

The interesting thing is that immediately after rebooting all of the nodes in the cluster and restarting the cluster services, the problem cannot be replicated. Typically the cluster has to have been running untouched for 3-4 days before we can replicate the problem again (i.e. I reboot one of the standby nodes and it fails again).

I made a change yesterday to cluster.conf to change the logging facility and raise the logging level to debug (level 7), and after using ccs_tool to apply the changes to the cluster online, once again I can't replicate the problem (even though immediately before this change I could).
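For reference, the change was roughly along these lines (paraphrased, not our exact file: the cluster name and config_version here are placeholders, and the log_facility/log_level attributes on the rgmanager <rm> block are how I understand CS4 logging to be tuned, so treat the exact attribute names as my assumption):

```xml
<!-- Excerpt from /etc/cluster/cluster.conf (sketch).
     config_version must be bumped, otherwise the running
     cluster will not accept the updated file. -->
<cluster name="testcluster" config_version="12">
    <!-- rgmanager logging raised to debug (level 7) -->
    <rm log_facility="local4" log_level="7">
        <!-- failover domains and services unchanged -->
    </rm>
</cluster>
```

The updated file was then pushed to the running cluster with:

```shell
ccs_tool update /etc/cluster/cluster.conf
```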

Has anyone experienced anything even remotely similar to this (I couldn't see anything similar reported in the list archives) and/or have any suggestions as to what might be causing the issue?

Cheers,

Ben

