[Linux-cluster] cluster service not running any more

Sun Jul 13 17:23:50 UTC 2008

Hi folks,

I have set up a two-node cluster on 5.2 with system-config-cluster. It is 
quite simple: the only service is an IP resource that is switched between 
the nodes.

The cluster started up fine the first time, and the virtual IP was where 
it belonged. Since then I have not changed anything; I simply had to 
restart the machines for other reasons.

Now nothing works as it should:
- shutting down clurgmgrd normally (service rgmanager stop) is impossible; 
even kill -9 does not work. The only way to get rid of clurgmgrd is to 
call "reboot" twice, which forces a reboot.
- after the reboot I can start the cluster again manually (I did not dare 
to do it at system startup). The daemons start and nothing unusual is 
logged, but
  a) the service containing the IP resource is not started
  b) clustat on the primary node complains "timed out trying to connect to 
Resource Group Manager"
  c) clustat on both nodes shows the node state, but does not list the 
service

I have tried everything to get the environment clean (shut down the 
firewall, set selinux to permissive, etc.), but the result is always the 
same. Since I did not change anything after the first successful start of 
the cluster, I wonder
- if there is some runtime data or temporary files that the resource group 
manager writes to disk and tries to reread after a reboot (remember, I had 
to kill it by force to be able to reboot my machines)
- if it is possible at all to successfully run a cluster with cman and 
clurgmgrd.

In case it helps here is my cluster.conf:

<?xml version="1.0" ?>
<cluster config_version="5" name="GatewayCluster">
	<fence_daemon post_fail_delay="0" post_join_delay="3"/>
	<clusternodes>
		<clusternode name="rtr1hb" nodeid="1" votes="1">
			<fence>
				<method name="1">
					<device name="fence1" nodename="rtr1hb"/>
				</method>
			</fence>
		</clusternode>
		<clusternode name="rtr2hb" nodeid="2" votes="1">
			<fence>
				<method name="1">
					<device name="fence2" nodename="rtr2hb"/>
				</method>
			</fence>
		</clusternode>
	</clusternodes>
	<cman expected_votes="1" two_node="1"/>
	<fencedevices>
		<fencedevice agent="fence_manual" name="fence1"/>
		<fencedevice agent="fence_manual" name="fence2"/>
	</fencedevices>
	<rm>
		<failoverdomains>
			<failoverdomain name="Gateway1" ordered="1" restricted="1">
				<failoverdomainnode name="rtr1hb" priority="1"/>
				<failoverdomainnode name="rtr2hb" priority="2"/>
			</failoverdomain>
		</failoverdomains>
		<resources>
			<ip address="IP Address" monitor_link="1"/>
		</resources>
		<service autostart="1" domain="Gateway1" name="Gateway1-IP">
			<ip ref="IP Address"/>
		</service>
	</rm>
</cluster>
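
To rule out a simple mismatch inside the file itself, the cross-references 
in it (fence device names, failover domain members, the ip ref in the 
service) can be checked mechanically. What follows is only a rough sketch 
using the Python standard library; the function name check_cluster_conf and 
the wording of its messages are made up for illustration and are not part 
of any cluster tool. It only looks for dangling names and says nothing 
about whether rgmanager accepts the file.

```python
import xml.etree.ElementTree as ET


def check_cluster_conf(xml_text):
    """Return a list of dangling references found in a cluster.conf text.

    Hypothetical helper for illustration only; not part of cman/rgmanager.
    """
    root = ET.fromstring(xml_text)
    problems = []

    # Names that other elements are allowed to refer to.
    nodes = {n.get("name") for n in root.iter("clusternode")}
    fence_devices = {d.get("name") for d in root.iter("fencedevice")}
    domains = {d.get("name") for d in root.iter("failoverdomain")}
    # Only <ip> elements that carry an address attribute define a resource.
    ips = {i.get("address") for i in root.iter("ip") if i.get("address")}

    # Every <device> in a fence method should name a defined <fencedevice>.
    for dev in root.iter("device"):
        if dev.get("name") not in fence_devices:
            problems.append("undefined fence device: %s" % dev.get("name"))

    # Every failover domain member should be a known cluster node.
    for member in root.iter("failoverdomainnode"):
        if member.get("name") not in nodes:
            problems.append(
                "failover domain member is not a cluster node: %s"
                % member.get("name"))

    # Every service should use a defined domain and defined ip resources.
    for svc in root.iter("service"):
        domain = svc.get("domain")
        if domain is not None and domain not in domains:
            problems.append(
                "service uses undefined failover domain: %s" % domain)
        for ip in svc.iter("ip"):
            ref = ip.get("ref")
            if ref is not None and ref not in ips:
                problems.append(
                    "service uses undefined ip resource: %s" % ref)

    return problems
```

Called with the contents of the file above, for example 
check_cluster_conf(open("/etc/cluster/cluster.conf").read()), it returns a 
list of messages; an empty list means that no dangling names were found.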

The logs show the nodes successfully joining the cluster and so on, with 
clurgmgrd starting last; after that there is nothing more from the cluster 
daemons.

Any hint or help is appreciated. I am stuck and do not know where to 
look.

Dirk