[Linux-cluster] clustat stuck

Fri Apr 1 20:42:34 UTC 2011

The pithy ruminations from frederic randriamora on Oct 29, 2010 4:30:03 pm entitled "RE: [Linux-cluster] clustat stuck" were:

==> Hi,
==> 
==> I have a 4 node cluster, with multipathed qdisk on a san. The nodes are
==> running redhat 5.4.

I've got a 3 node cluster, with multipathed qdisk on a SAN. The nodes are
running CentOS 5.5:

	Linux 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux

	lvm2-cluster-2.02.56-7.el5_5.4
	cman-2.0.115-34.el5_5.4
	rgmanager-2.0.52-6.el5.centos.8
	openais-0.80.6-16.el5_5.9

==> 
==> After a minor change made in cluster.conf on node3 properly propagated
==> by ccs_tool update, clustat is no longer correctly responding in the
==> other 3 nodes.

In my case, I failed over a service from node3 to node2, but made no cluster
configuration changes.

==> node3 is neither nodeid 1 nor qdisk master.
==> 
==> clustat on node3 runs fine

Similar. On node2, clustat works fine.

==> 
==> clustat on the other nodes
==> 
==> either hangs with
==> connect(8, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"...}, 110
==> from strace
==> 
==> 
==> or times out with
==> Timed out waiting for a response from Resource Group Manager
==> without displaying the still running services
==> 

Exactly the same behavior here.
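
For anyone who wants to check the socket without strace, here is a minimal
sketch (Python 3, standard library only) that tries to connect to the
rgmanager socket with a timeout and reports what happened. The helper name
probe_unix_socket and the result strings are made up for illustration; they
are not part of rgmanager or cman. It only tests connect(), so it cannot tell
whether rgmanager would answer a request once connected.

```python
import socket

# Path taken from the strace output quoted above.
RGMANAGER_SOCKET = "/var/run/cluster/rgmanager.sk"


def probe_unix_socket(path, timeout=5.0):
    """Try to connect to a UNIX stream socket and classify the outcome.

    Hypothetical helper for illustration; the return strings are arbitrary.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "connected"
    except socket.timeout:
        # connect() did not complete within the timeout.
        return "timeout"
    except BlockingIOError:
        # EAGAIN: the connection could not be completed immediately,
        # which is where a plain blocking connect() would sit and wait.
        return "busy"
    except FileNotFoundError:
        # The socket file does not exist.
        return "missing"
    except ConnectionRefusedError:
        # The socket file exists but nothing is listening on it.
        return "refused"
    except OSError as exc:
        # Anything else, e.g. permission denied.
        return "error: %s" % exc
    finally:
        sock.close()
```

Calling print(probe_unix_socket(RGMANAGER_SOCKET)) as root on each node should
show whether the node that works and the nodes that hang differ at the
connect() step. A "connected" result on a hanging node would suggest the
problem is in waiting for the reply rather than in connecting, though that
reading is an assumption on my part.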

==> cman_tool services et al. are just fine everywhere,
==> 

Agreed. The actual services are running on each node. The report from cman_tool
is correct, but querying the cluster with "clustat" or running operations with
"clusvcadm" hangs or times out.

==> Although all the services are running fine, I cannot move/stop them
==> anymore with clusvcadm.
==> 
==> How to get out of that situation?

Is there any solution to this issue?

Thanks,

Mark