Re: [Linux-cluster] clustat stuck

The pithy ruminations from frederic randriamora on Oct 29, 2010 4:30:03 pm entitled "RE: [Linux-cluster] clustat stuck" were:

==> Hi,
==> I have a 4 node cluster, with multipathed qdisk on a san. The nodes are
==> running redhat 5.4.

I've got a 3 node cluster, with multipathed qdisk on a SAN. The nodes are
running CentOS 5.5:

	Linux 2.6.18-194.32.1.el5 #1 SMP Wed Jan 5 17:52:25 EST 2011 x86_64 x86_64 x86_64 GNU/Linux


==> After a minor change made in cluster.conf on node3 properly propagated
==> by ccs_tool update, clustat is no longer correctly responding in the
==> other 3 nodes.

In my case, I failed a service over from node3 to node2, but made no cluster
configuration changes.

==> node3 is neither nodeid 1 nor qdisk master.
==> clustat on node3 runs fine

Similar. On node2, clustat works fine.

==> clustat on the other nodes
==> either hangs with
==> connect(8, {sa_family=AF_FILE, path="/var/run/cluster/rgmanager.sk"...}, 110
==> from strace
==> or times out with
==> Timed out waiting for a response from Resource Group Manager
==> without displaying the still running services

Exactly the same behavior here.
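
For anyone who wants to check the rgmanager socket without going through
clustat, here is a rough sketch of a probe. It assumes Python 3 is available
on the node (it is not part of a stock EL5 install), and the function name
probe_unix_socket is just something I made up for illustration. It only
attempts the same connect() that strace shows; it does not speak rgmanager's
protocol, so a successful connect says nothing about whether rgmanager would
actually answer a status request.

```python
import errno
import socket

# Path taken from the strace output quoted above.
RGMANAGER_SOCKET = "/var/run/cluster/rgmanager.sk"


def probe_unix_socket(path, timeout=5.0):
    """Try to connect() to a Unix stream socket and return a short status string.

    Hypothetical helper for illustration only; it never sends any data.
    """
    sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
    sock.settimeout(timeout)
    try:
        sock.connect(path)
        return "connected"
    except socket.timeout:
        return "timeout"
    except OSError as exc:
        if exc.errno == errno.ENOENT:
            return "missing"        # no socket file at that path
        if exc.errno == errno.ECONNREFUSED:
            return "refused"        # file exists but nothing is listening
        if exc.errno == errno.EAGAIN:
            # Assumption: on Linux, a connect with a timeout set can fail this
            # way when the listener's backlog is full, i.e. nothing is accepting.
            return "would-block"
        return "error: %s" % exc
    finally:
        sock.close()


if __name__ == "__main__":
    print(RGMANAGER_SOCKET, "->", probe_unix_socket(RGMANAGER_SOCKET))
```

My guess is that "would-block" or "timeout" on the bad nodes and "connected"
on the good one would match what clustat is doing, but I have not confirmed
that this is how rgmanager behaves when it is wedged.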

==> cman_tool services et al. are just fine everywhere,

Agreed. The actual services are running on each node. The report from cman_tool
is correct, but querying the cluster with "clustat" or operations with
"clusvcadm" hang or time out.

==> Although all the services are running fine, I cannot move/stop them
==> anymore with clusvcadm.
==> How to get out of that situation?

Is there any solution to this issue?



