[Linux-cluster] Cluster blocked because "waiting for 1 more stopped message"
Dirk H. Schulz
dirk.schulz at kinzesberg.de
Sat Jan 23 10:31:49 UTC 2010
Hi folks,
while testing activation and deactivation of my cluster logical volumes,
I drove the cluster into a state where clvmd on one node was stuck, so
I decided to reboot that node.
The reboot did not complete because the kernel could not unmount one of
the file systems, so I had to power the node off. So far my own fault,
I thought.
On the other node, "group_tool dump" returned:
> 1264171959 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
> 1264171959 2:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
> 1264171959 1:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
> 1264171959 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
> 1264171959 got client 13 dump
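
For anyone wanting to see at a glance which groups are blocked in a dump
like this, here is a minimal Python sketch. It assumes the line layout
seen in the excerpt (timestamp, a number and group name separated by a
colon, then the "waiting for N more stopped messages" text); that layout
is inferred from these few lines only, and the field names used
(level, group, missing, event) are labels chosen for illustration, not
taken from any documentation.

```python
import re

# Assumed line layout, inferred only from the excerpt in this post:
#   <timestamp> <level>:<group> waiting for <n> more stopped messages before <event> <id>
WAITING = re.compile(
    r"\b(\d+)\s+(\d+):(\S+)\s+waiting for\s+(\d+)\s+more stopped messages"
    r"\s+before\s+(\S+)\s+(\d+)\b"
)


def waiting_groups(dump_text):
    """Return (level, group, missing, event) tuples for each waiting entry."""
    # Drop mail quote markers and undo line wrapping so each entry is contiguous.
    flat = " ".join(line.lstrip("> ").strip() for line in dump_text.splitlines())
    return [
        (int(level), group, int(missing), event)
        for _ts, level, group, missing, event, _id in WAITING.findall(flat)
    ]


# The dump excerpt from this post, including the mail client's wrapping.
SAMPLE = """\
> 1264171959 0:default waiting for 1 more stopped messages before
> LEAVE_ALL_STOPPED 2
> 1264171959 2:XenImages waiting for 1 more stopped messages before
> LEAVE_ALL_STOPPED 2
> 1264171959 1:XenImages waiting for 1 more stopped messages before
> LEAVE_ALL_STOPPED 2
> 1264171959 1:clvmd waiting for 1 more stopped messages before
> LEAVE_ALL_STOPPED 2
> 1264171959 got client 13 dump
"""

for level, group, missing, event in waiting_groups(SAMPLE):
    print(f"{level}:{group} still needs {missing} stopped message(s) before {event}")
```

The sketch only reports what the dump says; it does not touch the
cluster and cannot release the blocked groups.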
Even after I rebooted the problem node and restarted cman, clvmd and
rgmanager there, services on the working node remained stuck, with the
above messages still being shown.
I did not find any way to bring the cluster back into working condition
other than rebooting the working node as well. Even a "kill -9" on clvmd
did not work!
Is there any way to manually fake the awaited "stopped" message so that
the rest of the cluster can carry on? There MUST be, because otherwise
this would defeat the concept of a cluster as a whole: waiting for a
dead node's last "stopped" message before continuing does not make
much sense to me.
If anyone out there could help me understand why it is implemented that
way and point me at what to do in such a case, I would be very happy.
Dirk