
[Linux-cluster] Cluster blocked because "waiting for 1 more stopped message"



Hi folks,

While experimenting with activating and deactivating my clustered logical volumes, I drove the cluster into a state where clvmd on one node was stuck, so I decided to reboot that node. The reboot hung because the kernel could not unmount some file system, and I had to power the node off. So far, my own fault, I thought.

On the other node, "group_tool dump" returned:
1264171959 0:default waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 2:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 1:XenImages waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
1264171959 got client 13 dump
Even after rebooting the problem node and restarting cman, clvmd and rgmanager, the services on the working node remained stuck, with the messages above still being shown.
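
To see at a glance which groups are blocked, the dump output can be filtered with a small script. This is only a minimal sketch: it assumes the line format is exactly as in the output quoted above, the function name blocked_groups is made up for illustration, and I do not assign any meaning to the number before the colon or to the trailing number.

```python
import re

# Matches entries such as:
#   1264171959 1:clvmd waiting for 1 more stopped messages before LEAVE_ALL_STOPPED 2
# The pattern is an assumption based only on the dump lines quoted in this post.
WAITING = re.compile(
    r"(?P<ts>\d+) (?P<prefix>\d+):(?P<group>\S+) waiting for (?P<count>\d+) "
    r"more stopped messages before (?P<event>\S+) (?P<trailing>\d+)"
)


def blocked_groups(dump_text):
    """Return (prefix, group, missing_count, event) for every waiting entry."""
    return [
        (int(m["prefix"]), m["group"], int(m["count"]), m["event"])
        for m in WAITING.finditer(dump_text)
    ]


if __name__ == "__main__":
    import sys

    # Hypothetical usage: group_tool dump | python blocked_groups.py
    for prefix, group, count, event in blocked_groups(sys.stdin.read()):
        print(f"{prefix}:{group} still waits for {count} stopped message(s) before {event}")
```

Because finditer scans the whole text, it does not matter whether the entries arrive one per line or run together on a single line; lines such as "got client 13 dump" are simply ignored.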

I did not find any way to get the cluster back into working condition other than rebooting the working node as well. Even a "kill -9" on clvmd did not work!

Is there any way to manually fake the awaited "stopped message" so that the rest of the cluster can carry on? There MUST be, because otherwise this would defeat the whole concept of a cluster: waiting for a dead node's last "stopped" message before continuing does not make much sense to me.

If anyone out there could help me understand why it is implemented that way and point me at what to do in such a case, I would be very happy.

Dirk


