[Linux-cluster] GFS problem


I have a problem with a shared GFS resource on a 12-node Cluster Manager cluster.

The cluster starts up properly if all nodes are booted at once. Any major interaction with one of the nodes (reboot, cman restart) causes GFS to lock up and the cluster to fall into an unstable split state.

In this state, the logs, clustat and "cman_tool status" all report the cluster as fully connected and working. However, "cman_tool resources" shows only the fence resource, in the JOIN_START_WAIT state (or JOIN_STOP_WAIT, depending on what was done to the cluster in the meantime), and the node sets it lists overlap but differ depending on which node I run the command on.

So far, the only method that gets the cluster out of this state is to manually reboot all the nodes at once, but this is not feasible given the uptime expectations and the high load carried by the cluster.

We're completely in the dark about the possible cause of the problem. Any help is appreciated.



