Unfortunately, in a rush of desperation I rebooted the complete
cluster and am now waiting for a new gfs_fsck to finish.
However, I did find there was a segfault in one (unused) Ethernet interface. Maybe this causes communication problems (lost acks...).
I will test this hypothesis tomorrow morning (I am on my way out of the office because of the time zone...).
Thanks for your very useful hint and explanation.
David Teigland wrote:
On Thu, Apr 20, 2006 at 09:56:20AM +0200, Fernando Nino wrote:
> I am running GFS 6.1 with dlm on a cluster (4 nodes + front-end) of
> dual-headed Opterons and RHEL4U3. Because of some problems (kernel
> panic...) I had to hard boot some nodes of the cluster. Now, some gfs
> partitions won't mount. They will simply keep waiting forever for the
> "join" of the GFS group:
>
> So... three questions:
> - What is the join exactly doing? Cluster status is fine, everybody is
>   member ...

From all 5 nodes it would be good to see:
- cman_tool services
- /var/log/messages
- /proc/cluster/lock_dlm/debug

> - What does the status code mean in the cman_tool output? S-2,2,4

S-2: join event state is SEST_JOIN_ACKWAIT
 ,2: join event flag is SEFL_ALLOW_JOIN
 ,4: number of acks to our join request is 4

So, the node is waiting for acks to its join request. It needs 5 but has
only got 4; someone hasn't sent a reply for some reason. We might be able
to figure out who and why given all the info from the other nodes.
Rebooting the node that's not replied might resolve things.

Dave
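
For reference, the reading of the status code above can be sketched in a few lines of Python. This is only an illustration of the explanation in the reply, not anything shipped with cman: the function name decode_join_code and the two lookup tables are hypothetical, and they contain only the two values named above (state 2 and flag 2). Any other value is reported as unknown, and the flag field is treated as a plain number, which is an assumption.

```python
# Hypothetical helper: decodes a code such as "S-2,2,4" following the
# explanation in the reply. Only the values named there are mapped.
KNOWN_STATES = {2: "SEST_JOIN_ACKWAIT"}
KNOWN_FLAGS = {2: "SEFL_ALLOW_JOIN"}


def decode_join_code(code, expected_acks):
    """Split '<prefix>-<state>,<flag>,<acks>' into its three fields."""
    prefix, _, rest = code.partition("-")
    state, flag, acks = (int(part) for part in rest.split(","))
    return {
        "prefix": prefix,
        "state": KNOWN_STATES.get(state, "unknown (%d)" % state),
        "flag": KNOWN_FLAGS.get(flag, "unknown (%d)" % flag),
        "acks": acks,
        # Number of replies the joining node is still waiting for.
        "missing_acks": max(expected_acks - acks, 0),
    }


# 5 nodes in the cluster (4 nodes + front-end), so 5 acks are expected.
info = decode_join_code("S-2,2,4", expected_acks=5)
print(info)
```

With expected_acks=5, missing_acks comes out as 1, which matches the diagnosis that one node has not replied to the join request.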