[Linux-cluster] rgmanager and clvm don't work after reboot


I have a cluster of three nodes (node1, node2, node3) with a qdisk,
running clvmd on storage. Current uptime is 22 days. Today I had to turn
off one node for hardware maintenance. After powering it back on, the
node joined the cluster, but clvmd and rgmanager just get stuck. I
suppose they can't communicate with cman. The other nodes see the
rebooted node, and it sees them, but rgmanager is stuck. On the booted
node I see this:

# clustat
Cluster Status for one2play-c00 @ Tue Mar 23 21:06:29 2010
Member Status: Quorate

 Member Name                            ID   Status
 ------ ----                            ---- ------
 node1                                      1 Online, Local
 node2                                      2 Online
 node3                                      3 Online
 /dev/dm-0                                  0 Online, Quorum Disk

No services are listed at all. So I disabled startup of rgmanager and
clvmd:

# chkconfig --level 2345 rgmanager off
# chkconfig --level 2345 clvmd off

Then I tried to reboot the node. It got stuck stopping rgmanager, so I
killed the stop script. After that the shutdown continued, but hung
again after leaving the qdisk. The only thing I could do then was to
reboot the node by issuing fence_node node1 on node2.

Now, after startup, I have exactly the same issue. Here I try to start
clvmd manually:

# /etc/init.d/clvmd start
Starting clvmd: clvmd startup timed out
# ps -ef | grep clvm
root      6162     1  0 20:59 ?        00:00:00 clvmd -T20

So obviously something's wrong. I've seen this issue before, and I
resolved it by rebooting all the cluster members, but that solution is
out of the question in the long term.
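
For comparison between the stuck node and a healthy one, here is a
minimal sketch of the state that could be collected on each. It assumes
the cman-era tools (cman_tool, group_tool) are installed; the function
names run_if_present and collect_cluster_state are made up for this
sketch and are not part of any cluster package.

```shell
#!/bin/sh
# Sketch only: run a diagnostic if its tool is installed, and label the output.
run_if_present() {
    # $1 is the program name; all arguments are passed through unchanged.
    if command -v "$1" >/dev/null 2>&1; then
        echo "== $*"
        "$@" 2>&1
    else
        echo "== $1: not installed, skipped"
    fi
}

# Hypothetical wrapper: gather membership and group state in one go.
collect_cluster_state() {
    run_if_present cman_tool status     # quorum and vote summary
    run_if_present cman_tool nodes      # membership as cman sees it
    run_if_present cman_tool services   # service groups (assumed available)
    run_if_present group_tool ls        # group states, e.g. for clvmd
}
```

Running collect_cluster_state on node1 and on node2 and diffing the two
outputs should show whether any group is stuck waiting on a member.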

Do you have any ideas?

I've found this old thread, with the following explanation:
> Rgmanager thinks qdisk is a node (with node ID 0), so it tries to send
> VF information to node 0 - which doesn't exist, causing rgmanger to
> not work when qdisk is running :(


Is this a similar issue? Are services trying to communicate with member
0, which is a qdisk and not a real member? :-/
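
To make the member 0 entry easy to spot, here is a rough sketch that
pulls it out of clustat output. It assumes the table layout shown above
(name, then ID, then a status starting with Online or Offline);
members_with_id_zero is a hypothetical helper name, not an existing
tool.

```shell
#!/bin/sh
# Hypothetical helper: print the name of every clustat member whose ID is 0.
# Reads clustat output on stdin.
members_with_id_zero() {
    awk '
        # A member row has a numeric ID in field 2 and a status in field 3.
        # Header and separator lines fail the numeric check and are skipped.
        $2 ~ /^[0-9]+$/ && $2 == 0 && $3 ~ /^(Online|Offline)/ { print $1 }
    '
}
```

Usage would be: clustat | members_with_id_zero
On this cluster the only match should be the quorum disk, /dev/dm-0.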

