
Re: [Linux-cluster] Configuring rgmanager





Failover will not occur until after CMAN (or gulm) says the node is dead and has been fenced. When using the kernel Service Manager (provided by CMAN), recovery is in the following order:

(1) Fencing
(2) Locking
(3) GFS
(4) User services (e.g. rgmanager)
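
To make the ordering above concrete, here is a minimal sketch in Python. It is only a model of the sequence listed, not code from CMAN or rgmanager; the layer names and both helper functions are made up for illustration.

```python
# Hypothetical model of the recovery order listed above.
# The layer names are illustrative labels, not identifiers used by CMAN.
RECOVERY_ORDER = ["fencing", "locking", "gfs", "user"]


def next_to_recover(done):
    """Return the first layer that has not finished recovery, or None."""
    for layer in RECOVERY_ORDER:
        if layer not in done:
            return layer
    return None


def can_recover(layer, done):
    """A layer may recover only when every layer before it is done."""
    return next_to_recover(done) == layer


if __name__ == "__main__":
    # User services (rgmanager) have to wait while fencing is still pending.
    print(can_recover("user", set()))
    # Once fencing, locking and GFS are done, user services may recover.
    print(can_recover("user", {"fencing", "locking", "gfs"}))
```

Read this way, a user service that has not failed over yet may simply be waiting on an earlier layer, fencing in particular.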

How long did you wait? :)



Results from my tests with two nodes (buba and gump), using the latest CVS (updated today):
I tried to set up a basic script for failover between the two nodes.
Initialization:
- ccsd, cman_tool join, and fence_tool join on both nodes
Then I started rgmanager on both nodes:
the script coucou (echo `uname -n` >> bla.txt) was launched on one of the two nodes.
With clusvcadm I moved the script to gump, and then I rebooted gump.
Here is the syslog on buba:
Feb 28 13:20:17 buba kernel: CMAN: removing node gump from the cluster : Missed too many heartbeats
Feb 28 13:20:17 buba fenced[7573]: gump not a cluster member after 0 sec post_fail_delay
Feb 28 13:20:17 buba fenced[7573]: fencing node "gump"
Feb 28 13:20:20 buba fence_manual: Node 200.0.0.102 needs to be reset before recovery can procede. Waiting for 200.0.0.102 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 200.0.0.102)
Feb 28 13:20:29 buba fenced[7573]: fence "gump" success
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Taking over resource group coucou from down member (null)
Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Resource group coucou started
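
For anyone who wants to measure how long each recovery step took, here is a rough Python sketch that pulls the milestones out of syslog lines like the ones above. The patterns are inferred only from the messages in this log, so treat them as assumptions; the function names are made up, and the script assumes all lines fall on the same day.

```python
import re

# Patterns inferred from the log excerpt above; other versions of fenced
# or clurgmgrd may word these messages differently (assumption).
MILESTONES = [
    ("fence_start", re.compile(r"fenced\[\d+\]: fencing node")),
    ("fence_done", re.compile(r'fenced\[\d+\]: fence ".*" success')),
    ("takeover", re.compile(r"clurgmgrd\[\d+\]: .*Taking over resource group")),
]
TIME_RE = re.compile(r"(\d\d):(\d\d):(\d\d)")


def seconds_of_day(line):
    """Convert the first HH:MM:SS stamp on the line to seconds since midnight."""
    m = TIME_RE.search(line)
    if m is None:
        raise ValueError("no timestamp in line: %r" % line)
    hours, minutes, seconds = (int(part) for part in m.groups())
    return hours * 3600 + minutes * 60 + seconds


def recovery_timeline(lines):
    """Map each milestone name to the time of its first matching line."""
    found = {}
    for line in lines:
        for name, pattern in MILESTONES:
            if name not in found and pattern.search(line):
                found[name] = seconds_of_day(line)
    return found


SAMPLE = [
    'Feb 28 13:20:17 buba fenced[7573]: fencing node "gump"',
    'Feb 28 13:20:29 buba fenced[7573]: fence "gump" success',
    "Feb 28 13:20:32 buba clurgmgrd[7581]: <notice> Taking over resource group coucou from down member (null)",
]

if __name__ == "__main__":
    t = recovery_timeline(SAMPLE)
    print("fencing took", t["fence_done"] - t["fence_start"], "s")
    print("takeover after", t["takeover"] - t["fence_start"], "s")
```

If the "takeover" key is missing from the result, fencing completed but rgmanager never logged a takeover.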


Then gump came back up and rejoined the cluster. Syslog on buba:
Feb 28 13:23:47 buba kernel: CMAN: node gump rejoining

I moved the script back to gump (again with clusvcadm):
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Stopping resource group coucou
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is stopped
Feb 28 13:24:50 buba clurgmgrd[7581]: <notice> Resource group coucou is now running on member 2


Then I rebooted gump again (:)), and that is when the problems started.
Gump was removed from the cluster:
Feb 28 13:25:57 buba kernel: CMAN: removing node gump from the cluster : Missed too many heartbeats
Feb 28 13:25:57 buba fenced[7573]: gump not a cluster member after 0 sec post_fail_delay
Feb 28 13:25:57 buba fenced[7573]: fencing node "gump"
Feb 28 13:26:03 buba fence_manual: Node 200.0.0.102 needs to be reset before recovery can procede. Waiting for 200.0.0.102 to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n 200.0.0.102)
Feb 28 13:26:14 buba fenced[7573]: fence "gump" success


This time rgmanager did nothing.
When I looked at /proc/cluster/services, I had:
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 recover 2 -
[1]

Whereas during the first reboot of gump I had:

Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 run       -
[1]
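
To spot this kind of difference without reading the tables by eye, here is a small Python sketch that parses a /proc/cluster/services dump and lists the services whose state is not "run". The column layout is inferred only from the output pasted in this message, so it is an assumption; the function names are made up.

```python
import re

# Layout assumed from the dumps in this message:
#   <kind>:  "<name>"  <GID> <LID> <state> ...
# followed by a line holding the member list, e.g. "[1 2]".
SERVICE_RE = re.compile(
    r'^(?P<kind>[^:]+):\s+"(?P<name>[^"]+)"\s+'
    r"(?P<gid>\d+)\s+(?P<lid>\d+)\s+(?P<state>\S+)"
)
MEMBERS_RE = re.compile(r"^\[(?P<ids>[\d\s]*)\]$")


def parse_services(text):
    """Return one dict per service with its kind, name, state and members."""
    services = []
    for raw in text.splitlines():
        line = raw.strip()
        m = SERVICE_RE.match(line)
        if m:
            services.append({
                "kind": m.group("kind"),
                "name": m.group("name"),
                "state": m.group("state"),
                "members": [],
            })
            continue
        m = MEMBERS_RE.match(line)
        if m and services:
            # The member list belongs to the service line just above it.
            services[-1]["members"] = [int(x) for x in m.group("ids").split()]
    return services


def not_running(services):
    """Names of services whose state is anything other than 'run'."""
    return [s["name"] for s in services if s["state"] != "run"]


SAMPLE = '''Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1]

DLM Lock Space:  "Magma"                             3   4 run       -
[1]

User:            "usrm::manager"                     2   3 recover 2 -
[1]
'''

if __name__ == "__main__":
    print(not_running(parse_services(SAMPLE)))
```

On a live node the text would come from reading /proc/cluster/services instead of the SAMPLE string.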

Then I tried to bring gump back.

Here is what I saw on gump:
[root gump ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

User:            "usrm::manager"                     0   3 join      S-1,80,2
[]


After that nothing worked. I tried in vain to restart rgmanager on both nodes, but it did not help. I ended up in states where gump showed:

[root gump ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

DLM Lock Space:  "Magma"                             3   6 run       -
[2]

User:            "usrm::manager"                     4   5 run       -
[2]

and buba showed:
[root buba ~]# cat /proc/cluster/services
Service          Name                              GID LID State     Code
Fence Domain:    "default"                           1   2 run       -
[1 2]

(buba does not seem to have any clurgmgrd running, even though I started rgmanager...)

I don't know whether this is a bug in rgmanager or whether I'm doing something wrong, but I don't understand why everything worked during the first reboot and nothing worked afterwards.









