
[Linux-cluster] forcefully taking over a service from another node, kdump



Setup: 2 Nodes: node1, node2. IPMI fencing mechanism.

I'm trying to minimize downtime and still get a kdump. Fail-over works fine without kdump, but with kdump enabled
I need to set post_fail_delay high enough that the panicking node is not fenced before the dump completes.

For kdump to work, I need to set post_fail_delay to 1200 seconds (long enough for the dump to complete; the nodes have a lot of memory) and have the kdump post script sleep for another 1200 seconds.

That way, if node1 panics, it dumps itself via kdump and then sleeps for a while. node2 then fences node1 (reboots it via IPMI) and takes over the service, most likely while node1 is sleeping in the kdump post script.

The drawbacks are that the service is lost for 1,200 seconds (while kdump runs), and that this assumes kdump will finish within 1,200 seconds.
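
To put a rough number on that first drawback, here is a back-of-the-envelope sketch. Only the 1200-second post_fail_delay comes from my setup; the fencing and service start-up times are placeholder guesses, not measurements.

```shell
#!/bin/bash
# Rough estimate of service downtime with the delay-based approach.
# fence_secs and start_secs are illustrative guesses, not measured values.

post_fail_delay=1200   # from the setup above: delay before node2 fences node1
fence_secs=60          # assumption: time for the IPMI reboot to be confirmed
start_secs=30          # assumption: time for the service to start on node2

# Sum the three phases that have to pass before the service is back.
downtime_estimate() {
    echo $(( $1 + $2 + $3 ))
}

downtime_estimate "$post_fail_delay" "$fence_secs" "$start_secs"
```

With those guesses the delay dominates everything else, which is why I want to avoid waiting for it.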

=== Working on a new solution ===

I'm working on a solution based on a kdump_pre script.
When node1 panics, before it starts dumping, it contacts node2 so that node2 can attempt to take over the service.
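
A minimal sketch of what such a kdump_pre hook could look like is below. The peer address, the port, the message format and the `--send` switch are all hypothetical names made up for illustration. It also assumes that bash (with its /dev/udp redirection) and a configured network are available in the kdump environment, which may not hold on every system.

```shell
#!/bin/bash
# Hypothetical kdump_pre hook: tell the peer node that this node is about to dump.
# PEER_HOST, PEER_PORT, the message format and --send are illustrative assumptions.

PEER_HOST="${PEER_HOST:-node2}"
PEER_PORT="${PEER_PORT:-7410}"

# Build the one-line notification that the peer would parse.
panic_message() {
    printf 'KDUMP_START %s\n' "$1"
}

# Best-effort UDP send using bash's /dev/udp pseudo-device.
# The hook must never delay or abort the dump, so errors are swallowed
# and the function always returns 0.
notify_peer() {
    local msg
    msg="$(panic_message "$1")"
    ( printf '%s\n' "$msg" > "/dev/udp/${PEER_HOST}/${PEER_PORT}" ) 2>/dev/null || true
    return 0
}

# Send only when explicitly asked, so the file can also be sourced for testing.
if [ "${1:-}" = "--send" ]; then
    notify_peer "$(uname -n)"
fi
```

The send is fire-and-forget on purpose: if node2 cannot be reached, node1 should still go on to dump. Something on node2 would have to listen on that port and react, which is the part I am stuck on.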

On node2, I find the <service> running on node1 and issue:
    clusvcadm -r <service> 

Because of node1's state (it is kdumping), the command just hangs, so it does not reduce the service downtime.

What can I do on node2 to forcefully take over the service from node1 once node1 has contacted node2 at the kdump_pre stage?


Thanks
 
