Re: [Linux-cluster] post_fail_delay versus deadnode_timeout

Riaan van Niekerk wrote:

We are trying to capture diskdumps when a lock_dlm kernel panic happens and need to increase either post_fail_delay or deadnode_timeout to prevent the dumping node from being fenced.

Are there any advantages or disadvantages to using either? Which is recommended?

post_fail_delay and diskdump have come up previously, with some good answers from David.

Note: to capture a "sysrq t", we manually increase deadnode_timeout and decrease it again afterwards, but we don't have this luxury with a kernel panic (which can happen at any time).


Having spent some time researching this, and with some help from Red Hat Support, here is an attempt at an answer. I use power-fencing. Some of these might not apply to I/O fencing:


post_fail_delay

Advantages:
- A single place to change it (cluster.conf), which makes it global across the cluster
- If a failed node is detected, resources will be relocated immediately (instead of waiting for deadnode_timeout to be reached and only then relocating)
- Usage case: post-kernel panic, when you need to capture a disk-/netdump

Disadvantages:
- The fence daemon needs to be restarted for the change to apply (i.e. in all likelihood you need to reboot all nodes)
- Slight annoyance: depending on how long you set post_fail_delay, a node may already be restarting when it is fenced, requiring another restart.
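
For reference, post_fail_delay is an attribute of the fence_daemon element in cluster.conf. Below is a minimal sketch for checking which value a node has configured. The helper name get_post_fail_delay, the sample fragment and the delay values in it are made up for illustration; only the element and attribute names come from cluster.conf itself.

```shell
#!/bin/sh
# Hypothetical helper: print the post_fail_delay value (in seconds) found in
# a cluster.conf file. Prints nothing if the attribute is not set.
get_post_fail_delay() {
    conf="${1:-/etc/cluster/cluster.conf}"
    sed -n 's/.*<fence_daemon[^>]*post_fail_delay="\([0-9][0-9]*\)".*/\1/p' "$conf"
}

# Illustrative fragment only; the cluster name and delays are example values.
sample=$(mktemp)
cat > "$sample" <<'EOF'
<cluster name="example" config_version="2">
  <fence_daemon post_join_delay="3" post_fail_delay="120"/>
</cluster>
EOF

delay=$(get_post_fail_delay "$sample")
echo "post_fail_delay=$delay"
rm -f "$sample"
```

Called without an argument, the helper reads /etc/cluster/cluster.conf, so it can be run on each node to confirm they all carry the same value.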


deadnode_timeout

Advantages:
- Can be set dynamically
- Useful if you have warning that the problem will materialize (we have a scenario like that)
- Usage case: when you need to run "sysrq t" or some other intrusive command which would otherwise cause a node to be fenced: increase, sysrq, decrease

Disadvantages:
- Needs to be set on all nodes
- Not persistent. You need to hack the cman init script to make it persistent.
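
The "increase, sysrq, decrease" routine can be wrapped up roughly as follows. This is only a sketch: the helper name with_deadnode_timeout is made up, and the /proc path is where I would expect the tunable to live on a RHEL4-era cman, so verify it on your own nodes first. Since the setting is per node, it would have to be run on every node.

```shell
#!/bin/sh
# Assumed location of the tunable (RHEL4-era cman); check it on your nodes.
DEADNODE_FILE="${DEADNODE_FILE:-/proc/cluster/config/cman/deadnode_timeout}"

# Hypothetical helper: raise deadnode_timeout to $1 seconds, run the remaining
# arguments as a command, then write the previous value back.
with_deadnode_timeout() {
    new="$1"; shift
    old=$(cat "$DEADNODE_FILE") || return 1
    echo "$new" > "$DEADNODE_FILE" || return 1
    "$@"
    rc=$?
    # Restore the old value whether or not the command succeeded.
    echo "$old" > "$DEADNODE_FILE"
    return $rc
}

# Example (as root), left commented out on purpose:
#   with_deadnode_timeout 300 sh -c 'echo t > /proc/sysrq-trigger'
```

The 300 seconds in the example is an arbitrary value. Nothing here survives a reboot, which is exactly the persistence problem noted above.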

Corrections/additions welcome.
