[Linux-cluster] Expected qdiskd behaviour on node failure and reboot implementation

Rafael Micó Miranda rmicmirregs at gmail.com
Tue Jan 19 22:27:29 UTC 2010


Hi all,

Today I was shocked while running a test on one of my cluster
configurations.

A) Environment
- 6 x different servers used as cluster nodes, each with a dual FC HBA
- iLO/DRAC fencing devices for each cluster node
- 2 x different fabrics, each built with 3 FC SAN switches
- 2 x storage arrays, with 23 270GB LUNs of data each
- 1 x Qdisk: a 24th LUN located in one of the storage arrays
- 2 x qdisk heuristics

B) Test
- Unplugging the 2 service interface cables on node A

C) Expected behaviour (due to qdisk and cman timers)
- Qdiskd should notice the loss of the heuristics on node A
- CMAN should notice the loss of connectivity with node A
- The rest of the nodes should fence node A

D) Experienced behaviour
- Qdisk notices the loss of the heuristics on node A
- Qdisk reboots node A via a "hard reset"
- CMAN notices the loss of connectivity with node A
- The rest of the nodes fence it (I see the 2 reboots in the iLO log of
the system)

I was shocked by Qdisk's ability to do a "hard reset" of the
system. I mean: it was not a clean shutdown of the system via a "reboot"
or "poweroff" O.S. command. It looked more like a power reset of
the system. I was expecting qdisk to stop the CMAN service or, in the
most drastic case, to do a clean reboot of the system.

After that, I found this in the qdisk man page:

"By default, only nodes scoring over 1/2 of the total maximum score will
claim they are available via the quorum disk, and a node (master or
otherwise) whose score drops too low will remove itself (usually, by
rebooting).
[...]
reboot="1"
        If set to 0 (off), qdiskd will *not* reboot after a negative
        transition as a result in a change in score (see section 2.2).
        The default for this value is 1 (on)."
        
So my assumption was wrong and this is the default behaviour, isn't it?
I'm pretty sure I did not see this behaviour in my previous tests.
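
If I read the quoted paragraph correctly, the rule can be sketched as
follows. This is only my reading of the man page text written in Python,
not qdiskd's actual code: the function name node_action, the return
labels and the two heuristics worth 1 point each are my own assumptions
for illustration.

```python
# Sketch of the rule quoted from the qdisk man page; NOT qdiskd's real code.
# node_action and its return labels are hypothetical names.

def node_action(score: int, max_score: int, reboot: bool = True) -> str:
    """Return what a node would do for a given heuristic score.

    "available": the score is over half of the total maximum score.
    "reboot":    the score dropped too low and reboot="1" (the default).
    "no-reboot": the score dropped too low but reboot="0".
    """
    # "only nodes scoring over 1/2 of the total maximum score will claim
    # they are available via the quorum disk"
    if 2 * score > max_score:
        return "available"
    return "reboot" if reboot else "no-reboot"


# Assumed values: two heuristics worth 1 point each.
MAX_SCORE = 2
print(node_action(2, MAX_SCORE))                # both heuristics pass
print(node_action(0, MAX_SCORE))                # both fail, default reboot="1"
print(node_action(0, MAX_SCORE, reboot=False))  # both fail, reboot="0"
```

With both heuristics failing on node A, this reading gives "reboot"
unless reboot="0" is set, which would match what I saw in the test.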

Another question is: how does qdisk implement the "reboot" function? Is
it really a "hard reset"? 

Thanks in advance,

Rafael

-- 
Rafael Micó Miranda



