[Cluster-devel] fencing conditions: what should trigger a fencing operation?

Fabio M. Di Nitto fdinitto at redhat.com
Thu Nov 19 11:35:05 UTC 2009


Hi guys,

I have just hit what I think is a bug, and I think we need to review our
fencing policies.

This is what I saw:

- 6-node cluster (node1-3 x86, node4-6 x86_64)
- node1 and node4 perform a simple gfs2 mount -> wait -> umount -> wait
-> mount loop, forever
- node2 and node5 perform read/write operations on the same gfs2
partition (nothing fancy really)
- node3 is in charge of creating and removing clustered LVM volumes.
- node6 is in charge of constantly relocating rgmanager services.

The cluster is running qdisk too.

It is a known issue that node1 will crash at some point (kernel OOPS).

Here are the interesting bits:

node1 is hanging in mount/umount (expected).
node2, node4 and node5 continue to operate as normal.
node3 is now hanging while creating a VG.
node6 is trying to stop the service from node1 (it happened to be located
there at the time of the crash).

I was expecting that, after a failure, node1 would be fenced, but nothing
happens automatically.

Manually fencing the node will recover all hanging operations.

After talking to Steven W., it appears that our methods for defining and
detecting a failure should be improved.

My questions, driven simply by the fact that I am not a fencing expert, are:

- what are the current fencing policies?
- what can we do to improve them?
- should we monitor for more failures than we do now?
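
To make the last question a bit more concrete, here is a rough sketch
(Python, purely illustrative, not a proposal for fenced itself) of the kind
of check I mean: an active per-node probe that escalates to a fencing
operation when a node is still a cluster member but has stopped making
progress. probe_node() and cluster_members() are placeholders I made up;
only fence_node is the tool we already use for manual fencing.

#!/usr/bin/env python
# Illustrative watchdog sketch: escalate to fencing when a node is still
# listed as a cluster member but stops responding to an active health probe.

import subprocess
import time

PROBE_INTERVAL = 5   # seconds between probe rounds
MAX_FAILURES = 3     # consecutive probe failures before escalating

def cluster_members():
    # Placeholder: in practice this would come from cman_tool nodes
    # or the membership API, not a hardcoded list.
    return ["node1", "node2", "node3", "node4", "node5", "node6"]

def probe_node(node):
    # Placeholder probe: a real check would have to exercise the cluster
    # stack itself (e.g. a timed clustered lock request), since an oopsed
    # kernel can keep answering heartbeats while gfs2/clvm operations hang.
    return subprocess.call(["ping", "-c", "1", "-w", "2", node]) == 0

def main():
    failures = dict((n, 0) for n in cluster_members())
    while True:
        for node in cluster_members():
            if probe_node(node):
                failures[node] = 0
                continue
            failures[node] = failures.get(node, 0) + 1
            if failures[node] >= MAX_FAILURES:
                # Node is a member but not making progress: fence it so
                # the hanging lock/LVM/rgmanager operations can recover.
                subprocess.call(["fence_node", node])
                failures[node] = 0
        time.sleep(PROBE_INTERVAL)

if __name__ == "__main__":
    main()

The interesting policy question is what such a probe should exercise and
how many failures we should tolerate before pulling the trigger.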

Cheers
Fabio



