[Linux-cluster] fencing loop in a 2-node partitioned cluster

Tue Feb 24 08:26:03 UTC 2009

Actually my situation is pretty different, and worse.
I have a two-node cluster with qdisk and HP iLO based fencing; the
components are RHEL 5U3 based.
If I panic a node, the other one correctly fences it, with the default
action of rebooting it. The converse is also true.
But if, for example, I take down the intracluster network (it is actually
bonded, but I'm trying to reproduce as many scenarios as I can), the
reaction is that each node fences the other, and they both remain
powered off... so no loop at all.

Here is the relevant information from my cluster.conf:
<?xml version="1.0"?>
<cluster alias="oracs" config_version="46" name="oracs">
        <cman expected_votes="3" two_node="0"/>
        <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="20"/>
        <clusternodes>
                <clusternode name="node01" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ilonode01"/>
                                </method>
                        </fence>
                </clusternode>
                <clusternode name="node02" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="ilonode02"/>
                                </method>
                        </fence>
                </clusternode>
        </clusternodes>
        <quorumd device="/dev/mapper/mpath3" interval="3" label="acsquorum" log_facility="local4" log_level="7" tko="5" votes="1">
                <heuristic interval="2" program="ping -c1 -w1 10.4.5.250" score="1" tko="3"/>
        </quorumd>
        <fencedevices>
                <fencedevice agent="fence_ilo" hostname="10.4.192.208" login="fenceuser" name="ilonode01" passwd="xxxxx"/>
                <fencedevice agent="fence_ilo" hostname="10.4.192.209" login="fenceuser" name="ilonode02" passwd="xxxxx"/>
        </fencedevices>

The heuristic IP of the qdisk is on the production LAN (10.4.5.x), while
the intracluster network is on another LAN (192.168.16.x).
Is there any parameter I can configure to prevent this situation, or is
it by design?
I would expect one node (for example the quorum master node) to survive
and successfully fence the other...
All the more so because in this scenario:
- both nodes see the SAN and the quorum disk
- both nodes see production LAN
- both nodes see the status of the other one via iLO commands
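
One setting that may be relevant to this scenario is the totem token timeout relative to the qdisk timeout. The fragment below is only a sketch under assumptions: the totem element is not part of the configuration posted above, and the value 33000 is purely illustrative, chosen to be a little more than twice the qdisk timeout of interval * tko = 3 * 5 = 15 seconds. Whether this actually changes the behaviour described here is untested.

```xml
<!-- Sketch only: this element and its value are assumptions,
     not part of the posted cluster.conf. -->
<!-- qdisk timeout from the posted config: interval (3 s) * tko (5) = 15 s. -->
<!-- token is in milliseconds; 33000 ms is an illustrative value
     slightly above 2 * 15 s. -->
<totem token="33000"/>
```

If tried, the element would go directly under the cluster element, and config_version would need to be incremented as for any other change to cluster.conf.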

I also remember that the old Kimberlite could be configured with more
than one intracluster LAN. Is something similar possible here?
My components are:

cman-2.0.98-2chrissie (a patched cman after 5U3 because of this:
https://bugzilla.redhat.com/show_bug.cgi?id=485026)
rgmanager-2.0.46-1.el5
openais-0.80.3-22.el5

Any suggestions are welcome.

Thanks Gianluca