[Linux-cluster] fence 'node1' failed if eth0 down

spods at iinet.net.au spods at iinet.net.au
Tue Feb 17 03:56:56 UTC 2009


A couple of things.

You don't have any fencing devices defined in cluster.conf at all.  No power
fencing, no I/O fencing, not even manual fencing.

You need to define how each node of the cluster is to be fenced (forcibly removed
from the cluster) for proper failover operations to occur.
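
For example, a minimal sketch of manual fencing, using the node names from your
cluster.conf below (manual fencing is really only suitable for testing, not
production), would look something like this:

        <fencedevices>
                <fencedevice agent="fence_manual" name="manual"/>
        </fencedevices>

        <clusternode name="node1" nodeid="1" votes="1">
                <fence>
                        <method name="1">
                                <device name="manual" nodename="node1"/>
                        </method>
                </fence>
        </clusternode>

When this fires, fenced waits until you acknowledge on a surviving node (see
fence_ack_manual) that the failed node really is down.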

Secondly, if the only connection shared between the two nodes is the network cord
you just disconnected, then of course nothing will happen - each node has just
lost its only path to the other, including the path it would use to control the
faulty node (i.e. through fencing).

There need to be more connections between the nodes of a cluster than just a
network card.  This can be achieved with a second NIC, I/O fencing, or centralised
or individual power controls (I/O switches or IPMI).

That way, even if the network connection between the two nodes is a single point
of failure, a misbehaving node can still be fenced.
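
If your servers have IPMI (or similar out-of-band management) on a separate
management interface, power fencing could be defined roughly like this - the
addresses and credentials are only placeholders, substitute your own:

        <fencedevices>
                <fencedevice agent="fence_ipmilan" name="ipmi-node1" ipaddr="192.168.2.188" login="admin" passwd="password"/>
                <fencedevice agent="fence_ipmilan" name="ipmi-node2" ipaddr="192.168.2.189" login="admin" passwd="password"/>
        </fencedevices>

and then reference the matching device inside each clusternode's <fence> block:

        <fence>
                <method name="1">
                        <device name="ipmi-node1"/>
                </method>
        </fence>

That gives node2 a way to power node1 off even when eth0 on node1 is dead.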

Once the faulty node is fenced, the remaining nodes should at that point continue
providing cluster services.
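
Once you have added the fence configuration (remember to bump config_version and
propagate the new cluster.conf to both nodes), it is worth testing it deliberately:
for example, running fence_node node1 from node2 should power node1 off (or, with
manual fencing, wait for a fence_ack_manual) before any services are recovered.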

Regards,

Stewart




On Mon Feb 16 16:29, ESGLinux sent:

>Hello All, 
>
>I have a cluster with two nodes running one service (MySQL). The two nodes use an iSCSI disk with GFS on it.
>I haven't configured fencing at all.
>
>I have tested different failure situations and these are my results:
>
>
>If I halt node1 the service relocates to node2 - OK
>If I kill the process on node1 the service relocates to node2 - OK
>
>but
>
>If I unplug the Ethernet cable or run ifdown eth0 on node1, the whole cluster fails. The service doesn't relocate.
>
>On node2 I get these messages:
>
>Feb 15 13:29:34 localhost fenced[3405]: fencing node "192.168.1.188"
>Feb 15 13:29:34 localhost fenced[3405]: fence "192.168.1.188" failed
>Feb 15 13:29:39 localhost fenced[3405]: fencing node "192.168.1.188"
>
>Feb 15 13:29:39 localhost fenced[3405]: fence "192.168.1.188" failed
>
>again and again. Node2 never runs the service, and if I try to reboot node1 the machine hangs waiting for the services to stop.
>
>
>In this situation all I can do is switch off the power of node1 and reboot node2. This is not acceptable at all.
>
>I think the problem is just with fencing but I don't know how to apply it to this situation (I have RTFM on the Red Hat site but I haven't seen how to apply it. :-( )
>
>
>This is my cluster.conf file:
>
><cluster alias="MICLUSTER" config_version="62" name="MICLUSTER">
>        <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>        <clusternodes>
>                <clusternode name="node1" nodeid="1" votes="1">
>                        <fence/>
>                </clusternode>
>                <clusternode name="node2" nodeid="2" votes="1">
>                        <fence/>
>                </clusternode>
>        </clusternodes>
>        <cman expected_votes="1" two_node="1"/>
>        <fencedevices/>
>        <rm>
>                <failoverdomains>
>                        <failoverdomain name="DOMINIOFAIL" nofailback="0" ordered="0" restricted="1">
>                                <failoverdomainnode name="node1" priority="1"/>
>                                <failoverdomainnode name="node2" priority="1"/>
>                        </failoverdomain>
>                </failoverdomains>
>                <resources/>
>                <service domain="DOMINIOFAIL" exclusive="0" name="BBDD" revovery="restart">
>                        <mysql config_file="/etc/my.cnf" listen_address="" mysql_options="" name="mydb" shutdown_wait="3"/>
>                        <ip address="192.168.1.183" monitor_link="1"/>
>                </service>
>        </rm>
></cluster>
>
>Any idea? references?
>
>Thanks in advance
>
>
>Greetings
>
>ESG
>
>
>
>
