[Linux-cluster] pull plug on node, service never relocates

Kit Gerrits kitgerrits at gmail.com
Sat May 15 13:26:49 UTC 2010


 
Hello,
 
You might want to check the syslog to see whether the cluster has noticed the
outage and what it has tried to do about it.
You can also check the node status via 'cman_tool nodes' (the node states are
explained in the cman_tool man page).
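
For example (a rough sketch, assuming the stock RHEL5 cluster tools and the
default syslog location):

    # Did the cluster notice the outage, and what did fenced/rgmanager do about it?
    grep -iE 'fenced|rgmanager|openais' /var/log/messages | tail -n 50

    # Current membership and node states (states are described in the cman_tool man page)
    cman_tool nodes
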
Does the server have another power source, by any chance?
  (If not, make sure you DO have dual power supplies. These things die often.)
 
 
Regards,
 
Kit

  _____  

From: linux-cluster-bounces at redhat.com
[mailto:linux-cluster-bounces at redhat.com] On Behalf Of Dusty
Sent: Friday, 14 May 2010 21:45
To: Linux-cluster at redhat.com
Subject: [Linux-cluster] pull plug on node, service never relocates


Greetings,

Using stock "clustering" and "cluster-storage" from the RHEL5 Update 4 x86_64
ISO.

As an example, using my config below: 

Node1 is running service1, node2 is running service2, and so on; node5 is a
spare, available for the relocation of any failover domain / cluster
service.

If I go into the APC PDU and turn off the electrical port to node1, node2
will fence node1 (going into the APC PDU and doing an off/on on node1's
port). This is fine and works well. When node1 comes back up, it shuts
down service1 and service1 relocates to node5.

Now if I go into the lab and literally pull the plug on node5 while it is
running service1, another node fences node5 via the APC - I can check the APC
PDU log and see that it has done an off/on on node5's electrical port just fine.

But I pulled the plug on node5, so resetting the power doesn't matter. I want
to simulate a completely dead node and have the service relocate in this
case of complete node failure.
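
(For reference, whether the fence actually completed can also be checked from
the cluster's side rather than only in the PDU log; a rough sketch with the
stock cman tools:)

    # fenced's own record of the fence operation
    grep fenced /var/log/messages | tail -n 20

    # State of the fence, dlm and rgmanager groups after the node was killed
    group_tool ls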

In this RHEL5.4 cluster, the service never relocates. I can simulate this on
any node for any service. What if a node's motherboard fries? 

What can I set to make the remaining nodes stop waiting for the reboot of a
failed node and just go ahead and relocate the cluster service that had been
running on the now-failed node?
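
(While the service is stuck, the following might show where the cluster is
waiting; a sketch using the same stock tools, nothing here is specific to
this config:)

    # Is the cluster still quorate after losing the node?
    cman_tool status

    # What does rgmanager think the state of the service is?
    clustat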

Thank you!

versions:

cman-2.0.115-1.el5
openais-0.80.6-8.el5
modcluster-0.12.1-2.el5
lvm2-cluster-2.02.46-8.el5
rgmanager-2.0.52-1.el5
ricci-0.12.2-6.el5

cluster.conf (sanitized, real scripts removed, all gfs2 mounts gone for
clarity):
<?xml version="1.0"?>
<cluster config_version="1" name="alderaanDefenseShieldRebelAllianceCluster">
    <fence_daemon clean_start="0" post_fail_delay="3" post_join_delay="60"/>
    <clusternodes>
        <clusternode name="192.168.1.1" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="apc_pdu" port="1" switch="1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.2" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="apc_pdu" port="2" switch="1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.3" nodeid="3" votes="1">
            <fence>
                <method name="1">
                    <device name="apc_pdu" port="3" switch="1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.4" nodeid="4" votes="1">
            <fence>
                <method name="1">
                    <device name="apc_pdu" port="4" switch="1"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="192.168.1.5" nodeid="5" votes="1">
            <fence>
                <method name="1">
                    <device name="apc_pdu" port="5" switch="1"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="6"/>
    <fencedevices>
        <fencedevice agent="fence_apc" ipaddr="192.168.1.20" login="device" name="apc_pdu" passwd="wonderwomanWasAPrettyCoolSuperhero"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="fd1" nofailback="0" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.1" priority="1"/>
                <failoverdomainnode name="192.168.1.2" priority="2"/>
                <failoverdomainnode name="192.168.1.3" priority="3"/>
                <failoverdomainnode name="192.168.1.4" priority="4"/>
                <failoverdomainnode name="192.168.1.5" priority="5"/>
            </failoverdomain>
            <failoverdomain name="fd2" nofailback="0" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.1" priority="5"/>
                <failoverdomainnode name="192.168.1.2" priority="1"/>
                <failoverdomainnode name="192.168.1.3" priority="2"/>
                <failoverdomainnode name="192.168.1.4" priority="3"/>
                <failoverdomainnode name="192.168.1.5" priority="4"/>
            </failoverdomain>
            <failoverdomain name="fd3" nofailback="0" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.1" priority="4"/>
                <failoverdomainnode name="192.168.1.2" priority="5"/>
                <failoverdomainnode name="192.168.1.3" priority="1"/>
                <failoverdomainnode name="192.168.1.4" priority="2"/>
                <failoverdomainnode name="192.168.1.5" priority="3"/>
            </failoverdomain>
            <failoverdomain name="fd4" nofailback="0" ordered="1" restricted="1">
                <failoverdomainnode name="192.168.1.1" priority="3"/>
                <failoverdomainnode name="192.168.1.2" priority="4"/>
                <failoverdomainnode name="192.168.1.3" priority="5"/>
                <failoverdomainnode name="192.168.1.4" priority="1"/>
                <failoverdomainnode name="192.168.1.5" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources>
            <ip address="10.1.1.1" monitor_link="1"/>
            <ip address="10.1.1.2" monitor_link="1"/>
            <ip address="10.1.1.3" monitor_link="1"/>
            <ip address="10.1.1.4" monitor_link="1"/>
            <ip address="10.1.1.5" monitor_link="1"/>
            <script file="/usr/local/bin/service1" name="service1"/>
            <script file="/usr/local/bin/service2" name="service2"/>
            <script file="/usr/local/bin/service3" name="service3"/>
            <script file="/usr/local/bin/service4" name="service4"/>
        </resources>
        <service autostart="1" domain="fd1" exclusive="1" name="service1" recovery="relocate">
            <ip ref="10.1.1.1"/>
            <script ref="service1"/>
        </service>
        <service autostart="1" domain="fd2" exclusive="1" name="service2" recovery="relocate">
            <ip ref="10.1.1.2"/>
            <script ref="service2"/>
        </service>
        <service autostart="1" domain="fd3" exclusive="1" name="service3" recovery="relocate">
            <ip ref="10.1.1.3"/>
            <script ref="service3"/>
        </service>
        <service autostart="1" domain="fd4" exclusive="1" name="service4" recovery="relocate">
            <ip ref="10.1.1.4"/>
            <script ref="service4"/>
        </service>
    </rm>
</cluster>
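
(Side note: a config like this can be sanity-checked and propagated with the
stock RHEL5 tools; a rough sketch, assuming the file lives at the default
/etc/cluster/cluster.conf:)

    # Dry-run the <rm> resource tree without touching the running services
    rg_test test /etc/cluster/cluster.conf

    # After bumping config_version in the file, push it out to the other nodes
    ccs_tool update /etc/cluster/cluster.conf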


