[Linux-cluster] RHEL5 failover domain

Dariusz Skorupa d.skorupa at wasko.pl
Mon Nov 19 12:13:07 UTC 2007


    I've found a problem with a RHEL5 cluster. When I use a prioritized 
(ordered) failover domain and then reset the node which has priority 1, 
the cluster relocates the service to the node with priority 2. Then, when 
the priority-1 node (l1.local) comes back, the cluster tries to relocate 
the service back to it. In the log file I always find:

Nov 19 12:32:26 l2 openais[1977]: [TOTEM] entering OPERATIONAL state.
Nov 19 12:32:26 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.10
Nov 19 12:32:26 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.11
Nov 19 12:32:26 l2 openais[1977]: [CPG  ] got joinlist message from node 1
Nov 19 12:32:26 l2 clurgmgrd[2687]: <notice> Stopping service service:vsftpd
Nov 19 12:32:41 l2 clurgmgrd[2687]: <err> #52: Failed changing RG status
Nov 19 12:32:56 l2 clurgmgrd[2687]: <err> #57: Failed changing RG status
Nov 19 12:32:57 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd status

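   I tested this many times, and in this case clurgmgrd does not even try 
to run the script with the stop parameter. When I relocate the service 
manually using clusvcadm, or when both nodes have priority 1, everything 
works. It also works if I restart the node using reboot. I think the 
automatic relocation (after a crash) starts too early; in my opinion the 
cluster does not wait for rgmanager to start on the rejoining node. In the 
full log below, l2 begins stopping the service at 12:32:26, right after 
l1.local rejoins, while the DLM connection to node 2 is only established 
at 12:33:20.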
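For comparison, the working variant with both nodes at priority 1 differs 
from my cluster.conf below only in the failoverdomain element. A sketch of 
that element, assuming everything else in the file stays unchanged:

```xml
<failoverdomain name="OBN" ordered="1" restricted="0">
    <!-- equal priorities: neither member is ranked above the other -->
    <failoverdomainnode name="l1.local" priority="1"/>
    <failoverdomainnode name="l2.local" priority="1"/>
</failoverdomain>
```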

I have tried these packages:
    rgmanager-2.0.23-1 or rgmanager-2.0.28-1.el5
    cman-2.0.73-1.el5_1.1, cman-2.0.64 and earlier from RHEL5 CD
    openais-0.80.3-7.el5 and earlier from RHEL5 CD


My cluster.conf file:

<?xml version="1.0"?>
<cluster alias="OBN_HA" config_version="26" name="OBN_HA">
    <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
    <clusternodes>
        <clusternode name="l2.local" nodeid="1" votes="1">
            <fence>
                <method name="1">
                    <device name="l2_fence" nodename="l2.local"/>
                </method>
            </fence>
        </clusternode>
        <clusternode name="l1.local" nodeid="2" votes="1">
            <fence>
                <method name="1">
                    <device name="l1_fence" nodename="l1.local"/>
                </method>
            </fence>
        </clusternode>
    </clusternodes>
    <cman expected_votes="1" two_node="1"/>
    <fencedevices>
        <fencedevice agent="fence_manual" name="l1_fence"/>
        <fencedevice agent="fence_manual" name="l2_fence"/>
    </fencedevices>
    <rm>
        <failoverdomains>
            <failoverdomain name="OBN" ordered="1" restricted="0">
                <failoverdomainnode name="l1.local" priority="1"/>
                <failoverdomainnode name="l2.local" priority="2"/>
            </failoverdomain>
        </failoverdomains>
        <resources/>
        <service autostart="1" domain="OBN" name="vsftpd" recovery="relocate">
            <script file="/etc/init.d/vsftpd" name="vsftpd"/>
        </service>
    </rm>
</cluster>

Full /var/log/messages log from l2.local:
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] Sending initial ORF token
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:27:39 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:27:39 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:27:39 l2 openais[1977]: [SYNC ] This node is within the primary component and will provide service.
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] entering OPERATIONAL state.
Nov 19 12:27:39 l2 openais[1977]: [CMAN ] quorum regained, resuming activity
Nov 19 12:27:39 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.11
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] entering GATHER state from 11.
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] Saving state aru 9 high seq received 9
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] Storing new sequence id for ring e4
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] entering COMMIT state.
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] entering RECOVERY state.
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] position [0] member 192.168.10.10:
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] previous ring seq 224 rep 192.168.10.10
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] aru 1a high delivered 1a received flag 1
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] position [1] member 192.168.10.11:
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] previous ring seq 224 rep 192.168.10.11
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] aru 9 high delivered 9 received flag 1
Nov 19 12:27:39 l2 openais[1977]: [TOTEM] Did not need to originate any messages in recovery.
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:27:40 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:27:40 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.10) 
Nov 19 12:27:40 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:27:40 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.10) 
Nov 19 12:27:40 l2 openais[1977]: [SYNC ] This node is within the primary component and will provide service.
Nov 19 12:27:40 l2 openais[1977]: [TOTEM] entering OPERATIONAL state.
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.10
Nov 19 12:27:40 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.11
Nov 19 12:27:40 l2 openais[1977]: [CPG  ] got joinlist message from node 2
Nov 19 12:27:40 l2 ccsd[1941]: Initial status:: Quorate
[...]
Nov 19 12:28:37 l2 kernel: dlm: Using TCP for communications
Nov 19 12:28:37 l2 kernel: dlm: connecting to 2
Nov 19 12:28:38 l2 clurgmgrd[2687]: <notice> Resource Group Manager Starting
Nov 19 12:28:38 l2 kernel: dlm: got connection from 2
[...]
Nov 19 12:28:47 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd stop
Nov 19 12:28:47 l2 vsftpd: script param: stop
Nov 19 12:30:27 l2 openais[1977]: [TOTEM] The token was lost in the OPERATIONAL state.
Nov 19 12:30:27 l2 openais[1977]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes).
Nov 19 12:30:27 l2 openais[1977]: [TOTEM] Transmit multicast socket send buffer size (219136 bytes).
Nov 19 12:30:27 l2 openais[1977]: [TOTEM] entering GATHER state from 2.
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] entering GATHER state from 0.
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] Creating commit token because I am the rep.
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] Saving state aru 28 high seq received 28
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] Storing new sequence id for ring e8
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] entering COMMIT state.
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] entering RECOVERY state.
Nov 19 12:30:32 l2 fenced[1993]: l1.local not a cluster member after 0 sec post_fail_delay
Nov 19 12:30:32 l2 kernel: dlm: closing connection to node 2
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] position [0] member 192.168.10.11:
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] previous ring seq 228 rep 192.168.10.10
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] aru 28 high delivered 28 received flag 1
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] Did not need to originate any messages in recovery.
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] Sending initial ORF token
Nov 19 12:30:32 l2 fenced[1993]: fencing node "l1.local"
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:30:32 l2 fence_manual: Node l1.local needs to be reset before recovery can procede.  Waiting for l1.local to rejoin the cluster or for manual acknowledgement that it has been reset (i.e. fence_ack_manual -n l1.local)
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:30:32 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:30:32 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.10) 
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:30:32 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:30:32 l2 openais[1977]: [SYNC ] This node is within the primary component and will provide service.
Nov 19 12:30:32 l2 openais[1977]: [TOTEM] entering OPERATIONAL state.
Nov 19 12:30:32 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.11
Nov 19 12:30:32 l2 openais[1977]: [CPG  ] got joinlist message from node 1
Nov 19 12:30:52 l2 fenced[1993]: fence "l1.local" success
Nov 19 12:30:58 l2 clurgmgrd[2687]: <notice> Taking over service service:vsftpd from down member l1.local
Nov 19 12:30:58 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd start
Nov 19 12:30:58 l2 vsftpd: script param: start
Nov 19 12:30:59 l2 clurgmgrd[2687]: <notice> Service service:vsftpd started
Nov 19 12:31:07 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd status
Nov 19 12:31:07 l2 vsftpd: script param: status
Nov 19 12:31:37 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd status
Nov 19 12:31:37 l2 vsftpd: script param: status
Nov 19 12:32:07 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd status
Nov 19 12:32:07 l2 vsftpd: script param: status
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] entering GATHER state from 11.
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] Saving state aru 18 high seq received 18
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] Storing new sequence id for ring ec
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] entering COMMIT state.
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] entering RECOVERY state.
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] position [0] member 192.168.10.10:
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] previous ring seq 232 rep 192.168.10.10
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] aru 9 high delivered 8 received flag 1
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] position [1] member 192.168.10.11:
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] previous ring seq 232 rep 192.168.10.11
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] aru 18 high delivered 18 received flag 1
Nov 19 12:32:25 l2 openais[1977]: [TOTEM] Did not need to originate any messages in recovery.
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:32:25 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] CLM CONFIGURATION CHANGE
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] New Configuration:
Nov 19 12:32:25 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.10) 
Nov 19 12:32:25 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.11) 
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] Members Left:
Nov 19 12:32:25 l2 openais[1977]: [CLM  ] Members Joined:
Nov 19 12:32:25 l2 openais[1977]: [CLM  ]     r(0) ip(192.168.10.10) 
Nov 19 12:32:26 l2 openais[1977]: [SYNC ] This node is within the primary component and will provide service.
Nov 19 12:32:26 l2 openais[1977]: [TOTEM] entering OPERATIONAL state.
Nov 19 12:32:26 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.10
Nov 19 12:32:26 l2 openais[1977]: [CLM  ] got nodejoin message 192.168.10.11
Nov 19 12:32:26 l2 openais[1977]: [CPG  ] got joinlist message from node 1
Nov 19 12:32:26 l2 clurgmgrd[2687]: <notice> Stopping service service:vsftpd
Nov 19 12:32:41 l2 clurgmgrd[2687]: <err> #52: Failed changing RG status
Nov 19 12:32:56 l2 clurgmgrd[2687]: <err> #57: Failed changing RG status
Nov 19 12:32:57 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd status
Nov 19 12:32:57 l2 vsftpd: script param: status
Nov 19 12:33:20 l2 kernel: dlm: connecting to 2
Nov 19 12:33:20 l2 kernel: dlm: got connection from 2
Nov 19 12:33:36 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd status
Nov 19 12:33:36 l2 vsftpd: script param: status
Nov 19 12:34:06 l2 clurgmgrd: [2687]: <info> Executing /etc/init.d/vsftpd status

daro
