
[Linux-cluster] Service Recovery Failure



Hi all,

 

I just performed a failover test which failed miserably. I have two nodes, Node-1 and Node-2.

The global file system /gfs is on Node-1.

Two HA services are running on Node-1. If I unplug the network cables from Node-1, those two services should transfer to Node-2, but Node-2 did not take over the services.

However, if I do a proper shutdown/reboot of Node-1, the two services transfer to Node-2 without a problem.

Please help!

 

clustat from Node-2 before unplugging the cable from Node-1:

 

[root Node-2 ~]# clustat
Member Status: Quorate

  Member Name          ID    Status
  ------ ----          ----  ------
  Node-1               1     Online, rgmanager
  Node-2               2     Online, Local, rgmanager

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  service:nfs          Node-1               started
  service:ESS_HA       Node-1               started

 

clustat from Node-2 after unplugging the cable from Node-1:

 

[root Node-2 ~]# clustat
Member Status: Quorate

  Member Name          ID    Status
  ------ ----          ----  ------
  Node-1               1     Offline
  Node-2               2     Online, Local, rgmanager

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  service:nfs          Node-1               started
  service:ESS_HA       Node-1               started
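
The symptom in the second clustat output is that both services are still reported as started on a member that is Offline. As a rough sketch only, the check below picks that condition out of pasted clustat text. The function name `stranded_services` and the assumed column order (name, owner, state) are illustrative assumptions, not part of clustat itself.

```python
SAMPLE = """\
Member Status: Quorate

  Member Name          ID    Status
  ------ ----          ----  ------
  Node-1               1     Offline
  Node-2               2     Online, Local, rgmanager

  Service Name         Owner (Last)         State
  ------- ----         ----- ------         -----
  service:nfs          Node-1               started
  service:ESS_HA       Node-1               started
"""


def stranded_services(clustat_text):
    """Return (service, owner) pairs still 'started' on an Offline member."""
    offline = set()
    stranded = []
    section = None  # which table we are currently reading
    for line in clustat_text.splitlines():
        fields = line.split()
        # Skip blank lines and the dashed separator rows.
        if not fields or fields[0].startswith("-"):
            continue
        if fields[:2] == ["Member", "Name"]:
            section = "members"
        elif fields[:2] == ["Service", "Name"]:
            section = "services"
        elif section == "members" and len(fields) >= 3:
            # Columns: name, id, status (status may be a comma-separated list).
            if fields[2].rstrip(",") == "Offline":
                offline.add(fields[0])
        elif section == "services" and len(fields) >= 3:
            name, owner, state = fields[0], fields[1], fields[2]
            if owner in offline and state == "started":
                stranded.append((name, owner))
    return stranded


print(stranded_services(SAMPLE))
# [('service:nfs', 'Node-1'), ('service:ESS_HA', 'Node-1')]
```

Run against the first clustat output (both members Online), the same function returns an empty list.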

 

 
/etc/cluster/cluster.conf:

 

[root Node-2 ~]# cat /etc/cluster/cluster.conf
<?xml version="1.0"?>
<cluster config_version="54" name="idm_cluster">
        <fence_daemon post_fail_delay="0" post_join_delay="120"/>
        <clusternodes>
                <clusternode name="Node-1" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="Node-2" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1"/>
        <fencedevices/>
        <rm>
                <failoverdomains>
                        <failoverdomain name="nfs" ordered="0" restricted="1">
                                <failoverdomainnode name="Node-1" priority="1"/>
                                <failoverdomainnode name="Node-2" priority="1"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <clusterfs device="/dev/vg00/mygfs" force_unmount="0" fsid="59408" fstype="gfs" mountpoint="/gfs" name="gfs" options=""/>
                        <ip address="10.128.107.229" monitor_link="1"/>
                        <script file="/gfs/ess_clus/HA/clusTest.sh" name="ESS_HA_test"/>
                        <script file="/gfs/clusTest.sh" name="Clus_Test"/>
                </resources>
                <service autostart="1" name="nfs">
                        <clusterfs ref="gfs"/>
                        <ip ref="10.128.107.229"/>
                </service>
                <service autostart="1" domain="nfs" name="ESS_HA" recovery="restart">
                        <script ref="ESS_HA_test"/>
                        <clusterfs ref="gfs"/>
                        <ip ref="10.128.107.229"/>
                </service>
        </rm>
</cluster>
[root Node-2 ~]#
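
In the configuration above, both `<fence/>` blocks and `<fencedevices/>` are empty. A minimal sketch for listing nodes that have no usable fence device follows. It assumes the usual cluster.conf layout (`fence/method/device` under each `clusternode`, `fencedevice` entries under `fencedevices`); the function name `nodes_without_fencing` is hypothetical, and the sample is a trimmed copy of the file above.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf above: only the fencing-related parts.
SAMPLE_CONF = """\
<cluster config_version="54" name="idm_cluster">
        <clusternodes>
                <clusternode name="Node-1" nodeid="1" votes="1">
                        <fence/>
                </clusternode>
                <clusternode name="Node-2" nodeid="2" votes="1">
                        <fence/>
                </clusternode>
        </clusternodes>
        <fencedevices/>
</cluster>
"""


def nodes_without_fencing(conf_xml):
    """Return names of cluster nodes that reference no defined fence device."""
    root = ET.fromstring(conf_xml)
    # Fence devices declared at cluster level.
    defined = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    missing = []
    for node in root.findall("./clusternodes/clusternode"):
        # Devices this node's fence methods point at.
        used = {d.get("name") for d in node.findall("./fence/method/device")}
        if not (used & defined):
            missing.append(node.get("name"))
    return missing


print(nodes_without_fencing(SAMPLE_CONF))
# ['Node-1', 'Node-2']
```

To run it against the real file, read `/etc/cluster/cluster.conf` as bytes and pass the contents to the function.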

 

Node-2: tail -f /var/log/messages

 

Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 29 18:20:49 vm-idm02 fenced[1706]: vm-idm01 not a cluster member after 0 sec post_fail_delay
Jun 29 18:20:49 vm-idm02 kernel: dlm: closing connection to node 1
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] New Configuration:
Jun 29 18:20:49 vm-idm02 fenced[1706]: fencing node "vm-idm01"
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ]         r(0) ip(10.128.107.224)
Jun 29 18:20:49 vm-idm02 fenced[1706]: fence "vm-idm01" failed
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] Members Left:
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ]         r(0) ip(10.128.107.223)
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] Members Joined:
Jun 29 18:20:49 vm-idm02 openais[1690]: [SYNC ] This node is within the primary component and will provide service.
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] CLM CONFIGURATION CHANGE
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] New Configuration:
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ]         r(0) ip(10.128.107.224)
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] Members Left:
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] Members Joined:
Jun 29 18:20:49 vm-idm02 openais[1690]: [SYNC ] This node is within the primary component and will provide service.
Jun 29 18:20:49 vm-idm02 openais[1690]: [TOTEM] entering OPERATIONAL state.
Jun 29 18:20:49 vm-idm02 openais[1690]: [CLM  ] got nodejoin message 10.128.107.224
Jun 29 18:20:49 vm-idm02 openais[1690]: [CPG  ] got joinlist message from node 2
Jun 29 18:20:54 vm-idm02 fenced[1706]: fencing node "Node-1"
Jun 29 18:20:54 vm-idm02 fenced[1706]: fence "Node-1" failed
Jun 29 18:20:59 vm-idm02 fenced[1706]: fencing node "Node-1"
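
The fenced lines in the log above show fencing of the failed node being attempted and failing. A small sketch for counting those failures per node in a longer log follows. The regular expression is an assumption based only on the message format seen in this excerpt, and the name `fence_failures` is hypothetical.

```python
import re
from collections import Counter

# Matches lines such as: fenced[1706]: fence "Node-1" failed
FENCE_FAILED = re.compile(r'fenced\[\d+\]: fence "([^"]+)" failed')

SAMPLE_LOG = """\
Jun 29 18:20:49 vm-idm02 fenced[1706]: fencing node "vm-idm01"
Jun 29 18:20:49 vm-idm02 fenced[1706]: fence "vm-idm01" failed
Jun 29 18:20:54 vm-idm02 fenced[1706]: fencing node "Node-1"
Jun 29 18:20:54 vm-idm02 fenced[1706]: fence "Node-1" failed
Jun 29 18:20:59 vm-idm02 fenced[1706]: fencing node "Node-1"
"""


def fence_failures(log_text):
    """Count 'fence "<node>" failed' messages per node name."""
    return Counter(FENCE_FAILED.findall(log_text))


for node, count in fence_failures(SAMPLE_LOG).items():
    print(node, count)
```

Lines that only announce a fencing attempt (`fencing node "..."`) are deliberately not counted, so the result reflects failures only.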

 

Regards,

Rahul

