
[Linux-cluster] Missed too many heartbeats




OS: RHEL4 Update 4
Kernel: 2.6.9-42.ELsmp
Cluster: RHCS4 Update 4, RHGFS4 Update 4 (GFS-6.1.6-1)
Multipath: EMCpower.LINUX-4.5.1-022
Storage: Fibre channel with EMC CX-320
Fence Device: DELL DRAC5
Service: Postfix, Courier-imap

nodeA.example.com: 192.168.0.20
nodeB.example.com: 192.168.0.60

Drac5(nodeA): 192.168.0.121
Drac5(nodeB): 192.168.0.161


I have a 2-node GFS cluster; both nodes use PowerPath and connect over Fibre Channel to the EMC CX-320 storage.
Both nodes use DRAC5 as the fence device.
Heartbeat traffic uses the same interface as normal traffic (mail, IMAP/POP3).

The problem is that only nodeB fences nodeA, always with the reason "Missed too many heartbeats".

After nodeA is rebooted, it rejoins the cluster and works fine until nodeB fences it again, maybe 4-5 or 6-7 hours later.

This happens at random, 2-3 times per day.
Memory, CPU and I/O look fine, and traffic is not at its peak when the problem occurs (according to sar and MRTG).
ifconfig shows no drops and no collisions.

The log file shows the same messages every time nodeB fences nodeA.
I tried to extend the heartbeat timeout by changing "deadnode_timeout" from 21 to 61, but it doesn't help.

Is there any way to solve this problem or to enable more debugging?
Do I have to dedicate a network card to separate heartbeat traffic from normal traffic?
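
For context on the deadnode_timeout change, its effect can be estimated with a little arithmetic. The sketch below assumes a heartbeat is sent every 5 seconds; hello_interval is a hypothetical name for that period, and the 5-second value is an assumption, not something read from these nodes.

```python
# Rough estimate of how many consecutive heartbeats must be lost
# before a node is declared dead.
# ASSUMPTION: hello_interval=5 (seconds) is an assumed heartbeat period;
# it does not come from the configuration shown in this post.

def missed_heartbeats(deadnode_timeout, hello_interval=5):
    """Return how many whole heartbeat intervals fit inside the timeout."""
    return deadnode_timeout // hello_interval


if __name__ == "__main__":
    for timeout in (21, 61):
        print(timeout, "seconds ->", missed_heartbeats(timeout), "heartbeats")
```

If the assumption holds and the new value is actually in effect, nodeA would have to stay silent for about a minute before being removed, which would be hard to explain by a short burst of mail traffic alone.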



###### /var/log/messages
Aug  7 21:50:06 nodeB kernel: CMAN: removing node nodeA.example.com from the cluster : Missed too many heartbeats
Aug  7 21:50:06 nodeB fenced[20770]: nodeA.example.com not a cluster member after 0 sec post_fail_delay
Aug  7 21:50:06 nodeB fenced[20770]: fencing node "nodeA.example.com"
Aug  7 21:50:15 nodeB fenced[20770]: fence "nodeA.example.com" success
Aug  7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Trying to acquire journal lock...
Aug  7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Looking at journal...
Aug  7 21:50:22 nodeB kernel: GFS: fsid=bkkair_cluster:gfs01.1: jid=0: Done
Aug  7 21:53:36 nodeB kernel: CMAN: node nodeA.example.com rejoining
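
To compare incidents, the timing in the log excerpt above can be extracted with a small script. This is only a sketch: the three sample lines are copied from the excerpt, the marker strings and function names are illustrative choices, and since syslog timestamps carry no year, only differences within one day are meaningful.

```python
from datetime import datetime

# Three lines copied from the log excerpt in this post.
LOG = """\
Aug  7 21:50:06 nodeB kernel: CMAN: removing node nodeA.example.com from the cluster : Missed too many heartbeats
Aug  7 21:50:15 nodeB fenced[20770]: fence "nodeA.example.com" success
Aug  7 21:53:36 nodeB kernel: CMAN: node nodeA.example.com rejoining
"""

# Substrings used to recognise each event (illustrative, not exhaustive).
MARKERS = {
    "removed": "removing node",
    "fenced": '" success',
    "rejoined": "rejoining",
}


def parse_timestamp(line):
    """Parse the 'Mon DD HH:MM:SS' prefix; the year defaults to 1900."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(stamp, "%b %d %H:%M:%S")


def event_times(log_text):
    """Return the first timestamp seen for each marker in MARKERS."""
    times = {}
    for line in log_text.splitlines():
        for label, marker in MARKERS.items():
            if marker in line and label not in times:
                times[label] = parse_timestamp(line)
    return times


if __name__ == "__main__":
    t = event_times(LOG)
    print("fence took", (t["fenced"] - t["removed"]).total_seconds(), "s")
    print("rejoin after", (t["rejoined"] - t["removed"]).total_seconds(), "s")
```

Running the same extraction over every incident in /var/log/messages would show whether the removals line up with a recurring time of day or a scheduled job.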


###### /etc/cluster/cluster.conf ################

<?xml version="1.0" ?>
<cluster config_version="7" name="bkkair_cluster">
       <fence_daemon post_fail_delay="0" post_join_delay="15"/>
       <clusternodes>
               <clusternode name="nodeA.example.com" votes="1">
                       <fence>
                               <method name="1">
                                       <device modulename="" name="DRAC-nodeA"/>
                               </method>
                       </fence>
               </clusternode>
               <clusternode name="nodeB.example.com" votes="1">
                       <fence>
                               <method name="1">
                                       <device modulename="" name="DRAC-nodeB"/>
                               </method>
                       </fence>
               </clusternode>
       </clusternodes>
       <cman expected_votes="1" two_node="1"/>
        <cman deadnode_timeout="61"/>
       <fencedevices>
               <fencedevice agent="fence_drac" ipaddr="192.168.0.121" login="root" name="DRAC-nodeA" passwd="supervis"/>
               <fencedevice agent="fence_drac" ipaddr="192.168.0.161" login="root" name="DRAC-nodeB" passwd="supervis"/>
       </fencedevices>
       <rm>
               <failoverdomains/>
               <resources/>
       </rm>
</cluster>


#####################################################
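
The configuration above declares two separate cman elements. Whether the cluster software reads attributes from both of them is not confirmed here, so it may be worth checking what the file actually declares. A minimal sketch using Python's standard xml.etree.ElementTree module follows; SAMPLE is a trimmed copy of the relevant lines and cman_attributes is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cman lines from the cluster.conf in this post.
SAMPLE = """<cluster config_version="7" name="bkkair_cluster">
  <cman expected_votes="1" two_node="1"/>
  <cman deadnode_timeout="61"/>
</cluster>"""


def cman_attributes(xml_text):
    """Return one attribute dict per <cman> element, in document order."""
    root = ET.fromstring(xml_text)
    return [dict(element.attrib) for element in root.iter("cman")]


if __name__ == "__main__":
    found = cman_attributes(SAMPLE)
    print(len(found), "cman element(s) found")
    for attrs in found:
        print(attrs)
```

If more than one element is reported, comparing the running value of the timeout on each node against the file would show whether the second element is being honoured.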


Regards,
Nattapon


