
[Linux-cluster] RHEL5 GFS2 - 2 node - node fenced when writing



Hello,

I installed RHEL5 on a new two-node cluster with shared FC storage. The two shared storage boxes are each split into two 6.9 TB LUNs, for a total of four 6.9 TB LUNs. Each machine is connected to a switch by a single 100 Mb connection and to an FC switch by a single FC connection.

The four LUNs have LVM on them with GFS2 on top, and the file systems are mountable from either box. When a script uses dd to write zeros in 250 MB files from each box, each box writing to a different LUN, one of the nodes is fenced by the other. File size does not seem to matter.
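The test script itself is not included here, so for anyone trying to reproduce the load, this is a rough Python stand-in. The file naming, the 1 MiB write size and the fsync after each file are assumptions, not the original script; it just writes N zero-filled files of 250 MiB each (roughly `dd if=/dev/zero of=FILE bs=1M count=250` per file) into one directory, which would be a mount point on a different LUN on each node.

```python
import os
import sys

MIB = 1024 * 1024


def write_zero_files(directory, count, size_mib=250):
    """Write `count` zero-filled files of `size_mib` MiB each into `directory`.

    Hypothetical stand-in for the dd test script; returns the paths written.
    """
    block = bytes(MIB)  # one MiB of zero bytes, reused for every write
    paths = []
    for i in range(count):
        path = os.path.join(directory, f"zero_{i:04d}.bin")
        with open(path, "wb") as f:
            for _ in range(size_mib):
                f.write(block)
            # Force the data out of the page cache so the load actually hits the shared LUN.
            f.flush()
            os.fsync(f.fileno())
        paths.append(path)
    return paths


if __name__ == "__main__" and len(sys.argv) == 3:
    # Usage (hypothetical mount point): python zero_writer.py /mnt/lun1 20
    target_dir, n_files = sys.argv[1], int(sys.argv[2])
    for p in write_zero_files(target_dir, n_files):
        print(p)
```

Running one copy per node, each against its own GFS2 mount, mirrors the "each box to a different LUN" setup described above.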

My first guess was the heartbeat timeout in openais, so I added the totem line to the cluster.conf below in the hope of raising the token timeout to 10 seconds (the token value is in milliseconds). However, this did not resolve the problem. Both boxes are running the latest updates, applied with up2date two days ago.
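A larger token only helps if cman actually picks it up, so it is worth confirming where the `<totem>` element sits and what value it carries. A minimal read-only sketch using Python's standard `xml.etree.ElementTree` (the function name `totem_settings` is just illustrative):

```python
import xml.etree.ElementTree as ET


def totem_settings(conf_text):
    """Return (parent_tag, token_seconds) for every <totem token="..."> in a cluster.conf string."""
    # strip() so leading blank lines do not push the <?xml ...?> declaration off the first line.
    root = ET.fromstring(conf_text.strip())
    results = []
    # ElementTree keeps no parent pointers, so walk every element and inspect its direct children.
    for parent in root.iter():
        for child in parent:
            if child.tag == "totem" and "token" in child.attrib:
                # The totem token attribute is given in milliseconds.
                results.append((parent.tag, int(child.attrib["token"]) / 1000))
    return results


if __name__ == "__main__":
    # Trimmed copy of the <cman> section from the cluster.conf in this post.
    SAMPLE = """<?xml version="1.0"?>
<cluster alias="storage1" config_version="4" name="storage1">
        <cman expected_votes="1" two_node="1">
                <multicast addr="224.10.10.10"/>
                <totem token="10000"/>
        </cman>
</cluster>"""
    for parent, seconds in totem_settings(SAMPLE):
        print(f"<totem> found under <{parent}>: token = {seconds:g} s")
    # On a node, pass open("/etc/cluster/cluster.conf").read() instead of SAMPLE.
```

For the configuration posted here it reports a single `<totem>` under `<cman>` with a 10 s token, which is worth comparing with where the RHEL5 cluster.conf documentation places that element.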

Below is the cluster.conf and what is seen in the logs.  Any suggestions would be greatly appreciated.

Thanks!

Neal



##########################################

Cluster.conf

##########################################


<?xml version="1.0"?>
<cluster alias="storage1" config_version="4" name="storage1">
        <fence_daemon post_fail_delay="0" post_join_delay="3"/>
        <clusternodes>
                <clusternode name="fu1" nodeid="1" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="apc4" port="1" switch="1"/>
                                </method>
                        </fence>
                        <multicast addr="224.10.10.10" interface="eth0"/>
                </clusternode>
                <clusternode name="fu2" nodeid="2" votes="1">
                        <fence>
                                <method name="1">
                                        <device name="apc4" port="2" switch="1"/>
                                </method>
                        </fence>
                        <multicast addr="224.10.10.10" interface="eth0"/>
                </clusternode>
        </clusternodes>
        <cman expected_votes="1" two_node="1">
                <multicast addr="224.10.10.10"/>
                <totem token="10000"/>
        </cman>
        <fencedevices>
                <fencedevice agent="fence_apc" ipaddr="192.168.14.193" login="apc" name="apc4" passwd="apc"/>
        </fencedevices>
        <rm>
                <failoverdomains/>
                <resources/>
        </rm>
</cluster>


#####################################################

/var/log/messages

#####################################################

Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Receive multicast socket recv buffer size (262142 bytes).
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] entering GATHER state from 2.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering GATHER state from 0.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Creating commit token because I am the rep.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Saving state aru 6e high seq received 6e
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering COMMIT state.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering RECOVERY state.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] position [0] member 192.168.14.195:
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] previous ring seq 16 rep 192.168.14.195
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] aru 6e high delivered 6e received flag 0
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Did not need to originate any messages in recovery.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Storing new sequence id for ring 14
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] Sending initial ORF token
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
Jun  5 20:19:34 fu1 kernel: dlm: closing connection to node 2
Jun  5 20:19:34 fu1 fenced[5367]: fu2 not a cluster member after 0 sec post_fail_delay
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.197)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] CLM CONFIGURATION CHANGE
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] New Configuration:
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ]      r(0) ip(192.168.14.195)
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Left:
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] Members Joined:
Jun  5 20:19:34 fu1 openais[5351]: [SYNC ] This node is within the primary component and will provide service.
Jun  5 20:19:34 fu1 openais[5351]: [TOTEM] entering OPERATIONAL state.
Jun  5 20:19:34 fu1 openais[5351]: [CLM  ] got nodejoin message 192.168.14.195
Jun  5 20:19:34 fu1 openais[5351]: [CPG  ] got joinlist message from node 1
Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Trying to acquire journal lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Looking at journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Replayed 0 of 0 blocks
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Found 0 revoke tags
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:003vg_gfs.0: jid=1: Done
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Replayed 0 of 0 blocks
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Found 0 revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:002vg_gfs.0: jid=1: Done
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Acquiring the transaction lock...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Replayed 222 of 223 blocks
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Found 1 revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:004vg_gfs.0: jid=1: Done
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replaying journal...
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Replayed 438 of 439 blocks
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Found 1 revoke tags
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Journal replayed in 1s
Jun  5 20:19:42 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Done
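To make the sequence in the log easier to follow, here is a small sketch that picks out the first occurrence of a few milestones and their offsets in seconds. The milestone substrings are taken from the openais, fenced and GFS2 lines above; since syslog timestamps carry no year, it assumes all lines fall on the same day.

```python
import re

# Matches syslog lines such as:
#   Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the OPERATIONAL state.
LINE_RE = re.compile(r"^\w{3}\s+\d+\s+(\d{2}):(\d{2}):(\d{2})\s+(\S+)\s+(\S+?):\s+(.*)$")

# (label, substring) pairs; each substring appears in one of the messages in this log.
MILESTONES = (
    ("token lost", "The token was lost"),
    ("fencing started", "fencing node"),
    ("fence succeeded", '" success'),
    ("GFS2 journal recovery", "Trying to acquire journal lock"),
)


def fencing_timeline(log_text):
    """Return [(label, seconds after the earliest milestone)] for each milestone found.

    Assumes every line is from the same day, since syslog timestamps have no year.
    """
    first_seen = {}
    for line in log_text.splitlines():
        m = LINE_RE.match(line)
        if not m:
            continue
        hh, mm, ss, _host, _proc, message = m.groups()
        t = int(hh) * 3600 + int(mm) * 60 + int(ss)
        for label, needle in MILESTONES:
            if label not in first_seen and needle in message:
                first_seen[label] = t
    if not first_seen:
        return []
    start = min(first_seen.values())
    return [(label, first_seen[label] - start) for label, _ in MILESTONES if label in first_seen]


if __name__ == "__main__":
    SAMPLE = """\
Jun  5 20:19:30 fu1 openais[5351]: [TOTEM] The token was lost in the OPERATIONAL state.
Jun  5 20:19:34 fu1 fenced[5367]: fencing node "fu2"
Jun  5 20:19:36 fu1 fenced[5367]: fence "fu2" success
Jun  5 20:19:41 fu1 kernel: GFS2: fsid=storage1:001vg_gfs.0: jid=1: Trying to acquire journal lock...
"""
    for label, delta in fencing_timeline(SAMPLE):
        print(f"+{delta:2d}s  {label}")
```

For the excerpt above this gives offsets of 0, 4, 6 and 11 seconds: fu1 starts fencing fu2 four seconds after logging the token loss, and GFS2 journal recovery begins five seconds after the fence succeeds.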


