[Linux-cluster] Problem with machines fencing one another in 2 Node NFS cluster

Thu Feb 24 13:34:04 UTC 2011

Thanks for the response.  Sorry for the delay.  I had an issue that, 
unexpectedly, took me away from the office.  I am just getting back to 
this now.

Yes, the MAC addresses were all updated after the cloning.  According to 
my notes, here are sections of the log files at the time of a fence from 
each cluster node.

Feb 10 15:17:48 nfs2-cluster clurgmgrd[4280]:<notice>  Resource Group Manager Starting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice>  Shutting down Cluster Service Manager...
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice>  Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice>  Shutting down
Feb 10 15:18:17 nfs2-cluster clurgmgrd[4280]:<notice>  Shutdown complete, exiting
Feb 10 15:18:17 nfs2-cluster rgmanager: [7580]:<notice>  Cluster Service Manager is stopped.
Feb 10 15:18:23 nfs2-cluster ccsd[2989]: Stopping ccsd, SIGTERM received.
Feb 10 15:18:23 nfs2-cluster NAMC
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading all openais components
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_confdb v0 (19/10)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cpg v0 (18/8)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cfg v0 (17/7)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_msg v0 (16/6)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_lck v0 (15/5)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evt v0 (14/4)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_ckpt v0 (13/3)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_amf v0 (12/2)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_clm v0 (11/1)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_evs v0 (10/0)
Feb 10 15:18:23 nfs2-cluster openais[3045]: [SERV ] Unloading openais component: openais_cman v0 (9/9)
Feb 10 15:18:23 nfs2-cluster gfs_controld[3077]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster dlm_controld[3071]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster fenced[3065]: cluster is down, exiting
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:23 nfs2-cluster kernel: dlm: closing connection to node 1

Feb 10 15:17:34 nfs1-cluster ntpd[3765]: synchronized to LOCAL(0), stratum 10
Feb 10 15:18:17 nfs1-cluster clurgmgrd[4323]:<notice>  Member 2 shutting down
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] The token was lost in the OPERATIONAL state.
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 2.
Feb 10 15:18:34 nfs1-cluster ntpd[3765]: synchronized to 132.236.56.250, stratum 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering GATHER state from 0.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Creating commit token because I am the rep.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Saving state aru 230 high seq received 230
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Storing new sequence id for ring 1f80
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering COMMIT state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering RECOVERY state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] position [0] member 140.90.91.240:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] previous ring seq 8060 rep 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] aru 230 high delivered 230 received flag 1
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Did not need to originate any messages in recovery.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] Sending initial ORF token
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ]     r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster kernel: dlm: closing connection to node 2
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ]     r(0) ip(140.90.91.242)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] CLM CONFIGURATION CHANGE
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] New Configuration:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ]     r(0) ip(140.90.91.240)
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Left:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] Members Joined:
Feb 10 15:18:35 nfs1-cluster openais[3046]: [SYNC ] This node is within the primary component and will provide service.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering OPERATIONAL state.
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CLM  ] got nodejoin message 140.90.91.240
Feb 10 15:18:35 nfs1-cluster openais[3046]: [CPG  ] got joinlist message from node 1
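
For anyone lining up the two excerpts above, here is a minimal Python sketch that measures the gap between two events in a syslog excerpt. It is only an illustration: the helper names parse_ts and seconds_between are made up for this example, the year 2011 is assumed from the date of this thread (syslog lines carry no year), and month names are assumed to be English.

```python
from datetime import datetime

# Three lines copied from the nfs1-cluster excerpt above.
SAMPLE = [
    "Feb 10 15:18:17 nfs1-cluster clurgmgrd[4323]:<notice>  Member 2 shutting down",
    "Feb 10 15:18:33 nfs1-cluster openais[3046]: [TOTEM] The token was lost in the OPERATIONAL state.",
    "Feb 10 15:18:35 nfs1-cluster openais[3046]: [TOTEM] entering OPERATIONAL state.",
]

def parse_ts(line, year=2011):
    """Parse the leading 'Mon DD HH:MM:SS' stamp; the year is an assumption."""
    stamp = " ".join(line.split()[:3])
    return datetime.strptime(f"{year} {stamp}", "%Y %b %d %H:%M:%S")

def seconds_between(lines, first_marker, second_marker):
    """Seconds from the first line containing first_marker to the next
    line at or after it that contains second_marker, or None if either
    marker is missing."""
    first = None
    for line in lines:
        if first is None:
            if first_marker in line:
                first = parse_ts(line)
        elif second_marker in line:
            return (parse_ts(line) - first).total_seconds()
    return None

gap = seconds_between(SAMPLE, "Member 2 shutting down", "token was lost")
print(f"shutdown notice to token loss: {gap:.0f} s")
```

On the three sample lines this reports 16 seconds between the "Member 2 shutting down" notice and the token loss on nfs1-cluster.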

I was seeing a number of these messages, but they stopped after I
upgraded openais:

nfs2-cluster openais[3012]: [TOTEM] Retransmit List: 1df3

Yes, the nodes are connected to managed switches.  I will try to run
tcpdump ASAP.  Unfortunately, that means the cluster has to crash again
before I can capture what I need, and my users are already annoyed by
the downtime we've had.  I know this setup isn't the best solution for
our needs, but given the lack of funding, it seemed like a good idea at
the time.
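
One way to have a capture ready without watching for the next fence is to leave tcpdump running with a fixed-size ring of capture files. The sketch below only builds and launches such a command from Python. Everything specific in it is an assumption: the interface name eth0, the output path, the file sizes, and UDP port 5405 as the cluster's totem port (the value in cluster.conf should be checked first). The -C and -W options rotate the capture files; older tcpdump builds may not support -W.

```python
import subprocess

def build_capture_cmd(interface="eth0", port=5405,
                      out_prefix="/var/tmp/totem.pcap",
                      file_mb=50, file_count=10):
    """Return a tcpdump argument list for a rotating capture.

    All defaults are placeholders, not values taken from this cluster.
    -C limits each file to roughly file_mb million bytes and -W keeps at
    most file_count files, overwriting the oldest, so disk use is bounded.
    """
    return [
        "tcpdump", "-n",
        "-i", interface,
        "-w", out_prefix,
        "-C", str(file_mb),
        "-W", str(file_count),
        "udp", "and", "port", str(port),
    ]

def run_capture(**kwargs):
    """Start the capture (needs root); blocks until tcpdump exits."""
    subprocess.run(build_capture_cmd(**kwargs), check=True)

print(" ".join(build_capture_cmd()))
```

Calling run_capture() on each node would keep about the last 500 MB of matching traffic on disk, so the packets leading up to a fence should still be there afterwards.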

Thanks for the help!

Randy

On 02/14/2011 09:03 AM, Digimer wrote:
> On 02/14/2011 08:53 AM, Randy Brown wrote:
>> Hello,
>>
>> I am running a 2 node cluster being used as a NAS head for a Lefthand
>> Networks iSCSI SAN to provide NFS mounts out to my network.  Things have
>> been OK for a while, but I recently lost one of the nodes as a result of
>> a patching problem.  In an effort to recreate the failed node, I imaged
>> the working node and installed that image on the failed node.  I set
>> its hostname and IP settings correctly and the machine booted and
>> joined the cluster just fine.  Or at least it appeared so.  Things ran
>> OK for the last few weeks, but I recently started seeing a behavior
>> where the nodes start fencing each other.  I'm wondering if there is
>> something as a result of cloning the nodes that could be the problem.
>> Possibly something that should be different but isn't because of the
>> cloning?
>>
>> I am running CentOS 5.5 with the following package versions:
>>
>> Kernel - 2.6.18-194.11.3.el5 #1 SMP
>> cman-2.0.115-34.el5_5.4
>> lvm2-cluster-2.02.56-7.el5_5.4
>> gfs2-utils-0.1.62-20.el5
>> kmod-gfs-0.1.34-12.el5.centos
>> rgmanager-2.0.52-6.el5.centos.8
>>
>> I have a Qlogic qla4062 HBA in the node running: QLogic iSCSI HBA Driver
>> (f8b83000) v5.01.03.04
>>
>> I will gladly provide more information as needed.
>>
>> Thank you,
>> Randy
> Silly question, but are the NICs mapped to their MAC addresses? If so,
> did you update the MAC addresses after cloning the server to reflect the
> actual MAC addresses? Assuming so, do you have managed switches? If so,
> can you test by swapping out a simple, unmanaged switch?
>
> This sounds like a multicast issue at some level. Fencing happens once
> the totem ring is declared failed. Do you see anything interesting in
> the log files prior to the fence? Can you run tcpdump to see what is
> happening on the interface(s) prior to the fence?
>