Re: [Linux-cluster] fencing problem



Hello,

I know of somebody else seeing this problem as well. It's a problem with ccsd:
fenced calls ccs_get() to get the next fence method, and the call fails because
ccsd doesn't have an open connection struct to process that request. I'm
getting ready to build a debug ccs package for the other individual experiencing
this problem. Would you be willing to run it as well and provide feedback?
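
For reference, the fallback that the FAQ describes can be sketched roughly as
follows. This is only an illustration in Python, not fenced's actual code: the
helper names (fence_methods, fence_node, fake_agent) are hypothetical, the
config is a trimmed copy of the cluster.conf quoted below, and it assumes a
method counts as successful only when every device in it succeeds.

```python
import xml.etree.ElementTree as ET

# Trimmed-down fragment modelled on the cluster.conf quoted in this thread.
CONF = """
<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>
"""


def fence_methods(conf_xml, node_name):
    """Return [(method_name, [device_name, ...]), ...] in document order."""
    root = ET.fromstring(conf_xml.strip())
    for node in root.iter("clusternode"):
        if node.get("name") == node_name:
            return [
                (m.get("name"), [d.get("name") for d in m.findall("device")])
                for m in node.findall("fence/method")
            ]
    return []


def fence_node(conf_xml, node_name, run_agent):
    """Try each method in order; return the first method name that succeeds."""
    for method, devices in fence_methods(conf_xml, node_name):
        # Assumption: every device in a method must succeed for it to count.
        if devices and all(run_agent(dev) for dev in devices):
            return method
    return None  # every level failed; the caller would retry later


def fake_agent(device):
    # Hypothetical stand-in for a fence agent: ILOM unreachable, UPS works.
    return device.startswith("UPS_")


print(fence_node(CONF, "node-a", fake_agent))  # prints 2
```

With the ILOM unreachable, the sketch falls through to method "2". In the
failing case from the logs, the lookup of the next method itself errors out,
so the second level is never attempted.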
Thank you,

Josef

On Thu, Dec 14, 2006 at 03:19:21PM +0000, Marcos David wrote:
> Hello,
> I still need help with this one ;)
> 
> help! please!
> 
> Thanks.
> 
> Marcos David wrote:
> >hello,
> >I'm experiencing some problems with cluster fencing.
> >First, let's start with the specs:
> >
> >It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.
> >
> >The machines both have an ILOM device that acts as a first level of fencing.
> >Then there is a second level of fencing that is performed by a UPS.
> >
> >My problem is the following:
> >If I shut down one of the nodes (simulating a power failure), the other
> >tries to fence the failed node. So far so good.
> >The problem is that, since the ILOM in the node is offline, the second
> >node keeps trying to fence the ILOM device and never gives up!
> >
> >According to what I've read on the FAQ about fencing levels, if the 
> >first level fails it should go to the second level, and so on...
> >
> >But it never does this!
> >
> >Here is a copy of the /var/log/messages:
> >
> >Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the cluster : Missed too many heartbeats
> >Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after 0 sec post_fail_delay
> >Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
> >Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports: Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect after 30 seconds Failed
> >Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection descriptor received.
> >Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid request descriptor
> >Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
> >Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
> >
> >The last 4 lines repeat forever...
> >
> >Here is a copy of the cluster.conf:
> >
> >
> ><?xml version="1.0"?>
> ><cluster config_version="19" name="SERVER-A">
> >       <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >       <clusternodes>
> >               <clusternode name="node-a" votes="1">
> >                       <fence>
> >                               <method name="1">
> >                                       <device name="fence_node-a"/>
> >                               </method>
> >                               <method name="2">
> >                                       <device name="UPS_node-a"/>
> >                               </method>
> >                       </fence>
> >               </clusternode>
> >               <clusternode name="node-b" votes="1">
> >                       <fence>
> >                               <method name="1">
> >                                       <device name="fence_node-b"/>
> >                               </method>
> >                               <method name="2">
> >                                       <device name="UPS_node-b"/>
> >                               </method>
> >                       </fence>
> >               </clusternode>
> >       </clusternodes>
> >       <cman expected_votes="1" two_node="1"/>
> >       <fencedevices>
> >               <fencedevice agent="fence_ipmilan" auth="password" 
> >ipaddr="172.18.57.17" login="root" name="fence_node-a" 
> >passwd="changeme"/>
> >               <fencedevice agent="fence_ipmilan" auth="password" 
> >ipaddr="172.18.57.18" login="root" name="fence_node-b" 
> >passwd="changeme"/>
> >               <fencedevice agent="fence_apc" ipaddr="172.18.57.20" 
> >login="power" name="UPS_node-a" passwd="power"/>
> >               <fencedevice agent="fence_apc" ipaddr="172.18.57.21" 
> >login="power" name="UPS_node-b" passwd="power"/>
> >
> >       </fencedevices>
> >       <rm>
> >               <failoverdomains>
> >                       <failoverdomain name="Cluster_0" ordered="1" 
> >restricted="0">
> >                               <failoverdomainnode name="node-a" 
> >priority="1"/>
> >                               <failoverdomainnode name="node-b" 
> >priority="1"/>
> >                       </failoverdomain>
> >               </failoverdomains>
> >               <resources>
> >                       <fs device="/dev/sdb1" force_fsck="1" 
> >force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared" 
> >name="Storedge_Shared" options="" self_fence="1"/>
> >                       <ip address="172.18.57.16" monitor_link="1"/>
> >                       <ip address="172.18.57.11" monitor_link="1"/>
> >                       <ip address="172.18.57.14" monitor_link="1"/>
> >               </resources>
> >               <service autostart="1" domain="Cluster_0" 
> >name="postgresql">
> >                       <ip ref="172.18.57.16">
> >                               <fs ref="Storedge_Shared">
> >                                       <script 
> >file="/etc/init.d/postgresql" 
> >name="PostgreSQL"/>
> >                               </fs>
> >                       </ip>
> >               </service>
> >               <service autostart="1" domain="Cluster_0" name="afs">
> >                       <ip ref="172.18.57.14">
> >                               <script file="/etc/init.d/afs" 
> >name="AFS"/>
> >                       </ip>
> >               </service>
> >       </rm>
> ></cluster>
> >
> >I would like to know a way to solve this problem.... :-)
> >
> >Thanks in advance,
> >
> >Marcos David
> >
> >
> >
> >
> >-- 
> >Linux-cluster mailing list
> >Linux-cluster redhat com
> >https://www.redhat.com/mailman/listinfo/linux-cluster
> >
> 
> 
