[Linux-cluster] fencing problem
Josef Whiter
jwhiter at redhat.com
Thu Dec 14 16:02:32 UTC 2006
Hello,
I think I've figured out why this is happening. I've opened bz219633 to track
this issue, if you would like to subscribe to it. I should have something
useful for you to test today.
Josef
On Thu, Dec 14, 2006 at 03:27:16PM +0000, Marcos David wrote:
> Sure.
> If you can get me the package, I can install it on our testing
> environment and provide you with the results.
>
> Greets,
> Marcos David
>
>
> Josef Whiter wrote:
> >Hello,
> >
> >I know of somebody else who is seeing this problem as well. It's a problem
> >with ccsd: fenced does a ccs_get() to fetch the next fence method, and the
> >call fails because ccsd doesn't have an open connection descriptor for the
> >requesting process. I'm getting ready to build a debug ccs package for the
> >other individual experiencing this problem; would you be willing to run it
> >as well and provide feedback?
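As a toy illustration of that failure mode (a hypothetical Python model, not the real ccsd code): ccsd hands out connection descriptors on connect, and a get() against a descriptor it no longer holds is rejected, which matches the "Invalid connection descriptor" lines in the log below.

```python
class DescriptorTable:
    """Toy model of ccsd's per-process connection bookkeeping
    (illustrative only -- not the real ccsd source)."""

    def __init__(self):
        self._next_desc = 0
        self._open = set()

    def connect(self):
        # Like ccs_connect(): hand out a fresh descriptor and remember it.
        desc = self._next_desc
        self._next_desc += 1
        self._open.add(desc)
        return desc

    def get(self, desc, query):
        # Like ccs_get(): only valid against a descriptor we still hold.
        if desc not in self._open:
            raise ValueError("Invalid connection descriptor")
        return f"<value for {query}>"
```

If fenced asks for the next method over a descriptor ccsd has lost track of, the lookup fails and fenced never reaches method 2.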
> >Thank you,
> >
> >Josef
> >
> >On Thu, Dec 14, 2006 at 03:19:21PM +0000, Marcos David wrote:
> >
> >>Hello,
> >>I still need help with this one ;)
> >>
> >>help! please!
> >>
> >>Thanks.
> >>
> >>Marcos David wrote:
> >>
> >>>Hello,
> >>>I'm experiencing some problems with cluster fencing.
> >>>First, the specs:
> >>>
> >>>It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.
> >>>
> >>>Both machines have an ILOM device that acts as the first level of fencing;
> >>>a second level of fencing is performed by a UPS.
> >>>
> >>>My problem is the following:
> >>>if I shut down one of the nodes (simulating a power failure), the other
> >>>tries to fence the failed node. So far so good.
> >>>The problem is that, since the ILOM on the failed node is offline, the
> >>>second node keeps trying to fence via the ILOM device and never gives up!
> >>>
> >>>According to what I've read in the FAQ about fencing levels, if the
> >>>first level fails it should go to the second level, and so on...
> >>>
> >>>But it never does this!
> >>>
> >>>Here is a copy of the /var/log/messages:
> >>>
> >>>Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the
> >>>cluster : Missed too many heartbeats
> >>>Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after
> >>>0 sec post_fail_delay
> >>>Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
> >>>Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports:
> >>>Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect
> >>>after 30 seconds Failed
> >>>Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection
> >>>descriptor received.
> >>>Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid
> >>>request descriptor
> >>>Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
> >>>Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
> >>>
> >>>The last four lines repeat forever...
> >>>
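For what it's worth, the fallback behaviour the FAQ describes can be sketched in a few lines (a simplified Python model, not the actual fenced source; the agent names are illustrative):

```python
def fence_node(methods):
    """Try each fence method in order; stop at the first that succeeds.

    `methods` is an ordered list of callables returning True on success,
    e.g. [ipmi_fence, ups_fence] -- names are assumptions, not real APIs.
    Per the documented behaviour, method 1 failing should make fenced
    fall through to method 2 (and retry from the top only if all fail).
    """
    for method in methods:
        if method():
            return True
    return False

# Even if the ILOM agent fails, the UPS agent should still get its turn:
print(fence_node([lambda: False, lambda: True]))  # True
```

In the logs above, the ccsd error means fenced cannot even look up method 2, so it retries method 1 indefinitely instead of falling through.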
> >>>Here is a copy of the cluster.conf:
> >>>
> >>>
> >>><?xml version="1.0"?>
> >>><cluster config_version="19" name="SERVER-A">
> >>> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >>> <clusternodes>
> >>> <clusternode name="node-a" votes="1">
> >>> <fence>
> >>> <method name="1">
> >>> <device name="fence_node-a"/>
> >>> </method>
> >>> <method name="2">
> >>> <device name="UPS_node-a"/>
> >>> </method>
> >>> </fence>
> >>> </clusternode>
> >>> <clusternode name="node-b" votes="1">
> >>> <fence>
> >>> <method name="1">
> >>> <device name="fence_node-b"/>
> >>> </method>
> >>> <method name="2">
> >>> <device name="UPS_node-b"/>
> >>> </method>
> >>> </fence>
> >>> </clusternode>
> >>> </clusternodes>
> >>> <cman expected_votes="1" two_node="1"/>
> >>> <fencedevices>
> >>> <fencedevice agent="fence_ipmilan" auth="password"
> >>>ipaddr="172.18.57.17" login="root" name="fence_node-a"
> >>>passwd="changeme"/>
> >>> <fencedevice agent="fence_ipmilan" auth="password"
> >>>ipaddr="172.18.57.18" login="root" name="fence_node-b"
> >>>passwd="changeme"/>
> >>> <fencedevice agent="fence_apc" ipaddr="172.18.57.20"
> >>>login="power" name="UPS_node-a" passwd="power"/>
> >>> <fencedevice agent="fence_apc" ipaddr="172.18.57.21"
> >>>login="power" name="UPS_node-b" passwd="power"/>
> >>>
> >>> </fencedevices>
> >>> <rm>
> >>> <failoverdomains>
> >>> <failoverdomain name="Cluster_0" ordered="1"
> >>>restricted="0">
> >>> <failoverdomainnode name="node-a"
> >>>priority="1"/>
> >>> <failoverdomainnode name="node-b"
> >>>priority="1"/>
> >>> </failoverdomain>
> >>> </failoverdomains>
> >>> <resources>
> >>> <fs device="/dev/sdb1" force_fsck="1"
> >>>force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared"
> >>>name="Storedge_Shared" options="" self_fence="1"/>
> >>> <ip address="172.18.57.16" monitor_link="1"/>
> >>> <ip address="172.18.57.11" monitor_link="1"/>
> >>> <ip address="172.18.57.14" monitor_link="1"/>
> >>> </resources>
> >>> <service autostart="1" domain="Cluster_0"
> >>>name="postgresql">
> >>> <ip ref="172.18.57.16">
> >>> <fs ref="Storedge_Shared">
> >>>                                        <script
> >>>file="/etc/init.d/postgresql"
> >>>name="PostgreSQL"/>
> >>> </fs>
> >>> </ip>
> >>> </service>
> >>> <service autostart="1" domain="Cluster_0" name="afs">
> >>> <ip ref="172.18.57.14">
> >>> <script file="/etc/init.d/afs"
> >>>name="AFS"/>
> >>> </ip>
> >>> </service>
> >>> </rm>
> >>></cluster>
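As a sanity check that the config itself really defines both levels, the methods can be read back with a few lines of Python (ElementTree; the conf string below is a trimmed copy of the file above, node-a only):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf quoted above -- fence section only.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>"""

def fence_methods(conf_xml, node):
    """Return the ordered (method name, [device names]) pairs for one node."""
    root = ET.fromstring(conf_xml)
    methods = root.findall(f".//clusternode[@name='{node}']/fence/method")
    return [(m.get("name"), [d.get("name") for d in m.findall("device")])
            for m in methods]

print(fence_methods(CLUSTER_CONF, "node-a"))
# [('1', ['fence_node-a']), ('2', ['UPS_node-a'])]
```

Both methods are there and ordered, so the endless retry of method 1 points at the ccsd lookup failure rather than the config.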
> >>>
> >>>I would like to know a way to solve this problem.... :-)
> >>>
> >>>Thanks in advance,
> >>>
> >>>Marcos David
> >>>
> >>>
> >>>
> >>>
> >>>--
> >>>Linux-cluster mailing list
> >>>Linux-cluster at redhat.com
> >>>https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>
> >>>