[Linux-cluster] fencing problem
Marcos David
marcos.david at efacec.pt
Thu Dec 14 15:27:16 UTC 2006
Sure.
If you can get me the package, I can install it on our testing
environment and provide you with the results.
Greets,
Marcos David
Josef Whiter wrote:
> Hello,
>
> I know of somebody else seeing this problem as well. It's a problem with ccsd:
> fenced does a ccs_get() to fetch the next fence method, and the call fails because
> ccsd doesn't have an open connection struct for the process making the request. I'm
> getting ready to build a debug ccs package for the other individual experiencing
> this problem; would you be willing to run it as well and provide feedback?
> Thank you,
>
> Josef
>
> On Thu, Dec 14, 2006 at 03:19:21PM +0000, Marcos David wrote:
>
>> Hello,
>> I still need help with this one ;)
>>
>> help! please!
>>
>> Thanks.
>>
>> Marcos David wrote:
>>
>>> Hello,
>>> I'm experiencing some problems with cluster fencing.
>>> First, the specs:
>>>
>>> It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.
>>>
>>> Both machines have an ILOM device that acts as the first level of fencing,
>>> and there is a second level of fencing performed by a UPS.
>>>
>>> My problem is the following:
>>> if I shut down one of the nodes (simulating a power failure), the other
>>> tries to fence the failed node. So far so good.
>>> The problem is that, since the ILOM in the failed node is offline, the
>>> second node keeps trying to fence through the ILOM device and never gives up!
>>>
>>> According to what I've read in the FAQ about fencing levels, if the
>>> first level fails, fenced should move on to the second level, and so on...
>>>
>>> But it never does!
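
The cascade described above can be sketched in a few lines (a minimal illustration of the documented behavior, not fenced's actual code; the function and agent names here are hypothetical):

```python
# Minimal sketch of multi-level fencing: try each method in order, and only
# start over from the first method after every level has failed.
# (Illustrative only; names are hypothetical, not fenced's internal API.)

def fence_node(methods, max_rounds=3):
    """methods: list of (name, agent) pairs; agent() returns True on success."""
    for _ in range(max_rounds):
        for name, agent in methods:
            if agent():          # first successful method wins
                return name
    return None                  # every method failed in every round

# Simulate the reported scenario: the ILOM (method "1") is powered off and
# always fails, so fencing should fall through to the UPS (method "2").
ilom_down = lambda: False
ups_ok = lambda: True
print(fence_node([("1", ilom_down), ("2", ups_ok)]))  # -> 2
```

In the failing cluster, fenced behaves as if the inner loop never advances past method "1".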
>>>
>>> Here is a copy of /var/log/messages:
>>>
>>> Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the
>>> cluster : Missed too many heartbeats
>>> Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after
>>> 0 sec post_fail_delay
>>> Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
>>> Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports:
>>> Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect
>>> after 30 seconds Failed
>>> Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection
>>> descriptor received.
>>> Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid
>>> request descriptor
>>> Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
>>> Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
>>>
>>> The last 4 lines repeat forever...
>>>
>>> Here is a copy of the cluster.conf:
>>>
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="19" name="SERVER-A">
>>> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>> <clusternodes>
>>> <clusternode name="node-a" votes="1">
>>> <fence>
>>> <method name="1">
>>> <device name="fence_node-a"/>
>>> </method>
>>> <method name="2">
>>> <device name="UPS_node-a"/>
>>> </method>
>>> </fence>
>>> </clusternode>
>>> <clusternode name="node-b" votes="1">
>>> <fence>
>>> <method name="1">
>>> <device name="fence_node-b"/>
>>> </method>
>>> <method name="2">
>>> <device name="UPS_node-b"/>
>>> </method>
>>> </fence>
>>> </clusternode>
>>> </clusternodes>
>>> <cman expected_votes="1" two_node="1"/>
>>> <fencedevices>
>>> <fencedevice agent="fence_ipmilan" auth="password"
>>> ipaddr="172.18.57.17" login="root" name="fence_node-a"
>>> passwd="changeme"/>
>>> <fencedevice agent="fence_ipmilan" auth="password"
>>> ipaddr="172.18.57.18" login="root" name="fence_node-b"
>>> passwd="changeme"/>
>>> <fencedevice agent="fence_apc" ipaddr="172.18.57.20"
>>> login="power" name="UPS_node-a" passwd="power"/>
>>> <fencedevice agent="fence_apc" ipaddr="172.18.57.21"
>>> login="power" name="UPS_node-b" passwd="power"/>
>>>
>>> </fencedevices>
>>> <rm>
>>> <failoverdomains>
>>> <failoverdomain name="Cluster_0" ordered="1"
>>> restricted="0">
>>> <failoverdomainnode name="node-a"
>>> priority="1"/>
>>> <failoverdomainnode name="node-b"
>>> priority="1"/>
>>> </failoverdomain>
>>> </failoverdomains>
>>> <resources>
>>> <fs device="/dev/sdb1" force_fsck="1"
>>> force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared"
>>> name="Storedge_Shared" options="" self_fence="1"/>
>>> <ip address="172.18.57.16" monitor_link="1"/>
>>> <ip address="172.18.57.11" monitor_link="1"/>
>>> <ip address="172.18.57.14" monitor_link="1"/>
>>> </resources>
>>> <service autostart="1" domain="Cluster_0"
>>> name="postgresql">
>>> <ip ref="172.18.57.16">
>>> <fs ref="Storedge_Shared">
>>> <script
>>> file="/etc/init.d/postgresql"
>>> name="PostgreSQL"/>
>>> </fs>
>>> </ip>
>>> </service>
>>> <service autostart="1" domain="Cluster_0" name="afs">
>>> <ip ref="172.18.57.14">
>>> <script file="/etc/init.d/afs"
>>> name="AFS"/>
>>> </ip>
>>> </service>
>>> </rm>
>>> </cluster>
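
As a quick sanity check on the config above, the per-node fence method ordering can be verified with a few lines of Python (a standalone illustration using a trimmed copy of the config, not an RHCS tool):

```python
# Parse a trimmed copy of the cluster.conf and confirm that each clusternode
# lists its fence methods in the intended order, so that a failure of
# method "1" (ILOM) should cascade to method "2" (UPS).
import xml.etree.ElementTree as ET

conf = """<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>"""

root = ET.fromstring(conf)
for node in root.iter("clusternode"):
    order = [m.get("name") for m in node.find("fence").iter("method")]
    print(node.get("name"), order)   # node-a ['1', '2']
```

ElementTree preserves document order, so this shows the methods exactly as ccsd would hand them to fenced.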
>>>
>>> I would like to know how to solve this problem... :-)
>>>
>>> Thanks in advance,
>>>
>>> Marcos David
>>>
>>>
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>
>
>
>