[Linux-cluster] fencing problem
Marcos David
marcos.david at efacec.pt
Thu Dec 14 15:27:16 UTC 2006
Sure.
If you can get me the package, I can install it on our testing
environment and provide you with the results.
Greets,
Marcos David
Josef Whiter wrote:
> Hello,
>
> I know of somebody else seeing this problem as well. It's a problem with ccsd:
> fenced does a ccs_get() to fetch the next fence method, and the call fails because
> ccsd doesn't have an open connection struct for the process making the request. I'm
> getting ready to build a debug ccs package for the other individual experiencing
> this problem; would you be willing to run it as well and provide feedback?
> Thank you,
>
> Josef
>
> On Thu, Dec 14, 2006 at 03:19:21PM +0000, Marcos David wrote:
>
>> Hello,
>> I still need help with this one ;)
>>
>> help! please!
>>
>> Thanks.
>>
>> Marcos David wrote:
>>
>>> Hello,
>>> I'm experiencing some problems with cluster fencing.
>>> First, the specs:
>>>
>>> It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.
>>>
>>> Both machines have an ILOM device that acts as the first level of fencing,
>>> and there is a second level of fencing performed by a UPS.
>>>
>>> My problem is the following:
>>> if I shut down one of the nodes (simulating a power failure), the other
>>> tries to fence the failed node. So far so good.
>>> The problem is that, since the ILOM in the failed node is offline, the
>>> second node keeps trying to fence through the ILOM device and never gives up!
>>>
>>> According to what I've read in the FAQ about fencing levels, if the
>>> first level fails, fenced should move on to the second level, and so on...
>>>
>>> But it never does!
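
The cascade described above can be sketched in a few lines (a minimal illustration of the documented behavior, not fenced's actual code; the function and agent names here are hypothetical):

```python
# Minimal sketch of multi-level fencing: try each method in order, and only
# start over from the first method after every level has failed.
# (Illustrative only; names are hypothetical, not fenced's internal API.)

def fence_node(methods, max_rounds=3):
    """methods: list of (name, agent) pairs; agent() returns True on success."""
    for _ in range(max_rounds):
        for name, agent in methods:
            if agent():          # first successful method wins
                return name
    return None                  # every method failed in every round

# Simulate the reported scenario: the ILOM (method "1") is powered off and
# always fails, so fencing should fall through to the UPS (method "2").
ilom_down = lambda: False
ups_ok = lambda: True
print(fence_node([("1", ilom_down), ("2", ups_ok)]))  # -> 2
```

In the failing cluster, fenced behaves as if the inner loop never advances past method "1".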
>>>
>>> Here is a copy of /var/log/messages:
>>>
>>> Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the
>>> cluster : Missed too many heartbeats
>>> Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after
>>> 0 sec post_fail_delay
>>> Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
>>> Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports:
>>> Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect
>>> after 30 seconds Failed
>>> Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection
>>> descriptor received.
>>> Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid
>>> request descriptor
>>> Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
>>> Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
>>>
>>> The last 4 lines repeat forever...
>>>
>>> Here is a copy of the cluster.conf:
>>>
>>>
>>> <?xml version="1.0"?>
>>> <cluster config_version="19" name="SERVER-A">
>>> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
>>> <clusternodes>
>>> <clusternode name="node-a" votes="1">
>>> <fence>
>>> <method name="1">
>>> <device name="fence_node-a"/>
>>> </method>
>>> <method name="2">
>>> <device name="UPS_node-a"/>
>>> </method>
>>> </fence>
>>> </clusternode>
>>> <clusternode name="node-b" votes="1">
>>> <fence>
>>> <method name="1">
>>> <device name="fence_node-b"/>
>>> </method>
>>> <method name="2">
>>> <device name="UPS_node-b"/>
>>> </method>
>>> </fence>
>>> </clusternode>
>>> </clusternodes>
>>> <cman expected_votes="1" two_node="1"/>
>>> <fencedevices>
>>> <fencedevice agent="fence_ipmilan" auth="password"
>>> ipaddr="172.18.57.17" login="root" name="fence_node-a"
>>> passwd="changeme"/>
>>> <fencedevice agent="fence_ipmilan" auth="password"
>>> ipaddr="172.18.57.18" login="root" name="fence_node-b"
>>> passwd="changeme"/>
>>> <fencedevice agent="fence_apc" ipaddr="172.18.57.20"
>>> login="power" name="UPS_node-a" passwd="power"/>
>>> <fencedevice agent="fence_apc" ipaddr="172.18.57.21"
>>> login="power" name="UPS_node-b" passwd="power"/>
>>>
>>> </fencedevices>
>>> <rm>
>>> <failoverdomains>
>>> <failoverdomain name="Cluster_0" ordered="1"
>>> restricted="0">
>>> <failoverdomainnode name="node-a"
>>> priority="1"/>
>>> <failoverdomainnode name="node-b"
>>> priority="1"/>
>>> </failoverdomain>
>>> </failoverdomains>
>>> <resources>
>>> <fs device="/dev/sdb1" force_fsck="1"
>>> force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared"
>>> name="Storedge_Shared" options="" self_fence="1"/>
>>> <ip address="172.18.57.16" monitor_link="1"/>
>>> <ip address="172.18.57.11" monitor_link="1"/>
>>> <ip address="172.18.57.14" monitor_link="1"/>
>>> </resources>
>>> <service autostart="1" domain="Cluster_0"
>>> name="postgresql">
>>> <ip ref="172.18.57.16">
>>> <fs ref="Storedge_Shared">
>>> <script
>>> file="/etc/init.d/postgresql"
>>> name="PostgreSQL"/>
>>> </fs>
>>> </ip>
>>> </service>
>>> <service autostart="1" domain="Cluster_0" name="afs">
>>> <ip ref="172.18.57.14">
>>> <script file="/etc/init.d/afs"
>>> name="AFS"/>
>>> </ip>
>>> </service>
>>> </rm>
>>> </cluster>
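
As a quick sanity check on the config above, the per-node fence method ordering can be verified with a few lines of Python (a standalone illustration using a trimmed copy of the config, not an RHCS tool):

```python
# Parse a trimmed copy of the cluster.conf and confirm that each clusternode
# lists its fence methods in the intended order, so that a failure of
# method "1" (ILOM) should cascade to method "2" (UPS).
import xml.etree.ElementTree as ET

conf = """<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>"""

root = ET.fromstring(conf)
for node in root.iter("clusternode"):
    order = [m.get("name") for m in node.find("fence").iter("method")]
    print(node.get("name"), order)   # node-a ['1', '2']
```

ElementTree preserves document order, so this shows the methods exactly as ccsd would hand them to fenced.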
>>>
>>> I would like to know how to solve this problem... :-)
>>>
>>> Thanks in advance,
>>>
>>> Marcos David
>>>
>>>
>>>
>>>
>>> --
>>> Linux-cluster mailing list
>>> Linux-cluster at redhat.com
>>> https://www.redhat.com/mailman/listinfo/linux-cluster
>>>
>>>
>>
>
>
>