[Linux-cluster] fencing problem
Josef Whiter
jwhiter at redhat.com
Thu Dec 14 16:02:32 UTC 2006
Hello,
I think I've figured out why this is happening. I've opened bz219633 to track
this issue, if you would like to subscribe to it. I should have something
useful for you to test today.
Josef
On Thu, Dec 14, 2006 at 03:27:16PM +0000, Marcos David wrote:
> Sure.
> If you can get me the package, I can install it on our testing
> environment and provide you with the results.
>
> Greets,
> Marcos David
>
>
> Josef Whiter wrote:
> >Hello,
> >
> >I know of somebody else who is seeing this problem as well. It's a problem
> >with ccsd: fenced does a ccs_get() to fetch the next fence method, and the
> >call fails because ccsd doesn't have an open connection descriptor for the
> >requesting process. I'm getting ready to build a debug ccs package for the
> >other individual experiencing this problem; would you be willing to run it
> >as well and provide feedback?
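As a toy illustration of that failure mode (a hypothetical Python model, not the real ccsd code): ccsd hands out connection descriptors on connect, and a get() against a descriptor it no longer holds is rejected, which matches the "Invalid connection descriptor" lines in the log below.

```python
class DescriptorTable:
    """Toy model of ccsd's per-process connection bookkeeping
    (illustrative only -- not the real ccsd source)."""

    def __init__(self):
        self._next_desc = 0
        self._open = set()

    def connect(self):
        # Like ccs_connect(): hand out a fresh descriptor and remember it.
        desc = self._next_desc
        self._next_desc += 1
        self._open.add(desc)
        return desc

    def get(self, desc, query):
        # Like ccs_get(): only valid against a descriptor we still hold.
        if desc not in self._open:
            raise ValueError("Invalid connection descriptor")
        return f"<value for {query}>"
```

If fenced asks for the next method over a descriptor ccsd has lost track of, the lookup fails and fenced never reaches method 2.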
> >Thank you,
> >
> >Josef
> >
> >On Thu, Dec 14, 2006 at 03:19:21PM +0000, Marcos David wrote:
> >
> >>Hello,
> >>I still need help with this one ;)
> >>
> >>help! please!
> >>
> >>Thanks.
> >>
> >>Marcos David wrote:
> >>
> >>>Hello,
> >>>I'm experiencing some problems with cluster fencing.
> >>>First, the specs:
> >>>
> >>>It's a two-node cluster (Sun X4100) running RHEL4 Update 4 and RHCS 4.
> >>>
> >>>Both machines have an ILOM device that acts as the first level of fencing;
> >>>a second level of fencing is performed by a UPS.
> >>>
> >>>My problem is the following:
> >>>if I shut down one of the nodes (simulating a power failure), the other
> >>>tries to fence the failed node. So far so good.
> >>>The problem is that, since the ILOM on the failed node is offline, the
> >>>second node keeps trying to fence via the ILOM device and never gives up!
> >>>
> >>>According to what I've read in the FAQ about fencing levels, if the
> >>>first level fails it should go to the second level, and so on...
> >>>
> >>>But it never does this!
> >>>
> >>>Here is a copy of the /var/log/messages:
> >>>
> >>>Dec 11 17:50:28 node_b kernel: CMAN: removing node node_a from the
> >>>cluster : Missed too many heartbeats
> >>>Dec 11 17:50:28 node_b fenced[3240]: node_a not a cluster member after
> >>>0 sec post_fail_delay
> >>>Dec 11 17:50:28 node_b fenced[3240]: fencing node "node_a"
> >>>Dec 11 17:52:47 node_b fenced[3240]: agent "fence_ipmilan" reports:
> >>>Rebooting machine @ IPMI:172.18.56.17...ipmilan: Failed to connect
> >>>after 30 seconds Failed
> >>>Dec 11 17:52:47 node_b ccsd[9390]: process_get: Invalid connection
> >>>descriptor received.
> >>>Dec 11 17:52:47 node_b ccsd[9390]: Error while processing get: Invalid
> >>>request descriptor
> >>>Dec 11 17:52:47 node_b fenced[3240]: fence "node_a" failed
> >>>Dec 11 17:52:52 node_b fenced[3240]: fencing node "node_a"
> >>>
> >>>The last four lines repeat forever...
> >>>
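For what it's worth, the fallback behaviour the FAQ describes can be sketched in a few lines (a simplified Python model, not the actual fenced source; the agent names are illustrative):

```python
def fence_node(methods):
    """Try each fence method in order; stop at the first that succeeds.

    `methods` is an ordered list of callables returning True on success,
    e.g. [ipmi_fence, ups_fence] -- names are assumptions, not real APIs.
    Per the documented behaviour, method 1 failing should make fenced
    fall through to method 2 (and retry from the top only if all fail).
    """
    for method in methods:
        if method():
            return True
    return False

# Even if the ILOM agent fails, the UPS agent should still get its turn:
print(fence_node([lambda: False, lambda: True]))  # True
```

In the logs above, the ccsd error means fenced cannot even look up method 2, so it retries method 1 indefinitely instead of falling through.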
> >>>Here is a copy of the cluster.conf:
> >>>
> >>>
> >>><?xml version="1.0"?>
> >>><cluster config_version="19" name="SERVER-A">
> >>> <fence_daemon post_fail_delay="0" post_join_delay="3"/>
> >>> <clusternodes>
> >>> <clusternode name="node-a" votes="1">
> >>> <fence>
> >>> <method name="1">
> >>> <device name="fence_node-a"/>
> >>> </method>
> >>> <method name="2">
> >>> <device name="UPS_node-a"/>
> >>> </method>
> >>> </fence>
> >>> </clusternode>
> >>> <clusternode name="node-b" votes="1">
> >>> <fence>
> >>> <method name="1">
> >>> <device name="fence_node-b"/>
> >>> </method>
> >>> <method name="2">
> >>> <device name="UPS_node-b"/>
> >>> </method>
> >>> </fence>
> >>> </clusternode>
> >>> </clusternodes>
> >>> <cman expected_votes="1" two_node="1"/>
> >>> <fencedevices>
> >>> <fencedevice agent="fence_ipmilan" auth="password"
> >>>ipaddr="172.18.57.17" login="root" name="fence_node-a"
> >>>passwd="changeme"/>
> >>> <fencedevice agent="fence_ipmilan" auth="password"
> >>>ipaddr="172.18.57.18" login="root" name="fence_node-b"
> >>>passwd="changeme"/>
> >>> <fencedevice agent="fence_apc" ipaddr="172.18.57.20"
> >>>login="power" name="UPS_node-a" passwd="power"/>
> >>> <fencedevice agent="fence_apc" ipaddr="172.18.57.21"
> >>>login="power" name="UPS_node-b" passwd="power"/>
> >>>
> >>> </fencedevices>
> >>> <rm>
> >>> <failoverdomains>
> >>> <failoverdomain name="Cluster_0" ordered="1"
> >>>restricted="0">
> >>> <failoverdomainnode name="node-a"
> >>>priority="1"/>
> >>> <failoverdomainnode name="node-b"
> >>>priority="1"/>
> >>> </failoverdomain>
> >>> </failoverdomains>
> >>> <resources>
> >>> <fs device="/dev/sdb1" force_fsck="1"
> >>>force_unmount="1" fsid="46144" fstype="ext3" mountpoint="/mnt/shared"
> >>>name="Storedge_Shared" options="" self_fence="1"/>
> >>> <ip address="172.18.57.16" monitor_link="1"/>
> >>> <ip address="172.18.57.11" monitor_link="1"/>
> >>> <ip address="172.18.57.14" monitor_link="1"/>
> >>> </resources>
> >>> <service autostart="1" domain="Cluster_0"
> >>>name="postgresql">
> >>> <ip ref="172.18.57.16">
> >>> <fs ref="Storedge_Shared">
> >>>                                        <script
> >>>file="/etc/init.d/postgresql"
> >>>name="PostgreSQL"/>
> >>> </fs>
> >>> </ip>
> >>> </service>
> >>> <service autostart="1" domain="Cluster_0" name="afs">
> >>> <ip ref="172.18.57.14">
> >>> <script file="/etc/init.d/afs"
> >>>name="AFS"/>
> >>> </ip>
> >>> </service>
> >>> </rm>
> >>></cluster>
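As a sanity check that the config itself really defines both levels, the methods can be read back with a few lines of Python (ElementTree; the conf string below is a trimmed copy of the file above, node-a only):

```python
import xml.etree.ElementTree as ET

# Trimmed copy of the cluster.conf quoted above -- fence section only.
CLUSTER_CONF = """<?xml version="1.0"?>
<cluster config_version="19" name="SERVER-A">
  <clusternodes>
    <clusternode name="node-a" votes="1">
      <fence>
        <method name="1"><device name="fence_node-a"/></method>
        <method name="2"><device name="UPS_node-a"/></method>
      </fence>
    </clusternode>
  </clusternodes>
</cluster>"""

def fence_methods(conf_xml, node):
    """Return the ordered (method name, [device names]) pairs for one node."""
    root = ET.fromstring(conf_xml)
    methods = root.findall(f".//clusternode[@name='{node}']/fence/method")
    return [(m.get("name"), [d.get("name") for d in m.findall("device")])
            for m in methods]

print(fence_methods(CLUSTER_CONF, "node-a"))
# [('1', ['fence_node-a']), ('2', ['UPS_node-a'])]
```

Both methods are there and ordered, so the endless retry of method 1 points at the ccsd lookup failure rather than the config.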
> >>>
> >>>I would like to know a way to solve this problem.... :-)
> >>>
> >>>Thanks in advance,
> >>>
> >>>Marcos David
> >>>
> >>>
> >>>
> >>>
> >>>--
> >>>Linux-cluster mailing list
> >>>Linux-cluster at redhat.com
> >>>https://www.redhat.com/mailman/listinfo/linux-cluster
> >>>
> >>>