[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

[Linux-cluster] fence_ilo + HP ProLiant DL580 G5



Hi cluster men,

we are in the process of building a cluster to virtualization a lot
of  low-end servers using xen. Our plan is to use rhcs and clvm
for this but iLO insists on not working...  :-|

The cluster has 2 nodes, two HP ProLiant DL580 G5 (x86_64).
We're using multi-vlan access to reach a lot of networks
and EMC symmetrix more multipath to share the disks. Well,
everything is ok except when I need to use iLO to provide
one secure way for ha. Follows my cluster.conf:

<?xml version="1.0"?>
<cluster name="alpha" config_version="3">
<cman two_node="0" expected_votes="3"/>
<clusternodes>
       <clusternode name="node1.ha" votes="1" nodeid="1"/>
              <fence>
                   <method name="1">
                       <device name="ilo-node1"/>
                    </method>
                    <method name="2">
                       <device name="manual" nodename="node1.ha"/>
                    </method>
               </fence>
       <clusternode name="node2.ha" votes="1" nodeid="2"/>
              <fence>
                   <method name="1">
                       <device name="ilo-node2"/>
                    </method>
                    <method name="2">
                       <device name="manual" nodename="node2.ha"/>
                    </method>
              </fence>
</clusternodes>

<fencedevices>
<fencedevice agent="fence_ilo" hostname="10.127.255.129" login="Administrator" name="ilo-node1" passwd="xxxx"/> <fencedevice agent="fence_ilo" hostname="10.127.255.130" login="Administrator" name="ilo-node2" passwd="xxxx"/>
 <fencedevice agent="fence_manual" name="manual"/>
</fencedevices>

<quorumd device="/dev/mapper/3600604800002877515624d4630383434p1" tko="10" votes="1" log_facility="local6" log_level="7" min_score="1" interval="1"> <heuristic interval="4" tko="3" program="ping -c1 -t3 10.10.10.1" score="1"/> <heuristic interval="4" tko="3" program="ping -c1 -t3 10.10.10.2" score="1"/>
</quorumd>

<rm log_facility="local5" log_level="7">
<failoverdomains>
<failoverdomain name="para_dom" nofailback="1" ordered="1" restricted="0">
       <failoverdomainnode name="node1.ha" priority="1"/>
       <failoverdomainnode name="node2.ha" priority="2"/>
    </failoverdomain>
<failoverdomain name="hvm_dom" nofailback="1" ordered="1" restricted="0">
       <failoverdomainnode name="node1.ha" priority="2"/>
       <failoverdomainnode name="node2.ha" priority="1"/>
    </failoverdomain>
</failoverdomains>
<resources/>

<vm autostart="1" domain="para_dom" exclusive="0" migrate="live" name="rh52-para-virt01" path="/etc/xen"/> <vm autostart="1" domain="hvm_dom" exclusive="0" migrate="live" name="w2003-vm01" path="/etc/xen"/>
</rm>
</cluster>

node1# clustat
Cluster Status for alpha @ Sun Oct 26 21:32:52 2008
Member Status: Quorate

Member Name                                                    ID   Status
------ ----                                                    ---- ------
node1.ha 1 Online, Local, rgmanager node2.ha 2 Online, rgmanager /dev/mapper/3600604800002877515624d4630383434p1 0 Online, Quorum Disk

Service Name Owner (Last) State ------- ---- ----- ------ ----- vm:rh52-para-virt01 node1.ha started vm:w2003-vm01 node2.ha started


Look, when I try to fence the another node it doesn't works.
node1# fence_node node2.ha
node1# echo $?
1
node1# tail -1 /var/log/messages
Oct 26 21:44:44 xxxxx fence_node[1480]: Fence of "node2.ha" was unsuccessful

But if I try to fence via agent it works fine.
node1# ./fence_ilo  -o off  -l Administrator -p xxxx -a 10.127.255.130
success
echo $?
0

# clustat
Cluster Status for alpha @ Sun Oct 26 21:56:36 2008
Member Status: Quorate

Member Name                                                    ID   Status
------ ----                                                    ---- ------
node1.ha 1 Online, Local, rgmanager node2.ha 2 Offline /dev/mapper/3600604800002877515624d4630383434p1 0 Online, Quorum Disk

Service Name Owner (Last) State ------- ---- ----- ------ ----- vm:rh52-para-virt01 node1.ha started vm:w2003-vm01 node2.ha started

Now node2 is offline but the service remains there,
that is, node1 doesn't take over the vm:w2003-vm01
from node2. Follow the messages.log.
node1# tail -50 /var/log/messages
Oct 26 21:44:44 xxxxx fence_node[1480]: Fence of "node2.ha" was unsuccessful
Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] The token was lost in the OPERATIONAL state. Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Oct 26 21:54:49 xxxxx openais[31517]: [TOTEM] entering GATHER state from 2.
Oct 26 21:54:50 xxxxx qdiskd[31565]: <notice> Writing eviction notice for node 2
Oct 26 21:54:51 xxxxx qdiskd[31565]: <notice> Node 2 evicted
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering GATHER state from 0.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Creating commit token because I am the rep. Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Saving state aru 75 high seq received 75 Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Storing new sequence id for ring 14ac
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering COMMIT state.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering RECOVERY state.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] position [0] member 10.127.255.137: Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] previous ring seq 5288 rep 10.127.255.137 Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] aru 75 high delivered 75 received flag 1 Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Did not need to originate any messages in recovery.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] Sending initial ORF token
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] New Configuration:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ]  r(0) ip(10.127.255.137)
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] Members Left:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ]  r(0) ip(10.127.255.138)
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] Members Joined:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] New Configuration:
Oct 26 21:54:54 xxxxx clurgmgrd[31715]: <info> State change: node2.ha DOWN
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ]  r(0) ip(10.127.255.137)
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] Members Left:
Oct 26 21:54:54 xxxxx openais[31517]: [CLM  ] Members Joined:
Oct 26 21:54:54 xxxxx openais[31517]: [SYNC ] This node is within the primary component and will provide service.
Oct 26 21:54:54 xxxxx openais[31517]: [TOTEM] entering OPERATIONAL state.
Oct 26 21:54:54 xxxxx openais[31517]: [CLM ] got nodejoin message 10.127.255.137 Oct 26 21:54:54 xxxxx openais[31517]: [CPG ] got joinlist message from node 1
Oct 26 21:54:54 xxxxx kernel: dlm: closing connection to node 2
Oct 26 21:54:54 xxxxx fenced[31533]: node2.ha not a cluster member after 0 sec post_fail_delay
Oct 26 21:54:54 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:54:54 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:54:59 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:54:59 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:04 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:04 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:09 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:09 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:14 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:14 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:19 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 21:55:19 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 21:55:24 xxxxx fenced[31533]: fencing node "node2.ha"

Until I to force via fenced_override
node1# echo node2.ha  > /var/run/cluster/fenced_override
tail -1 /var/log/messages
Oct 26 22:05:08 xxxxx clurgmgrd[31715]: <notice> Taking over service vm:w2003-vm01 from down member node2.ha

Another example, if I simply to put the iface of heartbeat to
off on node2 (for simulate the problem), the same thing happens.
node2#  ifconfig  eth1 down
node1# tail -50 /var/log/messages
Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] The token was lost in the OPERATIONAL state. Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] Receive multicast socket recv buffer size (288000 bytes). Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).
Oct 26 23:39:07 xxxxx openais[31517]: [TOTEM] entering GATHER state from 2.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] entering GATHER state from 0.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Creating commit token because I am the rep. Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Saving state aru 52 high seq received 52 Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Storing new sequence id for ring 14b4
Oct 26 23:39:12 xxxxx  openais[31517]: [TOTEM] entering COMMIT state.
Oct 26 23:39:12 xxxxx  openais[31517]: [TOTEM] entering RECOVERY state.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] position [0] member 10.127.255.137: Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] previous ring seq 5296 rep 10.127.255.137 Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] aru 52 high delivered 52 received flag 1 Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] Did not need to originate any messages in recovery.
Oct 26 23:39:12 xxxxx  openais[31517]: [TOTEM] Sending initial ORF token
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] New Configuration:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ]  r(0) ip(10.127.255.137)
Oct 26 23:39:12 xxxxx clurgmgrd[31715]: <info> State change: node2.ha DOWN
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] Members Left:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ]  r(0) ip(10.127.255.138)
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] Members Joined:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] CLM CONFIGURATION CHANGE
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] New Configuration:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ]  r(0) ip(10.127.255.137)
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] Members Left:
Oct 26 23:39:12 xxxxx openais[31517]: [CLM  ] Members Joined:
Oct 26 23:39:12 xxxxx openais[31517]: [SYNC ] This node is within the primary component and will provide service.
Oct 26 23:39:12 xxxxx openais[31517]: [TOTEM] entering OPERATIONAL state.
Oct 26 23:39:12 xxxxx openais[31517]: [CLM ] got nodejoin message 10.127.255.137 Oct 26 23:39:12 xxxxx openais[31517]: [CPG ] got joinlist message from node 1
Oct 26 23:39:12 xxxxx kernel: dlm: closing connection to node 2
Oct 26 23:39:12 xxxxx fenced[31533]: node2.ha not a cluster member after 0 sec post_fail_delay
Oct 26 23:39:12 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 23:39:12 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 23:39:17 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 23:39:17 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 23:39:22 xxxxx fenced[31533]: fencing node "node2.ha"
Oct 26 23:39:22 xxxxx fenced[31533]: fence "node2.ha" failed
Oct 26 23:39:27 xxxxx fenced[31533]: fencing node "node2.ha"

node1# clustat
Cluster Status for alpha @ Sun Oct 26 23:41:20 2008
Member Status: Quorate

Member Name                                                    ID   Status
------ ----                                                    ---- ------
node1.ha 1 Online, Local, rgmanager node2.ha 2 Offline /dev/mapper/3600604800002877515624d4630383434p1 0 Online, Quorum Disk

Service Name Owner (Last) State ------- ---- ----- ------ ----- vm:rh52-para-virt01 node1.ha started vm:w2003-vm01 node2.ha started

I believe that node1 had power off node2 via iLO because node2
don't responded anymore but node1 didn't take over the service
like it should to do.

Finally for try to solve this problem I loaded these modules on both nodes
from hp-OpenIPMI-8.1.0-104.rhel5.rpm package but nothing changed.
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_devintf.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_msghandler.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_poweroff.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_si.ko
/opt/hp/hp-OpenIPMI/bin/2.6.18-92.el5xen/ipmi_watchdog.ko


ps. I'm using one rh5.2, kernel-2.6.18-92.el5, cman-2.0.84-2.el5,
rgmanager-2.0.38-2.el5 and iLO 1.50 on HPs.


tks a lot.

--
Renan



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]