[Linux-cluster] pull plug on node, service never relocates

Corey Kovacs corey.kovacs at gmail.com
Mon May 17 21:56:20 UTC 2010


The service scripts you have in the config above look made up. Are
those scripts you wrote, or are you actually using SysV init scripts?

Also, can you include a complete log segment? It's quite hard to debug
someone's problem with only partial information.

Really, I don't want to hear "It should work, I've done everything
right", since clearly something is wrong. I, as have many people here,
have built several if not dozens of these clusters, and we are making
suggestions based on where we have seen the most problems.

It's always funny how the software engineers are never to blame for
their software not working until I prove to them that it's their
fault. Not trying to be a jerk or point fingers, but open your mind a bit.

Sometimes fencing can be hindered simply by being logged into the device
while the cluster is trying to talk to it. First-generation iLOs and some
APC firmware have problems with this.
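
A quick way to rule that out is to drive the fence agent by hand from a
surviving node. Rough sketch only; the PDU address, login, password and
outlet number below are placeholders for whatever is in your
cluster.conf, and option spellings can differ a little between fence_apc
versions:

# ask the PDU for the outlet's status with the same agent the cluster uses
fence_apc -a 192.168.1.50 -l apcuser -p apcpass -n 3 -o status

# or let the cluster run whatever agent/options it has configured for the node
fence_node node1

If either of those hangs while someone has a telnet or web session open
on the PDU, that's your answer.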


What's the output of ...

cman_tool status
cman_tool nodes
cman_tool services
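
It might also help to see rgmanager's view from a couple of different
nodes (clustat ships with rgmanager; every member should agree on what
it prints):

clustat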

Are you sure that all the nodes have the right cluster config?
ccs_tool update /etc/cluster/cluster.conf ?
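
For what it's worth, the usual sequence after hand-editing the file goes
something like this. Treat it as a sketch; the version number is just
whatever you bumped config_version to in the <cluster> tag:

# bump config_version in /etc/cluster/cluster.conf first, then push it out
ccs_tool update /etc/cluster/cluster.conf
# on some releases you also have to tell cman about the new version explicitly
cman_tool version -r 42

If one node is still sitting on an older config_version you can get odd
failover behaviour.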

What are you using to manage the config? ricci/luci,
system-config-cluster, vi?

Can we see a "real" cluster.conf?

Finally, are you by chance using the HA-LVM stuff to manage disks
across nodes, or are you using GFS?
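
If you're not sure which, a quick grep usually settles it (paths below
are the stock locations):

grep locking_type /etc/lvm/lvm.conf
# locking_type = 3 generally means clvmd/GFS, locking_type = 1 plus a
# volume_list filter generally means HA-LVM
grep -E '<lvm |<clusterfs ' /etc/cluster/cluster.conf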


As I said above, and you clearly agree, something is not right and the
more information you can share, the better.




Ockham was right, for the most part.


Corey


On Mon, May 17, 2010 at 10:00 PM, Dusty <dhoffutt at gmail.com> wrote:
> Addendum, and a symptom of this issue is that because node1 does not reboot
> and rejoin the cluster, the service it was running never relocates.
>
> I left it in this state over the weekend. Came back Monday morning, the
> service had still not relocated.
>
> It is not fencing. It is not fencing. Fencing works. Fencing works.
>
> On Mon, May 17, 2010 at 3:58 PM, Dusty <dhoffutt at gmail.com> wrote:
>>
>> I appreciate the help - but I'm saying the same thing for like the fourth
>> or fifth time now.
>>
>> Fencing is working well. All cluster nodes are able to communicate to the
>> fence device (APC PDU).
>>
>> Another example. The cluster is quorate with five nodes and four running
>> services. I'm pulling the plug on node1. I need service1, which happens to be
>> running on node1 right now, to relocate ASAP - not after node1 has rebooted.
>> Node3 is a member of the cluster and is available to accept a service
>> relocation.
>>
>> I have cluster ssh logged into all nodes and am tailing their
>> /var/log/messages file.
>>
>> 14:57 "Pulling the plug" on node1 now (really just turning off the
>> electrical port on the APC).
>> About five seconds later....
>> 14:58:00 node2 fenced[7584] fencing node "192.168.1.1"
>> 14:58:05 node2 fenced[7584] fence "192.168.1.1" success
>> --- Right now the service SHOULD be relocated - but it isn't! ---
>> -- a few minutes later, node1 has rebooted after being successfully fenced
>> via node2 operating the APC PDU.
>> 15:03:43 node3 clurgmgrd[4813]: <notice> Recovering failed service
>> service:service1
>>
>> Second test now - doing the exact same thing, but this time really pulling
>> the plug on node1.
>>
>> Everything happens the same except node2 fencing node1 has no effect
>> because I've simulated a complete node failure on node1. It is not going to
>> boot.
>>
>>
>> On Sat, May 15, 2010 at 1:25 PM, Corey Kovacs <corey.kovacs at gmail.com>
>> wrote:
>>>
>>> The reason I was pointing at the fencing config is that the service
>>> will only re-locate when fenced is able to confirm that the offending
>>> node has been fenced. If this can't happen, then services will not
>>> relocate since the cluster doesn't know the state of all the nodes. If
>>> a node gets an anvil dropped on it, then it should stop responding
>>> and the cluster should then try to invoke the fence on that node to
>>> make sure that it is indeed dead, even if it only cycles the power
>>> port for an already dead node.
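>>>
>>> (A rough way to watch that from a surviving node, by the way; exact
>>> output varies between releases, so treat this as a sketch:
>>>
>>> cman_tool services
>>>
>>> Keep an eye on the fence group there. While a fence is pending it won't
>>> be sitting in its idle "none" state, and rgmanager holds off on recovery
>>> until it is.)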
>>>
>>> Given your description, you should experience the same "problem" if you
>>> simply turn the node off. Normally, when you turn the power off (not
>>> pull the plug) and then boot the node, the cluster either should have
>>> already fenced the node, or it will fence it as it's booting. It looks
>>> odd, but it's correct since the cluster has to get things to a known state.
>>>
>>> After the fence and before the node boots, services should start
>>> migrating. All of this you probably know, but it's worth saying anyway.
>>>
>>> Basically, if your services only migrate after the node boots up, then
>>> I believe fencing is not working properly. The services should migrate
>>> while the node is booting or even before.
>>>
>>> So it appears to me that when you cut the power via the APC yourself, or
>>> pull the plug on the node, you have the same condition.
>>>
>>> The way to really test fencing is to watch the logs on a node and
>>> issue
>>>
>>> cman_tool kill <cluster member>
>>>
>>> to tell cman to fence the node.
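>>>
>>> (On the cman builds I've used the node name goes after -n, e.g.
>>>
>>> cman_tool kill -n node1
>>>
>>> where node1 is one of the member names from cman_tool nodes. Then watch
>>> /var/log/messages on another node for fenced's fence ... success line.)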
>>>
>>> One thought: can all of your cluster nodes talk to the APC at all times?
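>>>
>>> (Worth checking from every node, not just one. Something as simple as
>>>
>>> ping -c 3 <apc-address>
>>>
>>> run on each member, with your PDU's real address filled in, will catch a
>>> node that has lost its path to the fence device.)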
>>>
>>>
>>> -Corey
>>>
>>>
>>>
>>>
>>> On Sat, May 15, 2010 at 5:50 PM, Dusty <dhoffutt at gmail.com> wrote:
>>> > Fencing works great - no problems there. The APC PDU responds
>>> > beautifully to
>>> > any node's attempt to fence.
>>> >
>>> > The issue is this:
>>> >
>>> > The service only relocates after the fenced node reboots and rejoins
>>> > the
>>> > cluster. Then the service relocates to another node. This happens
>>> > reliably and without fail.
>>> >
>>> > But what if the node that was fenced refuses to boot back up because,
>>> > say an
>>> > anvil fell out of the sky and smashed it, or its motherboard fried?
>>> >
>>> > This is what I am simulating by pulling the plug on a node that happens
>>> > to
>>> > be running a service. The service will not relocate until the failed
>>> > node
>>> > has rebooted.
>>> >
>>> > I don't want that. I want the service to relocate ASAP regardless of
>>> > whether the failed node reboots or not.
>>> >
>>> > Thank you so much for your consideration.
>>> >
>>>
>>
>
>



