
Re: [Linux-cluster] Problem with fenced on cluster with 2 BladeCenter machines: 1st machine is remove physically. The remaining one does not became Active (waiting for fenced)



Thistle, Scott wrote:

I am having the same issue. If a blade is not present (i.e. removed for
maintenance), fence_bladecenter cannot check its state because the bay is
reported empty. I think it is something simple to fix for those versed
in Perl. Normally the fence only runs against a blade that is present.
If the blade is removed while running, you run into this issue.

I believe this is what you want to happen: if the state cannot be checked, fenced keeps trying. How else could it determine that it was safe to stop, short of persisting some value such as the number of fence attempts and trying to reason from that? This will not happen if you remove the blade from the cluster before physically removing it. That is a snap to do with one of the UIs, if you are not prejudiced against UIs :).

Also, removing the node from cluster membership before jerking it out of the rack tells rgmanager to move any services off of it - rather than having to depend on heartbeat failure to make this happen.

That said, if the blade catches fire and a cage IT guy notices and jerks it quick, (using his IT Oven Mitt, of course) it is silly for fenced to keep incessantly trying when the thing no longer even exists. Perhaps the correct solution would be to have the fence_bladecenter report success if the bladecenter admin unit reports that 'no status is available' for a particular blade - obviously if the thing is not there, it should be safe to say it is fenced :)
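To make the proposal concrete, here is a minimal sketch of the decision in Python. It is not the fence_bladecenter code (the agent is a Perl script); the function name and constants are hypothetical. It assumes the usual fence-agent convention that exit status 0 means "fenced", uses the "The target bay is empty." reply shown in the transcript later in this thread, and assumes that `power -state` answers "Off" for a powered-down blade.

```python
import re

# Exit codes follow the usual fence-agent convention: 0 = fenced, non-zero = failed.
FENCED = 0
FAILED = 1

# Reply text copied from the management-module transcript in this thread.
EMPTY_BAY = re.compile(r"target bay is empty", re.IGNORECASE)


def fence_exit_code(target_reply, power_state=None):
    """Map management-module replies to an exit code for fenced (hypothetical helper).

    target_reply: reply to `env -T system:blade[N]`
    power_state:  reply to `power -state` after the power-off attempt,
                  or None if that command was never issued
    """
    if EMPTY_BAY.search(target_reply):
        # Proposed behaviour: a blade that is not in the chassis cannot
        # touch shared storage, so report it as fenced.
        return FENCED
    if (target_reply.strip() == "OK"
            and power_state is not None
            and power_state.strip() == "Off"):  # "Off" is an assumed reply
        return FENCED
    # Anything else (timeout, unexpected text, blade still On) stays a
    # failure, so fenced keeps retrying as it does today.
    return FAILED


print(fence_exit_code("The target bay is empty."))  # 0
print(fence_exit_code("OK", "On"))                  # 1
```

Only the explicit empty-bay reply is treated as success; any other unreadable state still fails, which keeps the conservative retry behaviour described above.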

If this addresses your situation (I think it does), now would be a REALLY good time to file a ticket requesting this behavior - like today! I'll post a fixed version to the ticket when it is ready.

Thanks to Lon for discussing this with me...;)

Regards,

-Jim


My case below. Blade #3 is a good node. Blade #2 was removed. The fence
does not work with the blade removed.

system> env -T system:blade[3]
OK
system:blade[3]> power -state
On
system:blade[3]> env -T system:blade[2]
The target bay is empty.
system:blade[3]> env -T system:blade[1]
OK
system:blade[1]>

-----Original Message-----
From: linux-cluster-bounces redhat com
[mailto:linux-cluster-bounces redhat com] On Behalf Of James Parsons
Sent: Thursday, July 12, 2007 12:33 PM
To: linux clustering
Subject: Re: [Linux-cluster] Problem with fenced on cluster with 2
BladeCenter machines: 1st machine is remove physically. The remaining one
does not became Active (waiting for fenced)

catalin lupescu bull net wrote:

Hello!

I have a Red Hat cluster made of 2 IBM blade nodes in a BladeCenter chassis.
(fenced version 1.32.6)

I ran the following test:
I physically removed the node 1 machine (the Active one).
The second one never became the active one. The "clustat" command does not print any information.
In /var/log/messages we find the following messages (repeated):

Jul 11 17:46:24 cdrc1-2 fenced[4214]: fencing node "cdrc1-1"
Jul 11 17:46:38 cdrc1-2 fenced[4214]: agent "fence_bladecenter" reports: pattern match timed-out at /sbin/fence_bladecenter line 185
Jul 11 17:46:38 cdrc1-2 fenced[4214]: fence "cdrc1-1" failed

If node 1 is plugged in, node 2 becomes Active (fenced OK).

bz#240509 changed the sleep timeout in the bladecenter agent from 5 to
10...this is on or about line 193 in /sbin/fence_bladecenter.  See what
yours is set at, and try pushing it out a bit. This minor change is
making its way through the distribution chain now.
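As a rough illustration of why a longer wait can help, here is a small Python sketch of a "wait for a pattern with a timeout" loop. It is not the agent's code (the agent is Perl), and `wait_for_pattern`, `read_chunk`, and the polling interval are hypothetical names and values chosen for the example. The point is only that a slow management module can answer correctly and still trip the timeout if the allowed wait is too short.

```python
import re
import time


def wait_for_pattern(read_chunk, pattern, timeout, poll=0.5,
                     clock=time.monotonic, sleep=time.sleep):
    """Collect output until `pattern` matches or `timeout` seconds pass.

    read_chunk: callable returning whatever new text is available ("" if none)
    clock/sleep: injectable so the loop can be exercised without real delays
    """
    deadline = clock() + timeout
    buffer = ""
    while True:
        buffer += read_chunk()
        if re.search(pattern, buffer):
            return buffer
        if clock() >= deadline:
            # Mirrors the wording of the error seen in /var/log/messages.
            raise TimeoutError("pattern match timed-out")
        sleep(poll)
```

With a reader that only produces the prompt on its fourth poll, a `timeout` of 1 second raises `TimeoutError`, while a `timeout` of 2 seconds returns the prompt; the device's answer is the same in both cases, only the patience differs.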

-j

--
Linux-cluster mailing list
Linux-cluster redhat com
https://www.redhat.com/mailman/listinfo/linux-cluster



