[Linux-cluster] how to disable one node

Wed Jul 6 12:13:38 UTC 2011

Hi all -

I hope someone can shed some light on this.  I have a 2-node
cluster running on RedHat 3 which has a shared /clust1 filesystem
and is connected to a network power switch.  There is something
very wrong with the cluster: it currently reboots whichever node
is the primary every day, for no reason I can track down.  There
are no hardware faults anywhere in the cluster and no failures of
any kind logged in any log files.  It started well over a year
ago, rebooting the primary node every other week; over time it
progressed to once a week, then once a day.  I logged a call with
RedHat way back when it first
started; nothing was ever found to be the problem, and of course
in time, RedHat v3 went out of support and they would no longer
assist in troubleshooting the issue.  Before this problem
started, the cluster had been running happily with no issues for
about 5 years.

Now this cluster is shortly being replaced with new hardware and
RedHat 5, so hopefully whatever the problem is will vanish as
mysteriously as it appeared.  However, I need to stop this daily
reboot as it is playing havoc with the application that runs on
this system (a heavily-utilised database).  Having tried
everything I can think of, I decided to 'break' the cluster;
i.e., take down one node so that only one node remains running
the application.

I cannot find a way to do this that persists across a reboot of
the node that should be out of the cluster.  I've run
"/sbin/chkconfig --del clumanager" and it did take the service
out of chkconfig (I verified this).  The RedHat document
http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/3/html/Cluster_Administration/s1-admin-disable.html
seems to indicate this should persist across a reboot - i.e., you
reboot the node and it does not attempt to rejoin the cluster;
however, this didn't work!  The cluster monitoring software on
the primary node saw that the secondary node was down, STONITH
kicked in, the NPS power-cycled the port this node is connected
to, and the secondary node rebooted and rejoined the cluster!

Does anyone know how either to temporarily remove the secondary
node from the cluster in a way that persists across reboots but
lets it be easily brought back into the cluster when needed, or
else (and preferably) to temporarily stop the cluster monitoring
software running on the primary node from even looking out for
the secondary node - as in, it doesn't care whether the secondary
node is up or not?  I've checked that while the secondary node is
down the primary node is quite happy to carry on processing as
usual, but as soon as the cluster monitoring software on the
primary node realises the secondary node is down, it reboots it,
and I'm back to square one!

This is now really urgent (I've been trying to find an answer to
this for some weeks now) as I go on holiday on Friday and I
really don't want to leave my second-in-command with a mess on
his hands!

thanks
-- 
  Helen Heath
  helen_heath at fastmail.fm

=*=
Everything that has a beginning has an ending. Make your peace with that and all will be well.
-- Buddhist saying
