Hi all -
I hope someone can shed some light on this. I have a 2-node cluster running on RedHat 3 which has a shared /clust1 filesystem and is connected to a network power switch (NPS). There is something very wrong with the cluster: it currently reboots whichever node is primary every day, for no reason I can track down. There are no hardware faults anywhere in the cluster and no failures of any kind recorded in any log files. It started well over a year ago, rebooting the primary node every other week; over time it progressed to once a week, then once a day. I logged a call with RedHat when it first started, but nothing was ever found to be the problem, and in time RedHat v3 went out of support and they would no longer assist in troubleshooting the issue. Before this problem started, the cluster had been running happily with no issues for about 5 years.
This cluster is shortly being replaced with new hardware and RedHat 5, so hopefully whatever the problem is will vanish as mysteriously as it appeared. However, I need to stop this daily reboot, as it is playing havoc with the application that runs on this system (a heavily-utilised database). Having tried everything I can think of, I decided to 'break' the cluster, i.e. take down one node so that only one node remains running the application.
I cannot find a way to do this that persists across a reboot of the node that should be out of the cluster. I've run "/sbin/chkconfig --del clumanager" and it did take the service out of chkconfig (I verified this). The RedHat document http://docs.redhat.com/docs/en-US/Red_Hat_Enterprise_Linux/3/html/Cluster_Administration/s1-admin-disable.html seems to indicate this should persist across a reboot, i.e. you reboot the node and it does not attempt to rejoin the cluster. However, this didn't work! The cluster monitoring software on the primary node saw that the secondary node was down, STONITH kicked in, the NPS power-cycled the port this node is connected to, and the secondary node rebooted and rejoined the cluster!
Does anyone know how to do either of the following? One option is to temporarily remove the secondary node from the cluster in a way that persists across reboots but lets it be easily brought back into the cluster when needed. The other (and my preference) is to temporarily stop the cluster monitoring software on the primary node from even looking out for the secondary node, so that it doesn't care whether the secondary node is up or not. I've checked that, while the secondary node is down, the primary node is quite happy to carry on processing as usual. But as soon as the cluster monitoring software on the primary node realises the secondary node is down, it reboots the secondary, and I'm back to square one!
This is now really urgent (I've been trying to find an answer to this for some weeks now) as I go on holiday on Friday and I really don't want to leave my second-in-command with a mess on his hands!