
Re: [Linux-cluster] Re: Cluster Suite 4 failover problem



I had a very similar set of problems just recently and found that uninstalling the "fence" RPM solved about 90% of them, including a hanging rgmanager which required me to "switch-off-and-switch-on" the servers many times. I suspect that the problem was more to do with my unfamiliarity with fencing, but I wonder whether there are issues when fencing is running with no fence devices in use, and with how the fenced daemon then interacts with rgmanager. I do know that there are similar (since fixed) lockup issues with rgmanager-1.9.46-0, cman-kernel-2.6.9-43.8 and kernel 2.6.9-34, which disappear with an upgrade to kernel 2.6.9-34.0.1 and cman-kernel-smp-2.6.9-43.8.3. That upgrade introduced a new set of problems for me, though, so I rolled back and uninstalled fenced instead, et voilà!

I still find that on occasion I have to kill -9 the rgmanager process (sometimes more than once), and I realise that an unfenced cluster is unsupported, but it solved the problems for me.

Hope this helps,
Jon




Dicky wrote:

Hi,

Thx for the reply. :)

Yes, I have installed the 'fence' RPM and others according to the Red Hat Cluster Suite documentation's "RPM Selection Criteria: Red Hat Cluster Suite with DLM". The following are the RPMs I have installed:

=====RPM Installed=====

ccs, fence, gulm, iddev, magma, magma-plugins, perl-Net-Telnet, system-config-cluster, ipvsadm,
piranha, ccs-devel, gulm-devel, iddev-devel, magma-devel,

====END=======

I didn't install GFS.

Here is the /var/log/messages output when i try to restart the rgmanager service from the failed node after i re-enable eth0:

===/var/log/messages ==

rgmanager: [1074]: <notice> Shutting down Cluster Service Manager...
clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out
clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out
clurgmgrd[31777]: <warning> #67: Shutting down uncleanly
clurgmgrd: [31777]: <info> Executing /etc/rc.d/init.d/vsftpd stop
clurgmgrd: [31777]: <info> Executing /etc/rc.d/init.d/httpd stop
vsftpd: vsftpd shutdown succeeded
clurgmgrd: [31777]: <info> Removing IPv4 address 192.168.0.112 from eth0
httpd: httpd shutdown succeeded
clurgmgrd: [31777]: <info> Removing IPv4 address 192.168.0.111 from eth0

=======END============

Then it hung forever until I manually reset the machine.

I would like to know if the waiting is caused by this line:
clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out
If so, why does it happen and how can I solve it?

Also, even if I type "reboot", it hangs at this line: "Shutting down Cluster Service Manager... Waiting for services to stop: ", which forces me to press the reset button. That may corrupt the file system, so manually pressing the reset button is dangerous.
Is there any way for me to shut down rgmanager properly?


My second question is: why didn't the cluster fail over, even though the status showed that the services were "started"? Is there anything I missed in the configuration process?

Many thanks,
Dicky



Hi,

What is output to the "/var/log/messages" files of each node? That should provide a clue as to what the problem is. Also, did you install the 'fence' RPM and any Clustered LVM / GFS RPMs?

You also might consider rebooting the "downed" node. This function is generally taken care of by fencing devices automatically and, as I understand it, "manual fencing" means you gotta reboot :), the assumption being that a failed node won't be allowed back in the cluster until it's restarted.
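
If it helps when comparing the logs from each node, here is a rough Python sketch for pulling only the rgmanager errors and warnings out of a syslog-style file. It is not part of Cluster Suite; the function name rgmanager_problems and the regular expression are my own assumptions, based purely on the line shapes quoted in this thread ("clurgmgrd[PID]: <level> ..." and "clurgmgrd: [PID]: <level> ..."), so other versions may log differently.

```python
import re

# Hypothetical pattern, inferred only from the log lines quoted in this thread.
# It accepts both "clurgmgrd[31777]: <err> ..." and "clurgmgrd: [31777]: <info> ...".
LINE_RE = re.compile(
    r"(?P<daemon>clurgmgrd|rgmanager):?\s*\[(?P<pid>\d+)\]:\s*"
    r"<(?P<level>\w+)>\s*(?P<msg>.*)"
)


def rgmanager_problems(lines, levels=("err", "warning")):
    """Return (daemon, pid, level, message) for lines at the given levels."""
    hits = []
    for line in lines:
        match = LINE_RE.search(line)
        if match and match.group("level") in levels:
            hits.append(
                (
                    match.group("daemon"),
                    int(match.group("pid")),
                    match.group("level"),
                    match.group("msg").strip(),
                )
            )
    return hits


# Sample input copied from the log excerpt in this thread.
sample = [
    "rgmanager: [1074]: <notice> Shutting down Cluster Service Manager...",
    "clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out",
    "clurgmgrd[31777]: <warning> #67: Shutting down uncleanly",
    "clurgmgrd: [31777]: <info> Executing /etc/rc.d/init.d/vsftpd stop",
    "vsftpd: vsftpd shutdown succeeded",
]

for daemon, pid, level, msg in rgmanager_problems(sample):
    print(f"{daemon}[{pid}] {level}: {msg}")
```

To run it against a real node, you could pass an open file instead of the sample list, e.g. rgmanager_problems(open("/var/log/messages")), and do the same on each node to see which one reported the lock timeout first.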

Thanks,
Jon


--
Linux-cluster mailing list
Linux-cluster redhat com
https://www.redhat.com/mailman/listinfo/linux-cluster


