[Linux-cluster] Re: Cluster Suite 4 failover problem

Dicky dicky_nnc at yahoo.com.hk
Fri Oct 20 02:10:47 UTC 2006


Hi,

Thanks for the reply. :)

Yes, I have installed the 'fence' RPM, along with the other packages listed in the Red Hat Cluster Suite documentation under "RPM Selection Criteria: Red Hat Cluster Suite with DLM". The following are the RPMs I have installed:

=====RPM Installed=====

ccs, fence, gulm, iddev, magma, magma-plugins, perl-Net-Telnet,
system-config-cluster, ipvsadm, piranha, ccs-devel, gulm-devel,
iddev-devel, magma-devel

====END=======

I didn't install GFS.
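
In case it is useful, this is roughly how I double-checked that the packages are present (just a quick sketch; rpm -q is standard, and I added rgmanager to the list myself since I start its service, even though it was not in the selection table above):

=====Example: verify installed RPMs=====

rpm -q ccs fence gulm iddev magma magma-plugins perl-Net-Telnet \
    system-config-cluster ipvsadm piranha rgmanager
# any package that prints "is not installed" is missing

====END=======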

Here is the /var/log/messages output from when I tried to restart the rgmanager service on the failed node after re-enabling eth0:

===/var/log/messages ==
 
rgmanager: [1074]: <notice> Shutting down Cluster Service Manager...
clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out
clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out
clurgmgrd[31777]: <warning> #67: Shutting down uncleanly
clurgmgrd: [31777]: <info> Executing /etc/rc.d/init.d/vsftpd stop
clurgmgrd: [31777]: <info> Executing /etc/rc.d/init.d/httpd stop
vsftpd: vsftpd shutdown succeeded
clurgmgrd: [31777]: <info> Removing IPv4 address 192.168.0.112 from eth0
httpd: httpd shutdown succeeded
clurgmgrd: [31777]: <info> Removing IPv4 address 192.168.0.111 from eth0

=======END============

Then it hung forever, until I manually reset the machine.

I would like to know whether the hang is caused by this line:

clurgmgrd[31777]: <err> #50: Unable to obtain cluster lock: Connection timed out

If so, why does it happen, and how can I solve it?

Also, even when I type "reboot", the shutdown hangs at this line: "Shutting down Cluster Service Manager...
Waiting for services to stop:". That forces me to press the reset button, which may corrupt the file system, so resetting the machine by hand is dangerous.
Is there any way for me to shut down rgmanager properly?
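
From reading the documentation, my understanding is that the clean order would be roughly the following (the service name is only a placeholder for my httpd/vsftpd services), although I am not sure whether it would get past the same lock timeout:

=====Example: stop rgmanager cleanly=====

clusvcadm -d <service_name>   # disable each cluster service first (placeholder name)
service rgmanager stop        # then stop the resource group manager

====END=======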


My second question is: why didn't the cluster fail over, even though the status showed the services as "started"? Is there anything I missed in the configuration process?
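
One thing I am unsure about, following up on the manual fencing point in the reply below, is whether the surviving node is simply waiting for the failed node's fencing to be acknowledged before it will fail the services over. If so, I guess I would have to acknowledge it by hand with something like this (flags from memory, so please correct me if the syntax is different on Cluster Suite 4):

=====Example: acknowledge manual fencing=====

fence_ack_manual -n <failed_node_name>   # confirm the failed node has really been rebooted (placeholder node name)

====END=======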

Many thanks,
Dicky



> Hi,
> 
> What is output to the "/var/log/messages" files of each node? That
> should provide a clue as to what the problem is. Also, did you install
> the 'fence' RPM and any Clustered LVM / GFS RPMs?
> 
> You also might consider rebooting the "downed" node - this function is
> generally taken care of by fencing devices automatically and, as I
> understand it, "manual fencing" means you gotta reboot :), the
> assumption being that a failed node won't be allowed back in the
> cluster until it's restarted.
> 
> Thanks,
> Jon



