Re: [Linux-cluster] Cluster services stopping


We have a similar problem: my server runs three services, but one of them occasionally restarts for no apparent reason.
The server is not under high load.
Weeks or months can pass without any service restart.

The only difference from the other services is that this one uses an ext3 file system on shared external storage.

We have another installation with a similar configuration where this problem has not occurred. I'm going through the fs.sh script to add more debug output. I suspect this script may be reporting some error, which would then trigger the service restart.
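
As a rough idea of the kind of debug output I mean, here is a minimal sketch of a helper that runs a command, records its exit status with a timestamp, and passes the status through unchanged. The function name log_status and the DEBUG_LOG path are my own assumptions for illustration; they are not part of fs.sh or rgmanager, and the real script would need its own checks wrapped by hand.

```shell
# Hypothetical helper, not part of fs.sh: run a command, append its exit
# status to a log file, and return that same status to the caller.
# DEBUG_LOG is an assumed location; change it to suit the host.
DEBUG_LOG="${DEBUG_LOG:-/tmp/fs-debug.log}"

log_status() {
    "$@"                      # run the wrapped command with its arguments
    rc=$?                     # remember its exit status
    echo "$(date '+%b %d %H:%M:%S') cmd='$*' rc=$rc" >> "$DEBUG_LOG"
    return $rc                # the caller sees the original status
}
```

Wrapping a check as `log_status some_check args` would leave one line per call in the log, so a non-zero status just before a restart should be visible afterwards.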

We run RHEL4 U2, with only rgmanager updated, to "rgmanager-1.9.53-0".

I am currently installing RHEL4 U5 for testing, and we will try to update the production host later.
My problem is that I'm not sure this update will fix the issue.
Updating production is "complicated", and I will be in serious trouble if the update does not fix it.

Note: sorry for my bad English.
Scott McClanahan wrote:
I'm trying to figure out why my cluster services keep stopping for what
seems to be no obvious reason.  What the stopped services have in
common is the following set of resources:  1 GFS file system,
1 IP address, and 1 or 2 init scripts.  The init scripts vary between
apache, tomcat, mysql, and squid.

Normally, if a process dies and a status check on the init script
returns non-zero, that event gets logged, but that isn't happening when
these services are stopped.  The first logged event related to a failed
service is shown below; after it, the service is stopped and
recovered.

"May 28 19:11:33 tf36 clurgmgrd[4418]: <notice> Stopping service twapp"

These nodes remain quite idle all of the time and have a lot of
horsepower.  Some helpful information:

[smccl tf36 log]$rpm -q rgmanager cman

[smccl tf36 log]$uname -osrvmpi
Linux 2.6.9-34.ELhugemem #1 SMP Wed Mar 8 00:47:12 CST 2006 i686 i686
i386 GNU/Linux

[smccl tf36 log]$cat /etc/redhat-release
CentOS release 4.3 (Final)

Any help is appreciated.  I can provide more information if you think it
is helpful.  Also, is there some sort of debugging within rgmanager I
can enable to see what is truly failing or timing out and requiring a
restart of these services?

