[Linux-cluster] service stuck in "starting" state
jason at monsterjam.org
Sat Jul 11 00:59:04 UTC 2009
On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
> jason at monsterjam.org wrote:
>> hey cluster gurus..
>> I have a 2-node cluster that's been running without issue for quite a
>> while. All of a sudden, one of the nodes will not completely start the
>> Apache webserver service. It looks like this:
>> [root@tf1 ~]# clustat
>> Member Status: Quorate
>>
>>   Member Name                  Status
>>   ------ ----                  ------
>>   tf1                          Online, Local, rgmanager
>>   tf2                          Online, rgmanager
>>
>>   Service Name         Owner (Last)         State
>>   ------- ----         ----- ------         -----
>>   Apache Service       tf1                  starting
>>   postfix service      tf1                  started
>> [root@tf1 ~]#
>> and I see that httpd is NOT started. However, if I run
>> /etc/init.d/httpd start
>> the service starts without issue.
>> grepping for apache and http in the logs, I see this..
>> Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
>> Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
>> Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
>> Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
>> Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>> Jul 10 14:33:57 tf1 httpd: httpd startup failed
>> Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
>> Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
>> Jul 10 14:34:06 tf1 httpd: httpd startup failed
>> Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
>> Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop
>> Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
>> Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
>> Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
>> Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
>> Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop
>> Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
>> Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
>> [root@tf1 log]# grep apache messages
>> Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>> Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>> Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>> Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>> Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error)
>> [root@tf1 log]#
>> I'm guessing it's the stop on script "cluster_apache" returning 1 (generic
>> error),
>> but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the
>> same size
>> [root@tf2 ~]# ls -al /etc/init.d/httpd
>> -rwxr-xr-x 1 root root 3201 Jan 30 2007 /etc/init.d/httpd
>> [root@tf1 log]# ls -al /etc/init.d/httpd
>> -rwxr-xr-x 1 root root 3201 Jan 30 2007 /etc/init.d/httpd
>> and the apache service starts/stops just fine on tf2 when the services get
>> failed over to that machine.
>> any ideas on what can be wrong?
>
> tf1 is complaining about a bad SSL cert. The fact that it's complaining
> when being started by clurgmgrd but not when started manually indicates
> that clurgmgrd is starting it differently (specifying a different
> httpd.conf file perhaps?).
Well, here's the relevant part of my config file:
<rm>
  <failoverdomains>
    <failoverdomain name="httpd" ordered="1" restricted="1">
      <failoverdomainnode name="tf1" priority="1"/>
      <failoverdomainnode name="tf2" priority="2"/>
    </failoverdomain>
  </failoverdomains>
  <resources>
    <script file="/etc/init.d/httpd" name="cluster_apache"/>
    <ip address="192.168.1.7" monitor_link="1"/>
    <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
  </resources>
  <service autostart="1" domain="httpd" name="Apache Service">
    <ip ref="192.168.1.7"/>
    <script ref="cluster_apache"/>
  </service>
  <service autostart="1" domain="httpd" name="postfix service">
    <ip ref="192.168.1.7"/>
    <script ref="cluster_posstfix"/>
  </service>
</rm>
I've never seen that SSL error when starting the service manually.
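The httpd log lines quoted earlier show startup failing because line 117 of /etc/httpd/conf.d/ssl.conf points SSLCertificateFile at /etc/httpd/conf/ssl.crt/server.crt, which httpd reported as missing or empty. Since that error is raised while the config is parsed, `httpd -t` (or `apachectl configtest`) should report it without starting the server. As a narrower check, here is a minimal sketch of a shell function; the name `check_ssl_files` is hypothetical, and it assumes double-quoted or unquoted paths without spaces and does not follow Include directives.

```shell
#!/bin/sh
# Hypothetical helper (not part of Apache): list SSLCertificateFile and
# SSLCertificateKeyFile paths in one Apache config file that are missing
# or empty, the condition mod_ssl reported in the log above.
# Assumes one path per directive, with no spaces in the path.
check_ssl_files() {
    conf="$1"
    [ -r "$conf" ] || { echo "cannot read: $conf"; return 2; }
    status=0
    # awk splits on whitespace, so indented directives still match and
    # commented-out ones ("#SSLCertificateFile ...") do not; any double
    # quotes around the path are stripped.
    for f in $(awk '$1 == "SSLCertificateFile" || $1 == "SSLCertificateKeyFile" { gsub(/"/, "", $2); print $2 }' "$conf"); do
        # -s is true only for a file that exists and is non-empty.
        if [ ! -s "$f" ]; then
            echo "missing or empty: $f"
            status=1
        fi
    done
    return "$status"
}
```

Running `check_ssl_files /etc/httpd/conf.d/ssl.conf` on both tf1 and tf2 would show whether the certificate is present on one node but not the other.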
The other thing I noticed is that when I run
[root@tf1 cluster]# clusvcadm -d "Apache Service"
Member tf1 disabling Apache Service...
it just hangs there and never returns.
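`clusvcadm -d` asks rgmanager to stop the service, and the earlier grep shows every stop of the `cluster_apache` script resource returning 1 (generic error) while httpd logs "httpd shutdown failed". rgmanager's script resource expects LSB-style exit codes, under which `stop` on a service that is already stopped still returns 0; whether this is also behind the hang is not established here. A minimal diagnostic sketch, assuming a POSIX shell (the function name `check_stop_idempotent` is hypothetical): it runs a script's `stop` action twice and prints both exit codes. Note that it really does stop the service.

```shell
#!/bin/sh
# Hypothetical diagnostic (not part of rgmanager): call an init script's
# "stop" action twice and report both exit codes. Under LSB conventions,
# stopping an already-stopped service should still exit 0.
check_stop_idempotent() {
    script="$1"
    # First stop: the daemon may or may not be running at this point.
    if "$script" stop >/dev/null 2>&1; then first=0; else first=$?; fi
    # Second stop: the daemon should now be down, which is the case a
    # cluster manager hits when it stops a service that is not running.
    if "$script" stop >/dev/null 2>&1; then second=0; else second=$?; fi
    echo "first stop: $first, second stop: $second"
    # Return success only if the "already stopped" case exits 0.
    [ "$second" -eq 0 ]
}
```

For example, `check_stop_idempotent /etc/init.d/httpd` on each node; a nonzero second value would match the "returned 1 (generic error)" messages above.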
Jason