
Re: [Linux-cluster] service stuck in "starting" state



jason monsterjam org wrote:
On Fri, Jul 10, 2009 at 04:50:12PM -0700, Rick Stevens wrote:
jason monsterjam org wrote:
hey cluster gurus..
I have a 2 node cluster that's been running without issue for quite a while.. all of a sudden one of the nodes will not completely start the apache webserver service.. it looks like this
[root tf1 ~]# clustat
Member Status: Quorate
  Member Name                              Status
  ------ ----                              ------
  tf1                                      Online, Local, rgmanager
  tf2                                      Online, rgmanager
  Service Name         Owner (Last)                   State
  ------- ----         ----- ------                   -----
  Apache Service       tf1                            starting
  postfix service      tf1                            started
[root tf1 ~]#
and I see that the httpd is NOT started. although, if I do /etc/init.d/httpd start
the service starts without issue.
grepping for apache and http in the logs, I see this..
Jul 10 14:32:13 tf1 httpd: httpd shutdown failed
Jul 10 14:32:52 tf1 httpd: httpd shutdown failed
Jul 10 14:33:11 tf1 httpd: httpd shutdown failed
Jul 10 14:33:57 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
Jul 10 14:33:57 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
Jul 10 14:33:57 tf1 httpd: httpd startup failed
Jul 10 14:34:06 tf1 httpd: Syntax error on line 117 of /etc/httpd/conf.d/ssl.conf:
Jul 10 14:34:06 tf1 httpd: SSLCertificateFile: file '/etc/httpd/conf/ssl.crt/server.crt' does not exist or is empty
Jul 10 14:34:06 tf1 httpd: httpd startup failed
Jul 10 14:34:08 tf1 httpd: httpd shutdown failed
Jul 10 16:23:33 tf1 clurgmgrd: [6168]: <info> Executing /etc/init.d/httpd stop
Jul 10 16:23:34 tf1 httpd: httpd shutdown failed
Jul 10 16:24:31 tf1 httpd: httpd shutdown failed
Jul 10 16:24:36 tf1 httpd: httpd shutdown failed
Jul 10 16:24:41 tf1 httpd: httpd startup succeeded
Jul 10 18:10:13 tf1 clurgmgrd: [6231]: <info> Executing /etc/init.d/httpd stop
Jul 10 18:10:13 tf1 httpd: httpd shutdown failed
Jul 10 18:22:00 tf1 httpd: httpd startup succeeded
[root tf1 log]# grep apache  messages
Jul 10 04:40:00 tf1 clurgmgrd[6267]: <notice> stop on script "cluster_apache" returned 1 (generic error)
Jul 10 10:04:33 tf1 clurgmgrd[6149]: <notice> stop on script "cluster_apache" returned 1 (generic error)
Jul 10 14:29:54 tf1 clurgmgrd[6281]: <notice> stop on script "cluster_apache" returned 1 (generic error)
Jul 10 16:23:34 tf1 clurgmgrd[6168]: <notice> stop on script "cluster_apache" returned 1 (generic error)
Jul 10 18:10:13 tf1 clurgmgrd[6231]: <notice> stop on script "cluster_apache" returned 1 (generic error)
[root tf1 log]#
I'm guessing it's the
stop on script "cluster_apache" returned 1 (generic error)
but I looked at the /etc/init.d/httpd on tf1 and tf2 and they are both the same size
[root tf2 ~]# ls -al /etc/init.d/httpd
-rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
[root tf1 log]# ls -al /etc/init.d/httpd
-rwxr-xr-x  1 root root 3201 Jan 30  2007 /etc/init.d/httpd
and the apache service starts/stops just fine on tf2 when the services get failed over to that machine.
any ideas on what can be wrong?
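
Matching sizes and timestamps do not prove that the two init scripts are byte-for-byte identical. A minimal sketch of a content comparison, assuming both copies are reachable on one host (for example after copying tf2's script over to tf1); the function name same_content is illustrative only:

```shell
#!/bin/sh
# same_content A B: print "identical" if the two files have the same bytes,
# otherwise print "different" and return non-zero.
# (same_content is a hypothetical helper name, not part of any cluster tool.)
same_content() {
    if cmp -s "$1" "$2"; then
        echo "identical"
    else
        echo "different"
        return 1
    fi
}
```

Without copying files around, running md5sum /etc/init.d/httpd on each node and comparing the two digests answers the same question.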
tf1 is complaining about a bad SSL cert.  The fact that it's complaining
when being started by clurgmgrd but not when started manually indicates
that clurgmgrd is starting it differently (specifying a different
httpd.conf file perhaps?).
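
The httpd error in the log says the certificate file "does not exist or is empty", and that can be tested directly on tf1. A small sketch, with check_cert as a hypothetical helper name; the path mentioned in the note below is the one from the log above:

```shell
#!/bin/sh
# check_cert FILE: report whether FILE exists, is readable by the current
# user, and is non-empty. Returns 0 only when all three hold.
# (check_cert is an illustrative name, not an Apache or rgmanager tool.)
check_cert() {
    f=$1
    if [ ! -e "$f" ]; then
        echo "missing: $f"
        return 1
    fi
    if [ ! -r "$f" ]; then
        echo "unreadable: $f"
        return 1
    fi
    if [ ! -s "$f" ]; then
        echo "empty: $f"
        return 1
    fi
    echo "ok: $f"
}
```

On tf1 this would be called as check_cert /etc/httpd/conf/ssl.crt/server.crt. Running httpd -t afterwards re-checks the configuration syntax without starting the server.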

well, here's the relevant part of my config file
        <rm>
                <failoverdomains>
                        <failoverdomain name="httpd" ordered="1" restricted="1">
                                <failoverdomainnode name="tf1" priority="1"/>
                                <failoverdomainnode name="tf2" priority="2"/>
                        </failoverdomain>
                </failoverdomains>
                <resources>
                        <script file="/etc/init.d/httpd" name="cluster_apache"/>
                        <ip address="192.168.1.7" monitor_link="1"/>
                        <script file="/etc/init.d/postfix" name="cluster_posstfix"/>
                </resources>
                <service autostart="1" domain="httpd" name="Apache Service">
                        <ip ref="192.168.1.7"/>
                        <script ref="cluster_apache"/>
                </service>
                <service autostart="1" domain="httpd" name="postfix service">
                        <ip ref="192.168.1.7"/>
                        <script ref="cluster_posstfix"/>
                </service>
        </rm>

I've never seen that ssl error when starting the service manually.


the other thing that I noticed.. is that when I try to do
[root tf1 cluster]# clusvcadm -d "Apache Service"
Member tf1 disabling Apache Service...

it just hangs there and never returns.

Sorry about the delay in responding.  Was out of town for the weekend.

Does clusvcadm or clurgmgrd run as a different user...one that either
can't read the SSL certs or the directory containing them?  Normally
the stuff in /etc/init.d runs as root.  Running one of those scripts as
a different user can lead to lots of permissions issues.  It's bitten
me before.
----------------------------------------------------------------------
- Rick Stevens, Systems Engineer                      ricks nerd com -
- AIM/Skype: therps2        ICQ: 22643734            Yahoo: origrps2 -
-                                                                    -
- Millihelen, adj: The amount of beauty required to launch one ship. -
----------------------------------------------------------------------
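
One way to look for the permissions problem described above is to list every directory on the way to the certificate, since a single inaccessible parent directory is enough to hide the file from a non-root user. A rough sketch; walk_path is a made-up helper name, and it only prints ownership and mode bits for a human to read:

```shell
#!/bin/sh
# walk_path PATH: run "ls -ld" on PATH and on each parent directory up to
# the root, so the owner and mode of every component can be inspected.
# Returns non-zero as soon as a component cannot be listed.
# (walk_path is a hypothetical helper, not a standard utility.)
walk_path() {
    p=$1
    while :; do
        ls -ld "$p" || return 1
        # Stop at the filesystem root, or at "." for relative paths.
        if [ "$p" = "/" ] || [ "$p" = "." ]; then
            break
        fi
        p=$(dirname "$p")
    done
}
```

For the file named in the log this would be walk_path /etc/httpd/conf/ssl.crt/server.crt. With procps, ps -o user,pid,args -C clurgmgrd shows which user clurgmgrd is running as.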

