[Linux-cluster] why all services stops when a node reboots?

ESGLinux esggrupos at gmail.com
Fri Feb 13 10:44:35 UTC 2009


Hello all,

Following up on the problem, can anyone explain this?

The commands below were all run within about one minute.

Disable the service:
[root at NODE2 log]# clusvcadm -d BBDD
Local machine disabling service:BBDD...Yes

Enable the service:
[root at NODE2 log]# clusvcadm -e BBDD
Local machine trying to enable service:BBDD...Success
service:BBDD is now running on node2
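
For reference, clustat can be used after each step to confirm which node currently owns the
service and what state it is in (service:BBDD should show as "started" on the owning node):

[root at NODE2 log]# clustat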

It's OK, the service is running on node2. Now try to relocate it to node1:
[root at NODE2 log]# clusvcadm -r BBDD -m node1
Trying to relocate service:BBDD to node1...Success

It works! Fine, try to relocate it back to node2:


service:BBDD is now running on node1
[root at NODE2 log]# clusvcadm -r BBDD -m node2
Trying to relocate service:BBDD to node2...Success


It works again! I can't believe it. Try to relocate to node1 once more:

service:BBDD is now running on node2
[root at NODE2 log]# clusvcadm -r BBDD -m node1
Trying to relocate service:BBDD to node1...Failure

Oops, it fails! Why? Why did it work 30 seconds ago but fail now?

In this situation all I can do is disable and enable the service again to get it
working. It never comes back up automatically...
[root at NODE2 log]# clusvcadm -d BBDD
Local machine disabling service:BBDD...Yes
[root at NODE2 log]# clusvcadm -e BBDD
Local machine trying to enable service:BBDD...Success
service:BBDD is now running on node2
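
By the way, one way to chase the stop failure outside of rgmanager (assuming the rg_test
utility that ships with rgmanager is available, and with the service disabled first so the
two do not fight over it) is to run the stop phase of the service by hand and watch the
resource agents' output:

[root at NODE2 log]# rg_test test /etc/cluster/cluster.conf stop service BBDD

This runs the same stop sequence that clurgmgrd runs during a relocation, so it should
reproduce the "Application Is Still Running" error if the mysql agent is the culprit.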

Any explanation for this behaviour?

I'm completely astonished :-(

TIA

ESG


2009/2/13 ESGLinux <esggrupos at gmail.com>

> More clues,
>
> Using system-config-cluster:
>
> When I try to start a service that is in the failed state I always get an error.
> I have to disable the service first, to get it into the disabled state. From that state
> I can restart the services.
>
> I think I have a problem with the relocation, because I cannot do it with luci, nor with
> system-config-cluster, nor with clusvcadm.
>
> I always get an error when I try it.
>
> greetings
>
> ESG
>
>
> 2009/2/13 ESGLinux <esggrupos at gmail.com>
>
>> Hello,
>>
>> The services run OK on node1. If I halt node2 and try to run the services,
>> they run OK on node1.
>> If I run the services without the cluster they also run OK.
>>
>> I have removed the HTTP service and left only the BBDD service in order to
>> debug the problem. Here is the log from when the service is running on node2 and
>> node1 comes up:
>>
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering GATHER state from 11.
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Creating commit token because I am the rep.
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Saving state aru 1a high seq received 1a
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Storing new sequence id for ring 17f4
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering COMMIT state.
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering RECOVERY state.
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [0] member 192.168.1.185:
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.185
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 1a high delivered 1a received flag 1
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] position [1] member 192.168.1.188:
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] previous ring seq 6128 rep 192.168.1.188
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] aru 9 high delivered 9 received flag 1
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Did not need to originate any messages in recovery.
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] Sending initial ORF token
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] CLM CONFIGURATION CHANGE
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] New Configuration:
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.185)
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Left:
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] Members Joined:
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ]    r(0) ip(192.168.1.188)
>> Feb 13 09:16:00 NODE2 openais[3326]: [SYNC ] This node is within the
>> primary component and will provide service.
>> Feb 13 09:16:00 NODE2 openais[3326]: [TOTEM] entering OPERATIONAL state.
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message
>> 192.168.1.185
>> Feb 13 09:16:00 NODE2 openais[3326]: [CLM  ] got nodejoin message
>> 192.168.1.188
>> Feb 13 09:16:00 NODE2 openais[3326]: [CPG  ] got joinlist message from
>> node 2
>> Feb 13 09:16:03 NODE2 kernel: dlm: connecting to 1
>> Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Relocating service:BBDD to
>> better node node1
>> Feb 13 09:16:24 NODE2 clurgmgrd[4001]: <notice> Stopping service
>> service:BBDD
>> Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb
>> > Failed - Application Is Still Running
>> Feb 13 09:16:25 NODE2 clurgmgrd: [4001]: <err> Stopping Service mysql:mydb
>> > Failed
>> Feb 13 09:16:25 NODE2 clurgmgrd[4001]: <notice> stop on mysql "mydb"
>> returned 1 (generic error)
>> Feb 13 09:16:25 NODE2 avahi-daemon[3872]: Withdrawing address record for
>> 192.168.1.183 on eth0.
>> Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <crit> #12: RG service:BBDD failed
>> to stop; intervention required
>> Feb 13 09:16:35 NODE2 clurgmgrd[4001]: <notice> Service service:BBDD is
>> failed
>> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <warning> #70: Failed to relocate
>> service:BBDD; restarting locally
>> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <err> #43: Service service:BBDD has
>> failed; can not start.
>> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #2: Service service:BBDD
>> returned failure code.  Last Owner: node2
>> Feb 13 09:16:36 NODE2 clurgmgrd[4001]: <alert> #4: Administrator
>> intervention required.
>>
>>
>> As you can see, there is the message "Relocating service:BBDD to better node
>> node1".
>>
>> But it fails
>>
>> Another error that appears frequently in my logs is this one:
>>
>> <err> Checking Existence Of File /var/run/cluster/mysql/mysql:mydb.pid
>> [mysql:mydb] > Failed - File Doesn't Exist
>>
>> I don't know if this is important, but I think it is what causes the message "<err>
>> Stopping Service mysql:mydb > Failed - Application Is Still Running", and that in turn
>> makes the service fail (I'm just guessing...).
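>>
>> For what it is worth, a quick manual check (just a sketch; the pid file path is the one
>> from the log above and /etc/my.cnf is the config_file from my cluster.conf) would be
>> something like:
>>
>> # does the pid file the resource agent looks for exist?
>> ls -l /var/run/cluster/mysql/mysql:mydb.pid
>> # is mysqld actually running, and with which pid file was it started?
>> ps aux | grep mysqld
>> grep -i pid /etc/my.cnf
>>
>> If mysqld is running with a pid file other than the one the agent expects (for example
>> because it was started from the init script instead of by rgmanager), the agent cannot
>> confirm the shutdown, which could explain the "Application Is Still Running" error.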
>>
>> Any idea?
>>
>>
>> ESG
>>
>>
>> 2009/2/12 rajveer singh <torajveersingh at gmail.com>
>>
>>> Hi,
>>>
>>> OK, perhaps there is some problem with the services on node1. Are you able to run
>>> these services on node1 without the cluster? First stop the cluster, then try to run
>>> the services on node1.
>>>
>>> They should run.
>>>
>>> Re,
>>> Rajveer Singh
>>>
>>> 2009/2/13 ESGLinux <esggrupos at gmail.com>
>>>
>>> Hello,
>>>>
>>>> That's what I want: when node1 comes up I want the service to relocate to node1, but
>>>> what I get is all my services stopped and in the failed state.
>>>>
>>>> With my configuration I expect to end up with the services running on node1.
>>>>
>>>> Any idea about this behaviour?
>>>>
>>>> Thanks
>>>>
>>>> ESG
>>>>
>>>>
>>>> 2009/2/12 rajveer singh <torajveersingh at gmail.com>
>>>>
>>>>
>>>>>
>>>>> 2009/2/12 ESGLinux <esggrupos at gmail.com>
>>>>>
>>>>>>  Hello all,
>>>>>>
>>>>>> I'm testing a cluster using luci as the admin tool. I have configured 2
>>>>>> nodes with 2 services, http + mysql. This configuration works almost fine. I
>>>>>> have the services running on node1 and I reboot node1. Then the services
>>>>>> relocate to node2 and everything continues working, but when node1 comes back
>>>>>> up all the services stop.
>>>>>>
>>>>>> I think that node1, when it comes back alive, tries to run the services,
>>>>>> and that makes the services stop. Can that be true? I think node1 should not
>>>>>> start anything, because the services are already running on node2.
>>>>>>
>>>>>> Perhaps it is a problem with the configuration, perhaps with fencing (I
>>>>>> have not configured fencing at all).
>>>>>>
>>>>>> Here is my cluster.conf. Any idea?
>>>>>>
>>>>>> Thanks in advance.
>>>>>>
>>>>>> ESG
>>>>>>
>>>>>>
>>>>>> <?xml version="1.0"?>
>>>>>> <cluster alias="MICLUSTER" config_version="29" name="MICLUSTER">
>>>>>>         <fence_daemon clean_start="0" post_fail_delay="0" post_join_delay="3"/>
>>>>>>         <clusternodes>
>>>>>>                 <clusternode name="node1" nodeid="1" votes="1">
>>>>>>                         <fence/>
>>>>>>                 </clusternode>
>>>>>>                 <clusternode name="node2" nodeid="2" votes="1">
>>>>>>                         <fence/>
>>>>>>                 </clusternode>
>>>>>>         </clusternodes>
>>>>>>         <cman expected_votes="1" two_node="1"/>
>>>>>>         <fencedevices/>
>>>>>>         <rm>
>>>>>>                 <failoverdomains>
>>>>>>                         <failoverdomain name="DOMINIOFAIL" nofailback="0" ordered="1" restricted="1">
>>>>>>                                 <failoverdomainnode name="node1" priority="1"/>
>>>>>>                                 <failoverdomainnode name="node2" priority="2"/>
>>>>>>                         </failoverdomain>
>>>>>>                 </failoverdomains>
>>>>>>                 <resources>
>>>>>>                         <ip address="192.168.1.183" monitor_link="1"/>
>>>>>>                 </resources>
>>>>>>                 <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="HTTP" recovery="relocate">
>>>>>>                         <apache config_file="conf/httpd.conf" name="http" server_root="/etc/httpd" shutdown_wait="0"/>
>>>>>>                         <ip ref="192.168.1.183"/>
>>>>>>                 </service>
>>>>>>                 <service autostart="1" domain="DOMINIOFAIL" exclusive="0" name="BBDD" recovery="relocate">
>>>>>>                         <mysql config_file="/etc/my.cnf" listen_address="192.168.1.183" name="mydb" shutdown_wait="0"/>
>>>>>>                         <ip ref="192.168.1.183"/>
>>>>>>                 </service>
>>>>>>         </rm>
>>>>>> </cluster>
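>>>>>>
>>>>>> In case fencing turns out to matter: <fencedevices/> is empty above. Purely as an
>>>>>> illustration of what a minimal manual-fencing setup would look like (not something I
>>>>>> have actually configured or tested), the stanzas would be roughly:
>>>>>>
>>>>>>         <fencedevices>
>>>>>>                 <fencedevice agent="fence_manual" name="manual"/>
>>>>>>         </fencedevices>
>>>>>>
>>>>>> and, inside each <clusternode>:
>>>>>>
>>>>>>         <fence>
>>>>>>                 <method name="1">
>>>>>>                         <device name="manual" nodename="node1"/>
>>>>>>                 </method>
>>>>>>         </fence>
>>>>>>
>>>>>> with nodename adjusted per node. Manual fencing is only meant for testing.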
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>> Hi ESG,
>>>>>
>>>>> Of course: as you have defined the priority of node1 as 1 and node2 as 2,
>>>>> node1 has the higher priority, so whenever it comes up it will try to run the
>>>>> service itself, and the cluster will relocate the service from node2 back to
>>>>> node1.
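>>>>>
>>>>> If you do not want that automatic failback, one option (just a sketch against the
>>>>> cluster.conf you posted, not something I have tested on your nodes) is to set
>>>>> nofailback="1" on the failover domain, so the service stays where it is when node1
>>>>> rejoins:
>>>>>
>>>>>         <failoverdomain name="DOMINIOFAIL" nofailback="1" ordered="1" restricted="1">
>>>>>                 <failoverdomainnode name="node1" priority="1"/>
>>>>>                 <failoverdomainnode name="node2" priority="2"/>
>>>>>         </failoverdomain>
>>>>>
>>>>> That only avoids the relocation, though; it does not explain why stopping mysql:mydb
>>>>> fails when the relocation is attempted.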
>>>>>
>>>>>
>>>>> Re,
>>>>> Rajveer Singh
>>>>>
>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>>
>>
>>
>

