[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

[Linux-cluster] what happens during orderly shutdown ?



Hi,

 

We have hit a snag during testing of a 3-node cluster, where only nodes 1 and 2 are allowed to run the service.

 

If we shutdown the node running the service, the service first starts on the surviving node, but is then stopped.  This is not an activity we would normally expect to do, but some of the environment monitoring in our servers can trigger this situation - temperature getting too high etc.

 

What I think happens is the following:

 

1.    host 1: shutdown calls on rgmanager to stop:  this takes several seconds, stopping the service.

2.    host 2: rgmanager starts to relocate the service to the surviving node

3.    host 1: shutdown calls on cman to stop

4.    host 2: while relocating the service, cman enters gather state because host 1 has dropped out.  The service is still starting and triggers "Failed changing service status" in log below.

5.    host 1: halts

6.    host 2: the service completes start up.  The cluster reconfiguration completes. 

7.    host 2: the service is stopped.

 

Is this a bug in the cluster system, or just an unfortunate sequence of events that the cluster cannot cope with, i.e. node failure while a service is relocating ?

 

regards,

Martin

 

Here is the log from node 2:

 

Aug 16 16:58:40 gmhcluster2 clurgmgrd[4132]: <notice> Member 1 shutting down

Aug 16 16:58:40 gmhcluster2 openais[4084]: [TOTEM] Retransmit List: 86

Aug 16 16:58:46 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped service service:MASTERVIP

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] The token was lost in the OPERATIONAL state.

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Receive multicast socket recv buffer size (320000 bytes).

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Transmit multicast socket send buffer size (262142 bytes).

Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state from 2.

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <err> #47: Failed changing service status

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #71: Relocating failed service service:MASTERVIP

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #70: Failed to relocate service:MASTERVIP; restarting locally

Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped service service:MASTERVIP

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state from 0.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Creating commit token because I am the rep.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Saving state aru 90 high seq received 90

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Storing new sequence id for ring 1e4

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering COMMIT state.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering RECOVERY state.

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [0] member 172.30.31.4:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480 rep 172.30.31.3

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered 90 received flag 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [1] member 172.30.31.5:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480 rep 172.30.31.3

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered 90 received flag 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Did not need to originate any messages in recovery.

Aug 16 16:59:15 gmhcluster2 kernel: dlm: closing connection to node 1

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Sending initial ORF token

Aug 16 16:59:15 gmhcluster2 clurgmgrd: [4132]: <info> Executing /usr/sbin/dc-cluster-resourceagent start

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM CONFIGURATION CHANGE

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New Configuration:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0) ip(172.30.31.4)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0) ip(172.30.31.5)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Left:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0) ip(172.30.31.3)

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Joined:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM CONFIGURATION CHANGE

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New Configuration:

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0) ip(172.30.31.4)

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - ACTIVESITE in state started

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0) ip(172.30.31.5)

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - ACTIVESITE service is running - ACTIVESITE in state started

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Left:

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: Performing precautionary updates to databases before setting up VIP

Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Joined:

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: running precautionary update [touch /tmp/dc-set-dr-counters]

Aug 16 16:59:15 gmhcluster2 openais[4084]: [SYNC ] This node is within the primary component and will provide service.

Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492] INFO - Service MASTERVIP: running precautionary update [touch /tmp/dc-set-dr-auto-increments]

Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering OPERATIONAL state.

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got nodejoin message 172.30.31.4

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got nodejoin message 172.30.31.5

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got joinlist message from node 3

Aug 16 16:59:16 gmhcluster2 clurgmgrd: [4132]: <info> Adding IPv4 address 172.30.30.93/24 to bond0

Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got joinlist message from node 2

Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Service service:MASTERVIP started

Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Stopping service service:MASTERVIP

Aug 16 16:59:17 gmhcluster2 clurgmgrd: [4132]: <info> Removing IPv4 address 172.30.30.93/24 from bond0

Aug 16 16:59:27 gmhcluster2 clurgmgrd: [4132]: <info> Executing /usr/sbin/dc-cluster-resourceagent stop

Aug 16 16:59:27 gmhcluster2 clurgmgrd[4132]: <notice> Service service:MASTERVIP is stopped

 

 

 

 

 


[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]