[Linux-cluster] what happens during orderly shutdown ?

taiyeb jadliwala taiyebjadliwala1986 at gmail.com
Mon Aug 16 19:02:15 UTC 2010


Hi Martijn,

I would like to learn how to set up a 3-node cluster environment. Could you
suggest a good book, or point me to some technical articles that I can
follow and try out on my test server?
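In the meantime, the restricted two-node failover domain that Martin describes below can be sketched in cluster.conf roughly as follows. This is only a hypothetical fragment (the cluster, node, and service names are made up, and the fencing section is omitted); the Red Hat Cluster Suite documentation has the authoritative schema:

```xml
<!-- Hypothetical cluster.conf fragment: a 3-node cluster in which only
     node1 and node2 may run the service; node3 contributes quorum only. -->
<cluster name="testcluster" config_version="1">
  <clusternodes>
    <clusternode name="node1" nodeid="1"/>
    <clusternode name="node2" nodeid="2"/>
    <clusternode name="node3" nodeid="3"/>
  </clusternodes>
  <rm>
    <failoverdomains>
      <!-- restricted="1": the service may ONLY run on listed members -->
      <failoverdomain name="svc-domain" restricted="1" ordered="1">
        <failoverdomainnode name="node1" priority="1"/>
        <failoverdomainnode name="node2" priority="2"/>
      </failoverdomain>
    </failoverdomains>
    <service name="MASTERVIP" domain="svc-domain" autostart="1">
      <ip address="172.30.30.93" monitor_link="1"/>
    </service>
  </rm>
</cluster>
```

With restricted="1", rgmanager will never start the service on node3, which matches the behaviour Martin describes.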

On Tue, Aug 17, 2010 at 12:03 AM, Martijn <martijn at tenheuvel.net> wrote:

>
>
> -original message-
> Subject: [Linux-cluster] what happens during orderly shutdown ?
> From: "Martin Waite" <Martin.Waite at datacash.com>
> Date: 16-08-2010 19:05
>
> Hi,
>
>
>
> We have hit a snag during testing of a 3-node cluster, where only nodes
> 1 and 2 are allowed to run the service.
>
>
>
> If we shut down the node running the service, the service first starts on
> the surviving node, but is then stopped.  This is not an activity we
> would normally expect to perform, but some of the environment monitoring
> in our servers can trigger this situation (e.g. temperature getting too
> high).
>
>
>
> What I think happens is the following:
>
>
>
> 1.    host 1: shutdown calls on rgmanager to stop:  this takes several
> seconds, stopping the service.
>
> 2.    host 2: rgmanager starts to relocate the service to the surviving
> node
>
> 3.    host 1: shutdown calls on cman to stop
>
> 4.    host 2: while relocating the service, cman enters gather state
> because host 1 has dropped out.  The service is still starting and
> triggers "Failed changing service status" in log below.
>
> 5.    host 1: halts
>
> 6.    host 2: the service completes start up.  The cluster
> reconfiguration completes.
>
> 7.    host 2: the service is stopped.
>
>
>
> Is this a bug in the cluster system, or just an unfortunate sequence of
> events that the cluster cannot cope with, i.e. node failure while a
> service is relocating?
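For reference, the sequence above corresponds roughly to what the init scripts run on the departing node, and one possible (untested) workaround is to relocate the service explicitly before rgmanager is stopped, so the relocation completes while the node is still a cluster member. The service and member names are taken from the log below; treat this as a sketch, not a verified procedure:

```shell
# What the shutdown runlevel does on the departing node (RHEL 5 style):
service rgmanager stop    # stops/relocates services (step 1 above)
service cman stop         # leaves the cluster membership (step 3 above)

# Hedged workaround: move the service off this node first, so the
# relocation finishes before cman drops out of the membership:
clusvcadm -r service:MASTERVIP -m gmhcluster2   # relocate to the survivor
service rgmanager stop
service cman stop
```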
>
>
>
> regards,
>
> Martin
>
>
>
> Here is the log from node 2:
>
>
>
> Aug 16 16:58:40 gmhcluster2 clurgmgrd[4132]: <notice> Member 1 shutting
> down
>
> Aug 16 16:58:40 gmhcluster2 openais[4084]: [TOTEM] Retransmit List: 86
>
> Aug 16 16:58:46 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped
> service service:MASTERVIP
>
> Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] The token was lost in
> the OPERATIONAL state.
>
> Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Receive multicast
> socket recv buffer size (320000 bytes).
>
> Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] Transmit multicast
> socket send buffer size (262142 bytes).
>
> Aug 16 16:58:55 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state
> from 2.
>
> Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <err> #47: Failed changing
> service status
>
> Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #71: Relocating
> failed service service:MASTERVIP
>
> Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <warning> #70: Failed to
> relocate service:MASTERVIP; restarting locally
>
> Aug 16 16:59:01 gmhcluster2 clurgmgrd[4132]: <notice> Starting stopped
> service service:MASTERVIP
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering GATHER state
> from 0.
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Creating commit token
> because I am the rep.
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Saving state aru 90
> high seq received 90
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Storing new sequence
> id for ring 1e4
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering COMMIT
> state.
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering RECOVERY
> state.
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [0] member
> 172.30.31.4:
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480
> rep 172.30.31.3
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered
> 90 received flag 1
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] position [1] member
> 172.30.31.5:
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] previous ring seq 480
> rep 172.30.31.3
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] aru 90 high delivered
> 90 received flag 1
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Did not need to
> originate any messages in recovery.
>
> Aug 16 16:59:15 gmhcluster2 kernel: dlm: closing connection to node 1
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] Sending initial ORF
> token
>
> Aug 16 16:59:15 gmhcluster2 clurgmgrd: [4132]: <info> Executing
> /usr/sbin/dc-cluster-resourceagent start
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM CONFIGURATION
> CHANGE
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New Configuration:
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
> ip(172.30.31.4)
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
> ip(172.30.31.5)
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Left:
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
> ip(172.30.31.3)
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Joined:
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] CLM CONFIGURATION
> CHANGE
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] New Configuration:
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
> ip(172.30.31.4)
>
> Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
> INFO - ACTIVESITE in state started
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ]      r(0)
> ip(172.30.31.5)
>
> Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
> INFO - ACTIVESITE service is running - ACTIVESITE in state started
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Left:
>
> Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
> INFO - Service MASTERVIP: Performing precautionary updates to databases
> before setting up VIP
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [CLM  ] Members Joined:
>
> Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
> INFO - Service MASTERVIP: running precautionary update [touch
> /tmp/dc-set-dr-counters]
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [SYNC ] This node is within
> the primary component and will provide service.
>
> Aug 16 16:59:15 gmhcluster2 /usr/sbin/dc-cluster-resourceagent: [2492]
> INFO - Service MASTERVIP: running precautionary update [touch
> /tmp/dc-set-dr-auto-increments]
>
> Aug 16 16:59:15 gmhcluster2 openais[4084]: [TOTEM] entering OPERATIONAL
> state.
>
> Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got nodejoin message
> 172.30.31.4
>
> Aug 16 16:59:16 gmhcluster2 openais[4084]: [CLM  ] got nodejoin message
> 172.30.31.5
>
> Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got joinlist message
> from node 3
>
> Aug 16 16:59:16 gmhcluster2 clurgmgrd: [4132]: <info> Adding IPv4
> address 172.30.30.93/24 to bond0
>
> Aug 16 16:59:16 gmhcluster2 openais[4084]: [CPG  ] got joinlist message
> from node 2
>
> Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Service
> service:MASTERVIP started
>
> Aug 16 16:59:17 gmhcluster2 clurgmgrd[4132]: <notice> Stopping service
> service:MASTERVIP
>
> Aug 16 16:59:17 gmhcluster2 clurgmgrd: [4132]: <info> Removing IPv4
> address 172.30.30.93/24 from bond0
>
> Aug 16 16:59:27 gmhcluster2 clurgmgrd: [4132]: <info> Executing
> /usr/sbin/dc-cluster-resourceagent stop
>
> Aug 16 16:59:27 gmhcluster2 clurgmgrd[4132]: <notice> Service
> service:MASTERVIP is stopped
>
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster
>
>
>