
Re: [Cluster-devel] waiting in init.d/cman



Hi David,

On Wed, 2009-08-05 at 11:12 -0500, David Teigland wrote:

> But, I think it may be best for init.d/cman to wait explicitly for quorum. 

I agree, but it has to be optional, with the default being not to wait
for quorum.

>  It
> would be clearer what's happening (what's delaying startup), which was one of
> the cluster2 problems.  So, roughly, init.d/cman would do:
> 
> - cman_tool join, print "Joining cluster"
> - qdiskd (if configured), print "Starting qdiskd"
> - wait for quorum, print "Waiting for quorum"
> 
> Any reasons to not do this or do it differently?

My concern is that this could block the boot waiting for quorum when
quorum might never become available. As above, I don't mind adding that
to the init script, but it will need yet another timeout.

> Related to this is the broader issue of waiting and timeouts in init.d/cman.
> It would be nice to not have timeouts... I think the main reason for them is
> that cman has started before the ssh service, so people could never log in if
> cman was stuck (we talked about this a while back and I guess decided we
> couldn't move cman later in the startup.)
> 
> Here's the startup with each wait/timeout mentioned (steps 3,4 only if qdisk
> is configured.)
> 
> 1. cman_tool join -w -t 120
> 2. WAIT/120s for join to complete, in cman_tool from the -w -t 120 options

This is configurable so we could probably lower it a bit, but it needs
to be there. The cman_tool -> corosync startup is complex and takes
time. There is no exact moment when it finishes.

> 3. qdiskd
> 4. WAIT/20s for cman to recognize qdisk (?), in init script loop

Yes, that is correct. cman will see qdisk only after qdisk has completed
its init on disk. We wait because qdisk can also guarantee quorum for
that node.
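
To make the "init script loop" concrete, here is a minimal sketch of a
poll-with-timeout helper. The name wait_until is made up for
illustration, and the check command is left to the caller; this is not
the code that is in init.d/cman today.

```shell
# Hypothetical helper: run a check command once per second until it
# succeeds or the given number of seconds has elapsed.
# usage: wait_until TIMEOUT_SECONDS COMMAND [ARGS...]
wait_until() {
    wu_timeout=$1
    shift
    wu_elapsed=0
    until "$@"; do
        if [ "$wu_elapsed" -ge "$wu_timeout" ]; then
            return 1    # timed out, the check never succeeded
        fi
        sleep 1
        wu_elapsed=$((wu_elapsed + 1))
    done
    return 0            # the check succeeded within the timeout
}
```

The 20s wait of step 4 would then be something like
"wait_until 20 qdisk_registered", where qdisk_registered stands for
whatever test the script uses to see that cman knows about qdisk.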

> 5. WAIT/??s for quorum, new step probably via cman_tool wait -q -t ??

+1 from me if everybody else agrees. In general this timeout would be
hit only when the whole cluster is booting for the first time ever.
A single node rebooting into a running cluster won't even notice it.
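
A rough sketch of how an optional quorum wait could look in the init
script, assuming the cman_tool wait -q -t call proposed in step 5.
CMAN_QUORUM_WAIT and CMAN_QUORUM_TIMEOUT are made-up names, not
existing settings, and the 20 is only a placeholder default.

```shell
# Hypothetical knobs, for illustration only.
CMAN_QUORUM_WAIT="${CMAN_QUORUM_WAIT:-no}"        # default: do not wait
CMAN_QUORUM_TIMEOUT="${CMAN_QUORUM_TIMEOUT:-20}"  # seconds

wait_for_quorum() {
    # Waiting is opt-in, so the default path returns immediately.
    if [ "$CMAN_QUORUM_WAIT" != "yes" ]; then
        return 0
    fi
    echo "Waiting for quorum"
    # Hand back cman_tool's exit status to the caller.
    cman_tool wait -q -t "$CMAN_QUORUM_TIMEOUT"
}
```

Whether a non-zero return should make init.d/cman exit with a failure
is left to the caller here.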

> 6. start other daemons
> 7. fence_tool join -w 20
> 8. WAIT/20s for fence domain join to complete, in fence_tool from -w 20 option
> 
> step 2: there's been some doubt about what join -w actually gives us; at a
> minimum -w may be useful here to catch delayed startup errors from corosync
> and to be sure it's started up enough that qdiskd can use it in step 3.
> Otherwise, the wait in step 5 seems to obviate the need for waiting at all in
> step 2.

qdisk doesn't care if cman is not there. It will run and wait for cman
to appear.

> 
> step 5: this is the only wait that people will typically notice during normal
> operation.  Any suggestions on a timeout here?  And if it expires should
> init.d/cman exit with a failure?  (I believe that's what other timeouts
> cause.)

I'd say 20 seconds again? It seems reasonable to me.

Fabio

