
[Cluster-devel] waiting in init.d/cman



Back in the busy days of cluster3 development, I spent a little time looking
at the issue of waiting for quorum (and other waiting/timeouts) during
init.d/cman startup.

I wanted to clean up cluster2's somewhat arbitrary approach and have explicit,
intentional behavior around what each init.d/cman step would wait for and what
it wouldn't.  Strangely, it was fence_tool join where all sorts of odd
waits/timeouts had been wedged at various times.

In untangling and fixing this, I'm not sure I got it quite right.  The current
behavior is that init.d/cman runs through and completes successfully very
quickly without waiting for quorum.  This seems nice, because it can be
annoying to have init.d/cman block.  In general it works, too; it just defers
the wait for quorum until some cluster-using service starts later (clvmd,
rgmanager, gfs mount).

But I think it may be best for init.d/cman to wait explicitly for quorum.
That would make it clearer what's happening (what's delaying startup); not
being able to tell was one of the cluster2 problems.  So, roughly, init.d/cman
would do:

- cman_tool join, print "Joining cluster"
- qdiskd (if configured), print "Starting qdiskd"
- wait for quorum, print "Waiting for quorum"
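
As a rough sketch only, the three steps above might look like this in the
init script.  The CMAN_TOOL and QDISKD variables, the USE_QDISK switch, and
the qdisk_configured and start_cman function names are all hypothetical,
introduced here just so the flow can be read (and exercised with stand-in
commands); how the real script detects a configured qdisk is not shown.

```shell
# Hypothetical sketch of the proposed start sequence; not the real init.d/cman.
# CMAN_TOOL and QDISKD can be overridden to try the flow without a cluster.
CMAN_TOOL=${CMAN_TOOL:-cman_tool}
QDISKD=${QDISKD:-qdiskd}

# Placeholder check: assumes a USE_QDISK=yes switch, which is an invention
# of this sketch rather than something the real script is known to use.
qdisk_configured() {
    [ "${USE_QDISK:-no}" = "yes" ]
}

start_cman() {
    echo "Joining cluster"
    "$CMAN_TOOL" join || return 1

    if qdisk_configured; then
        echo "Starting qdiskd"
        "$QDISKD" || return 1
    fi

    # Blocks here, so the message tells the admin what is delaying startup.
    echo "Waiting for quorum"
    "$CMAN_TOOL" wait -q || return 1
}
```

Printing each message before the command that can block is the point of the
ordering: whatever is delaying boot is always the last line on the console.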

Any reasons to not do this or do it differently?

Related to this is the broader issue of waiting and timeouts in init.d/cman.
It would be nice to not have timeouts... I think the main reason for them is
that cman is started before the ssh service, so people could never log in if
cman got stuck (we talked about this a while back and I guess we decided we
couldn't move cman later in the startup.)

Here's the startup with each wait/timeout mentioned (steps 3 and 4 apply only
if qdisk is configured.)

1. cman_tool join -w -t 120
2. WAIT/120s for join to complete, in cman_tool from the -w -t 120 options
3. qdiskd
4. WAIT/20s for cman to recognize qdisk (?), in init script loop
5. WAIT/??s for quorum, new step probably via cman_tool wait -q -t ??
6. start other daemons
7. fence_tool join -w 20
8. WAIT/20s for fence domain join to complete, in fence_tool from -w 20 option
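
For the waits that live in the init script itself rather than in a tool
option (step 4 above), a generic bounded polling loop is one way to write
them.  This is only a sketch: wait_until is a hypothetical helper name, and
the condition it polls is left to the caller, since what exactly step 4
checks is not spelled out here.

```shell
# wait_until SECONDS COMMAND [ARGS...]
# Re-run COMMAND about once per second until it succeeds or roughly SECONDS
# seconds of sleeping have passed.  Returns 0 on success, 1 on timeout.
# Plain sh has no "local", so the variables carry a prefix to avoid clashes.
wait_until() {
    _wu_timeout=$1
    shift
    _wu_elapsed=0
    while ! "$@"; do
        if [ "$_wu_elapsed" -ge "$_wu_timeout" ]; then
            return 1
        fi
        sleep 1
        _wu_elapsed=$((_wu_elapsed + 1))
    done
    return 0
}
```

A call would look like "wait_until 20 some_check", where some_check stands in
for whatever test the script uses to decide that cman has seen the qdisk.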

step 2: there's been some doubt about what join -w actually gives us; at a
minimum -w may be useful here to catch delayed startup errors from corosync
and to be sure it's started up enough that qdiskd can use it in step 3.
Apart from that, the wait in step 5 seems to obviate the need for waiting at
all in step 2.

step 5: this is the only wait that people will typically notice during normal
operation.  Any suggestions on a timeout here?  And if it expires, should
init.d/cman exit with a failure?  (I believe that's what the other timeouts
do.)
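
To make the step 5 question concrete, here is one possible shape for it, as a
sketch only.  It assumes cman_tool wait exits non-zero when the timeout
expires, and it treats that as a failure of the init script, which is exactly
the open question.  wait_for_quorum and CMAN_TOOL are hypothetical names, and
the timeout is deliberately left as a parameter rather than a chosen value.

```shell
# CMAN_TOOL can be overridden to try this without a running cluster.
CMAN_TOOL=${CMAN_TOOL:-cman_tool}

# wait_for_quorum [SECONDS]
# With SECONDS, pass it to "cman_tool wait -q -t"; without it, wait with no
# timeout.  Returns 1 if the wait fails (assumed to include a timeout).
wait_for_quorum() {
    echo "Waiting for quorum"
    if [ -n "${1:-}" ]; then
        "$CMAN_TOOL" wait -q -t "$1" && return 0
    else
        "$CMAN_TOOL" wait -q && return 0
    fi
    echo "Timed out or failed waiting for quorum" >&2
    return 1
}
```

Leaving the argument empty gives the no-timeout behavior, so the choice
between "block forever" and "fail after N seconds" becomes a single setting
rather than a change to the script's flow.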

Dave

