[Linux-cluster] qdiskd eviction on missed writes

Thu Jan 4 16:35:42 UTC 2007

On Wed, 2007-01-03 at 23:12 -0500, danwest wrote:

> The SAN in this
> case is a high end EMC DMX, multipathed, etc...  Currently our clusters
> are set to interval="1" and tko="15" which should allow for at least 15
> seconds (a very long time for this type of storage)
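
"At most" 15 seconds, not "at least": interval * tko is the upper bound
on how long a node can go without a successful write before it is
evicted.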

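As a back-of-the-envelope sketch of that bound (the helper name is made
up for illustration, it is not something in qdiskd): one update is
expected per interval and tko missed updates are fatal, so the nominal
window is simply the product of the two.  Time the loop spends outside
of its sleep is not modelled here.

```c
/* Hypothetical helper, not part of qdiskd: nominal number of seconds a
 * node has to land a status write before it is evicted.  Assumes one
 * update is expected per interval and that tko missed updates are
 * fatal. */
unsigned int eviction_window_secs(unsigned int interval, unsigned int tko)
{
    return interval * tko;
}
```

With interval="1" and tko="15" this gives 15; halving tko halves the
window.
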
> In looking at ~/cluster/cman/qdisk/main.c it seems like the following is
> taking place:
> 
> In quorum_loop {}
> 
>         1) read everybody else's status (not sure if this includes
>            yourself)
>         2) check for node transitions (write eviction notice if number
>            of heartbeats missed > tko)
>         3) check local heuristic (if we do not meet requirement, remove
>            from qdisk partition and possibly reboot)
>         4) find master and/or determine new master, etc.
>         5) write out our status to qdisk
>         6) write out our local status (heuristics)
>         7) cycle (sleep for defined interval).  sleep() is measured in
>            seconds, so complete cycle = interval + time for steps (1)
>            through (6)
> 
> Do you think that any delay in steps (1) through (4) could be the
> problem?  From an architectural standpoint wouldn't it be better to have
> (6) and (7) as a separate thread or daemon?  A kernel thread like
> cman_hbeat for example?

The heuristics are checked in the background in a separate thread; the
main loop only looks at their current states.  Step 1 takes the longest
of anything qdiskd does in a cycle.  Steps 2-4, however, shouldn't take
long.
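
If you want to see where the time in a cycle actually goes, something
along these lines would do.  This is only a sketch: elapsed_ms() and
time_cycle_ms() are hypothetical helpers, not functions in main.c, and
the callback stands in for whichever step you want to time.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L   /* for clock_gettime() */
#endif
#include <time.h>

/* Hypothetical helper: milliseconds between two timespec samples. */
long long elapsed_ms(const struct timespec *start, const struct timespec *end)
{
    long long ns = (long long)(end->tv_sec - start->tv_sec) * 1000000000LL
                 + (long long)(end->tv_nsec - start->tv_nsec);
    return ns / 1000000LL;
}

/* Hypothetical helper: run one step (passed in as a callback) and
 * return how long it took in milliseconds.  CLOCK_MONOTONIC is used so
 * that wall-clock adjustments do not distort the measurement. */
long long time_cycle_ms(void (*step)(void))
{
    struct timespec t0, t1;

    clock_gettime(CLOCK_MONOTONIC, &t0);
    step();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ms(&t0, &t1);
}
```

A step that regularly takes much longer than the I/O itself should on
this class of storage points at the process not being scheduled rather
than at the disk.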

Making the read and write paths separate probably will not change much
- it's all direct I/O.  You basically said it yourself: on high end
storage, this just shouldn't be a problem.  We're doing a maddening 8k
of reads and 0.5k of writes during a normal cycle, every (in your case)
1 second.

So, I suspect it's a scheduling problem.  That is, it would probably be
a whole lot more effective to just increase the priority of qdiskd so
that it gets scheduled even during load spikes (e.g. by putting it on a
realtime queue; SCHED_RR?).  I don't think the I/O path is the
bottleneck.
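
A minimal sketch of what moving a process onto SCHED_RR looks like,
using the standard sched_setscheduler() call.  The two function names
are invented for this example and none of this is in qdiskd today; the
call also needs root (or CAP_SYS_NICE) to succeed.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Hypothetical helper: clamp a requested priority into the range the
 * kernel accepts for the given scheduling policy. */
int clamp_rt_priority(int policy, int requested)
{
    int lo = sched_get_priority_min(policy);
    int hi = sched_get_priority_max(policy);

    if (requested < lo)
        return lo;
    if (requested > hi)
        return hi;
    return requested;
}

/* Hypothetical helper: put the calling process on the SCHED_RR queue.
 * Returns 0 on success or the errno value on failure (typically EPERM
 * when run without sufficient privileges). */
int set_round_robin(int requested)
{
    struct sched_param sp;

    memset(&sp, 0, sizeof(sp));
    sp.sched_priority = clamp_rt_priority(SCHED_RR, requested);
    if (sched_setscheduler(0, SCHED_RR, &sp) != 0)
        return errno;
    return 0;
}
```

Calling set_round_robin(1) early in startup would be enough to test the
theory; a low realtime priority already preempts everything in the
normal time-sharing class.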

> Further in the check_transitions procedure case #2 it might be more
> helpful to clulog what actually caused this to trigger.  The current
> logging is a bit generic.

You're totally right here; the logging isn't very great at the moment.
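
Something like the following is the kind of message that would help.
It is purely illustrative: the function, its arguments and the wording
are assumptions for this example, not what check_transitions logs now.

```c
#include <stdio.h>

/* Hypothetical helper: build a log line that says why a node is being
 * evicted.  The argument names are illustrative and do not correspond
 * to qdiskd's own structures.  Returns what snprintf() returns. */
int format_eviction_reason(char *buf, size_t len,
                           int nodeid, int missed, int tko)
{
    return snprintf(buf, len,
                    "Node %d missed %d updates (tko=%d); writing eviction notice",
                    nodeid, missed, tko);
}
```

The resulting string could then be handed to whatever logging call is
already in use at that point.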

-- Lon