[Linux-cluster] qdiskd eviction on missed writes
Lon Hohberger
lhh at redhat.com
Thu Jan 4 16:35:42 UTC 2007
On Wed, 2007-01-03 at 23:12 -0500, danwest wrote:
> The SAN in this
> case is a high end EMC DMX, multipathed, etc... Currently our clusters
> are set to interval="1" and tko="15" which should allow for at least 15
> seconds (a very long time for this type of storage)
"at max" 15 seconds.
> In looking at ~/cluster/cman/qdisk/main.c it seems like the following is
> taking place:
>
> In quorum_loop {}
>
> 1) read everybody else's status (not sure if this includes
> yourself)
> 2) check for node transitions (write eviction notice if number
> of heartbeats missed > tko)
> 3) check local heuristic (if we do not meet requirement remove
> from qdisk partition and possibly reboot)
> 4) Find master and/or determine new master, etc...
> 5) write out our status to qdisk
> 6) write out our local status (heuristics)
> 7) cycle ( sleep for defined interval). sleep() measured in
> seconds so complete cycle = interval + time for steps (1) through (6)
>
> Do you think that any delay in steps (1) through (4) could be the
> problem? From an architectural standpoint wouldn't it be better to have
> (6) and (7) as a separate thread or daemon? A kernel thread like
> cman_hbeat for example?
The heuristics are checked in the background in a separate thread; the
only thing the main loop checks is their current states. Step 1 will
take a while (it accounts for most of the time spent in a qdiskd
cycle). However, steps 2-4 shouldn't.
Moving the read/write into a separate thread probably will not change
much - it's all direct I/O. You basically said it yourself: on high end
storage, this just shouldn't be a problem. We're doing a maddening 8k
of reads and 0.5k of writes during a normal cycle every (in your case) 1
second.
So, I suspect it's a scheduling problem. That is, it would probably be
a whole lot more effective to just increase the priority of qdiskd so
that it gets scheduled even during load spikes (e.g., use a realtime
queue; SCHED_RR?). I don't think the I/O path is the bottleneck.
> Further in the check_transitions procedure case #2 it might be more
> helpful to clulog what actually caused this to trigger. The current
> logging is a bit generic.
You're totally right here; the logging isn't very great at the moment.
-- Lon