
Re: journal writeback interval (Re: how to counteract slowdown)



"Jeffrey W. Baker" wrote:
> 
> This is a pretty interesting discussion about the 5-second journal
> writeback interval.  I just observed my system [1] during an 'updatedb'
> which causes a lot of seeking and read activity [2] at the same time as
> a slow, sustained write.
> 
>        io
>  bi    bo
> 580     0
> 556     0
> 540     0
> 336   924
> 636     0
> 608     0
> 524     0
> 544     0
> 316  1348
> 548     0
> 288     0
>   0     0
>   0     0
>   0  2636

You'll find that ext2 does a similar thing.  Every five seconds,
kupdate wakes up and writes out blocks which have been dirty for 
more than five seconds.
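
The knobs for this are exported through /proc/sys/vm/bdflush.  A quick
sketch for dumping them - the field order differs between kernel
versions, so check Documentation/sysctl/vm.txt before reading anything
into the numbers:

	/*
	 * Dump the bdflush tunables.  On 2.4 the "interval" and
	 * "age_buffer" fields (in jiffies) control how often kupdate
	 * wakes up and how old a buffer must be before it gets written
	 * back; the exact field order is version-dependent.
	 */
	#include <stdio.h>

	int main(void)
	{
		char line[256];
		FILE *f = fopen("/proc/sys/vm/bdflush", "r");

		if (!f) {
			perror("/proc/sys/vm/bdflush");
			return 1;
		}
		if (fgets(line, sizeof(line), f))
			fputs(line, stdout);
		fclose(f);
		return 0;
	}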


> Obviously, there were a lot of missed opportunities to write some blocks
> out in those blank spaces.  I'd much rather see 500 blocks out per
> second than 2500 blocks out in one second, every five seconds.  Since
> the heads are flying all over the disks anyway, there were probably
> many, many missed opportunities for essentially free writes during that
> time.

Yes.  This is of course the early flush stuff (I forget the name)
which Daniel Phillips was talking about a few months back.  He had
a prototype which I think improved things.

Now, the idea is that the longer we defer writeout, the
more opportunity there is for the data to be deleted by
application activity, and to never be written.  Also, we
get more opportunity to combine writes into larger requests,
as they come in.

Note that the normal kupdate activity actually wastes opportunities
to combine writes: it'll write out all the buffers which are
more than thirty seconds old, but there may be a juicy chunk of
mergeable buffers which are only fifteen seconds old.  We miss them.

In practice, it's probably not too important.  If the disk is
maxed out then we'll get a buildup of dirty buffers and soon,
balance_dirty() will exceed its second threshold and will wake
bdflush, which will try to write out every dirty buffer in the
machine, so we'll get good request merging from that.

Perhaps.  The dynamics are complex, and they depend upon the
ratio of amount-of-physical-memory to disk throughput.
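
For concreteness, the shape of that two-threshold policy is roughly as
below.  Purely illustrative - the names and percentages are made up,
it is not the real fs/buffer.c code:

	/*
	 * Illustrative sketch only: past a soft threshold of dirty
	 * buffers, wake bdflush; past a hard threshold, make the
	 * dirtier do the writeback itself.
	 */
	#include <stdio.h>

	enum action { NOTHING, WAKE_BDFLUSH, WRITE_SYNCHRONOUSLY };

	static enum action balance_dirty_policy(unsigned long dirty,
						unsigned long total,
						unsigned soft_pct,
						unsigned hard_pct)
	{
		if (dirty * 100 > total * hard_pct)
			return WRITE_SYNCHRONOUSLY;	/* second threshold */
		if (dirty * 100 > total * soft_pct)
			return WAKE_BDFLUSH;		/* first threshold */
		return NOTHING;
	}

	int main(void)
	{
		/* 40% of buffers dirty, thresholds at 30% and 60% */
		printf("%d\n", balance_dirty_policy(400, 1000, 30, 60));
		return 0;
	}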

Yes, earlier writeout may help if it's clever.  A generic implementation
is tricky - down in the bowels of the request layer.  However an
ext3-specific implementation is easy.  At the end of commit, as we
refile buffers for writeback:

	mark_buffer_dirty(bh);
	if (journal_getting_full()) {
		/* uh-oh */
		ll_rw_block(WRITE, 1, &bh);
	}

That's what the three-liner which I sent to Neil did, without
the journal_getting_full() bit.  However it doesn't seem to make
much difference in testing here.  It had an SMP bug, too.  Fixed
version:

--- linux-2.4.15-pre4/fs/jbd/commit.c	Mon Nov 12 11:16:12 2001
+++ linux-akpm/fs/jbd/commit.c	Mon Nov 12 23:20:28 2001
@@ -649,7 +649,12 @@ skip_commit:
 			JBUFFER_TRACE(jh, "add to new checkpointing trans");
 			__journal_insert_checkpoint(jh, commit_transaction);
 			JBUFFER_TRACE(jh, "refile for checkpoint writeback");
+			get_bh(bh);
 			__journal_refile_buffer(jh);
+			spin_unlock(&journal_datalist_lock);
+			ll_rw_block(WRITE, 1, &bh);
+			put_bh(bh);
+			spin_lock(&journal_datalist_lock);
 		} else {
 			J_ASSERT_BH(bh, !buffer_dirty(bh));
 			J_ASSERT_JH(jh, jh->b_next_transaction == NULL);

From a quick test, this slows things down.

Incidentally, the cost of the write to the journal is really low
(unless it's fragmented all over the place).  When we were experimenting
with the external journal code, putting the journal on a local RAM disk
made very little difference to throughput.  All the cost is in writeback.
Which is a good thing.

 
> It seems like it makes more sense to inject the write requests into the
> i/o queue whenever they are ready, if the elevator is walking up and
> down the disk anyway.  Definitely 2.5 material, but it would be very
> nice if ext3 could instruct the i/o layer to perform operations
> opportunistically if possible, and flush the remaining operations every
> five seconds.

mmm..  At present the dirty buffers are sorted into time order
for writeback.  It may make sense to sort them in block order
(which is much harder...).
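
Roughly, block-order submission of a batch would look something like
this - a toy sketch, with the structure and the write step invented
for illustration:

	/*
	 * Toy sketch of block-order submission: sort a batch of dirty
	 * blocks by block number before writing, so the elevator sees
	 * a sorted stream.
	 */
	#include <stdio.h>
	#include <stdlib.h>

	struct dirty_buf {
		unsigned long blocknr;
		unsigned long dirtied_when;	/* when it was dirtied */
	};

	static int by_blocknr(const void *a, const void *b)
	{
		const struct dirty_buf *x = a, *y = b;

		if (x->blocknr < y->blocknr)
			return -1;
		return x->blocknr > y->blocknr;
	}

	int main(void)
	{
		struct dirty_buf bufs[] = {
			{ 9120, 100 }, { 14, 250 }, { 9121, 90 }, { 530, 400 },
		};
		int i, n = sizeof(bufs) / sizeof(bufs[0]);

		qsort(bufs, n, sizeof(bufs[0]), by_blocknr);
		for (i = 0; i < n; i++)
			printf("write block %lu\n", bufs[i].blocknr);
		return 0;
	}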

Block writes (buffers) are merged into requests by the elevator.
For IDE, a request can be up to 248 sectors: 124 kbytes.  And
by default, we only allow 128(ish) requests per device before
the writer blocks.   See the flaw in this: if we have a few thousand
dirty buffers (this happens easily) then they may all be mergeable
into, say, 200 requests.  But if we block the thread which is
injecting writes as soon as 128 requests are queued, there can
be missed merge opportunities.
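
To put rough numbers on that (assuming 4k buffers, which is a guess):

	/*
	 * Back-of-envelope for the flaw above: how many fully-merged
	 * 124k requests would a pile of dirty buffers need, versus the
	 * ~128 requests allowed in flight per device?
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned long nbufs = 6200;		/* "a few thousand" buffers */
		unsigned long bufsize = 4096;		/* assumed 4k */
		unsigned long reqsize = 124 * 1024;	/* 248 sectors */
		unsigned long reqs = (nbufs * bufsize + reqsize - 1) / reqsize;

		printf("%lu buffers -> %lu ideal requests (cap ~128)\n",
		       nbufs, reqs);
		return 0;
	}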

That being said, it's trivial to increase the 128.  This makes
quite a bit of difference with some typical ext2 workloads.
However when the block allocator is changed to not splatter
directories everywhere, it makes no difference at all, and
128 requests works fine.

hmm..  I just ran a test.

	time (tar xfz /foo/linux-2.4.14.gz ; sync)

ext3, 200 meg journal, data=journal:	26 seconds
ext2:					20 seconds

	time (tar xfz /foo/linux-2.4.14.gz)
ext3, 200 meg journal, data=journal:	25 seconds
ext2:					7 seconds

So that additional sync at the end of the ext2 operation costs
13 seconds.  That's the time taken to write all the data back
into the main fs.  It's a real cost, but it tends to be less
noticeable with ext2 for bursty loads because it happens after
the untarring has finished.    With ext3 it happens in the
middle of the benchmark.

For long-term workloads which generate more data than the
lower bdflush threshold, the 20/26 ratio is the one to look
at.   It's pretty good, actually. But yes, it'd be nice to
do something about the latency which ext3 introduces.

> [2] An aside: it seems that an 'updatedb' program which knew a lot about
> ext2's structure could be smarter.  Instead of recursively descending
> the directory structure (and thereby striding all over the disk), it
> would walk the disk from beginning to end, building an object graph in
> memory.  At the end, it could mark-and-sweep any disconnected objects,
> sort the graph, and write out the database.  This would involve only
> track-to-track seeking.  Thoughts?

Or if it forked 100 threads, each one performing reads.
That way, we get up to 100 requests in the queue, all
nicely merging together.  Having a single thread doing the
reading really hurts, because, of course, it has to wait on the
result of every single read.  Kernel-level and device-level readahead
should really help here though.
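
Something like the sketch below, purely to show the shape of it: one
file per thread, paths on the command line, minimal error handling.

	/*
	 * Toy sketch of the many-readers idea: one thread per file,
	 * each just dragging its file in, so lots of read requests can
	 * be outstanding at once.
	 */
	#include <fcntl.h>
	#include <pthread.h>
	#include <unistd.h>

	#define NTHREADS 100

	static void *reader(void *arg)
	{
		char buf[4096];
		int fd = open((const char *)arg, O_RDONLY);

		if (fd >= 0) {
			while (read(fd, buf, sizeof(buf)) > 0)
				;
			close(fd);
		}
		return NULL;
	}

	int main(int argc, char **argv)
	{
		pthread_t tids[NTHREADS];
		int i, n = argc - 1 < NTHREADS ? argc - 1 : NTHREADS;

		for (i = 0; i < n; i++)
			if (pthread_create(&tids[i], NULL, reader, argv[i + 1]))
				n = i;
		for (i = 0; i < n; i++)
			pthread_join(tids[i], NULL);
		return 0;
	}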

Alternatively: async IO (aio).  This allows a large number of
reads to be outstanding, with out-of-order completion.  At present
aio is implemented in glibc.  In-kernel implementations are coming.
I don't know how useful aio will be to updatedb though - something
which touches zillions of files needs an async stat(), open(), etc.
aio doesn't do that.
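
For reference, the glibc interface looks something like the sketch
below (link with -lrt).  The open() is still synchronous, which is
exactly that limitation:

	/*
	 * Minimal POSIX aio sketch using the glibc (librt) interface:
	 * queue a read, then wait for it.
	 */
	#include <aio.h>
	#include <fcntl.h>
	#include <stdio.h>
	#include <string.h>
	#include <unistd.h>

	int main(int argc, char **argv)
	{
		char buf[4096];
		struct aiocb cb;
		const struct aiocb *const list[1] = { &cb };
		int fd;

		if (argc < 2)
			return 1;
		fd = open(argv[1], O_RDONLY);	/* still synchronous */
		if (fd < 0)
			return 1;

		memset(&cb, 0, sizeof(cb));
		cb.aio_fildes = fd;
		cb.aio_buf = buf;
		cb.aio_nbytes = sizeof(buf);
		cb.aio_offset = 0;

		if (aio_read(&cb) == 0) {		/* queue the read */
			aio_suspend(list, 1, NULL);	/* other work could go here */
			if (aio_error(&cb) == 0)
				printf("read %ld bytes\n", (long)aio_return(&cb));
		}
		close(fd);
		return 0;
	}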




