
Re: how to counteract slowdown

On Tue, 13 Nov 2001, Andrew Morton wrote:
> Daniel Pittman wrote:
>> On Mon, 12 Nov 2001, Andrew Morton wrote:
>> > Daniel Pittman wrote:


>> Right. Whereas on ext3, the five second commit time causes the
>> journal to flush transactions, which cause the hard work of moving
>> data to its final location, all in a spurt.
> Lots of little spurts.   Which costs a bit because of lost merging
> opportunities.

Relatively big, actually, for the load. I see write loads of 1.5 to 2.5
*thousand* blocks being submitted for write, as per vmstat output.

That's a substantial number of 2K blocks -- 4 MB of data being submitted
to the disk system.

That /should/ see some merging. :)
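
As a rough check on those figures, the arithmetic can be written out as a small C sketch. The 2048-byte block size is the 2K figure quoted above; the function names are made up for illustration and are not from vmstat or the kernel.

```c
#include <stdio.h>

/* Hypothetical helper: bytes covered by `blocks` blocks of `block_size` bytes. */
unsigned long blocks_to_bytes(unsigned long blocks, unsigned long block_size)
{
    return blocks * block_size;
}

/* Print the low and high ends of the write bursts reported above. */
void print_burst_sizes(void)
{
    const unsigned long block_size = 2048;  /* assumed: the 2K blocks mentioned above */

    printf("1500 blocks = %lu bytes\n", blocks_to_bytes(1500, block_size));
    printf("2500 blocks = %lu bytes\n", blocks_to_bytes(2500, block_size));
}
```

That works out to roughly 3 to 5 million bytes per burst, which brackets the 4 MB quoted.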


>> Did I read the code correctly? It looks like you end up calling
>> journal_force_commit for any fsync(), which sets handle->h_sync to 1,
>> then does journal_stop.
> Yes.  No choice there.  

Right, because this puts it in the journal. Then it's assured of getting
to disk eventually, no matter what.

> The only optimisation which we can make is to allow other threads to
> piggyback onto the same commit. We do that aggressively, so with my
> simplistic postfix simulator we can beat ext2 by a factor of thirty or
> so. If you choose the benchmark right :). But it wasn't completely
> nutty.

It makes a lot of sense to me. For a larger system, it's probably going
to be a real win. 

>> This means that journal_stop waits in log_wait_commit for the
>> transaction to be flushed out of the journal and onto its final
>> location?
> yes.

Now, this is the bit that puzzles me:  if I write the following:
(assume the C is correct, please :)

int file = open("blah", O_WRONLY);
write(file, buffer, 40000);
fsync(file);

After the fsync call, where is the content of buffer? Is it:
(a) in the journal only
(b) in the journal and on disk

This is the one bit that confuses me from your answer.
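
For anyone who wants to try this, here is a fuller sketch of the same write-then-fsync sequence with error checking. The function name write_and_sync and the open flags are my own choices for illustration. All fsync() promises is that the data has reached stable storage by the time it returns; whether that means the journal or the final location is up to the filesystem, which is exactly the question above.

```c
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/*
 * Hypothetical helper: write `len` bytes from `buf` to `path`, then fsync().
 * Returns 0 on success, -1 on any failure. EINTR is not retried, to keep
 * the sketch short.
 */
int write_and_sync(const char *path, const char *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    /* write() may accept fewer bytes than asked for, so loop until done. */
    size_t done = 0;
    while (done < len) {
        ssize_t n = write(fd, buf + done, len - done);
        if (n <= 0) {
            close(fd);
            return -1;
        }
        done += (size_t)n;
    }

    /* Block until the kernel reports the file's data as on stable storage. */
    if (fsync(fd) < 0) {
        close(fd);
        return -1;
    }
    return close(fd);
}
```

Timing the fsync() call on its own (for example with gettimeofday() around it) would be one way to see how long the commit wait actually takes.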


>> Assuming that's so, I must be very dense, as it seems that the
>> kjournald code is also writing only to the journal, not to the actual
>> disk.
> yes.  Once the buffers have been written to the journal they
> are refiled for lazy writeback by kjournald.  

Ah, right.

> The actual writeback is performed by kupdate. Unless we run out of
> memory, in which case bdflush will push them to disk. Or unless we run
> out of journal space, in which case log_do_checkpoint() will write
> them out.

I see. 

> No matter who writes out the checkpoint-mode buffers, we cannot
> reclaim a transaction's journal space until all the buffers for
> that transaction are known to have been written to their final
> destination.
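
The rule in that paragraph can be written down as a toy model. This is not the JBD code: the struct and function names below are invented for illustration, and a real transaction tracks far more state.

```c
#include <stdbool.h>
#include <stddef.h>

/* Toy stand-in for a committed transaction that is waiting to be checkpointed. */
struct toy_transaction {
    size_t nr_buffers;       /* buffers logged by this transaction */
    size_t nr_written_back;  /* how many of them have reached their final location */
    size_t journal_blocks;   /* journal space the transaction occupies */
};

/* Journal space is reusable only once every buffer has been written back. */
bool toy_can_reclaim(const struct toy_transaction *t)
{
    return t->nr_written_back == t->nr_buffers;
}

/*
 * Record one buffer written back, by whichever writer got to it.
 * Returns the journal blocks that can now be reused: 0 until the last
 * buffer lands, then the transaction's whole allocation.
 */
size_t toy_buffer_written(struct toy_transaction *t)
{
    if (t->nr_written_back < t->nr_buffers)
        t->nr_written_back++;
    return toy_can_reclaim(t) ? t->journal_blocks : 0;
}
```

The point of the model is that the result is all or nothing: a transaction with one straggling buffer pins all of its journal blocks.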

Sure, otherwise it wouldn't be a journal, it would be a waste of time.

> Possibly log_do_checkpoint() could be smarter, and not write
> out all dirty buffers.  Or it could start IO on all of them,
> but stop waiting on writeout once sufficient journal space has
> become available.  
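
A toy sketch of the second idea, stopping the wait once enough journal space is free. It is a simplified model and not log_do_checkpoint() itself: the function name is invented, and it assumes transactions are checkpointed oldest first and that each one frees its journal blocks as soon as its writeout completes.

```c
#include <stddef.h>

/*
 * Hypothetical model: journal_blocks[i] is the journal space held by the
 * i-th oldest transaction. IO is assumed to have been started on all of
 * them; this counts how many writeouts the caller must wait on before
 * `needed` blocks are free. Waiting on all `n` is the other strategy.
 */
size_t toy_waits_until_enough(const size_t *journal_blocks, size_t n, size_t needed)
{
    size_t freed = 0;
    size_t waited = 0;

    while (waited < n && freed < needed) {
        freed += journal_blocks[waited];  /* this transaction's space is now reusable */
        waited++;
    }
    return waited;
}
```

With four transactions of 4 blocks each and 6 blocks needed, the caller waits on two writeouts instead of four, while the other two carry on in the background.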

The second option actually sounds good to me, but that's because it lets
real work proceed as soon as possible, but still works hard to empty
something that's been very busy recently and, thus, is likely to be so

> Writing them all out gets good clustering and hence throughput. But
> introduces latency for these bursty loads.

Yes, it would. Now, this latency would happen only if the journal got to
be so full that we couldn't fit more data, right? The code looks that
way (unless the journal is destroyed or the FS remounted, of course.)

In that case, log_do_checkpoint /isn't/ related to the slowdowns that I
see, because there just isn't 100MB of data floating around, no matter
which way I

>> That means that the second or so of write-out every five seconds is
>> because I have plenty of spare RAM for buffering output, but my disks
>> are poor, slow IDE devices that write slowly.
> And the blocks are allocated all over the disk :(

Yeah, that is a pain. I have a vague hope that the opportunistic write
code and the improved allocation packing code get reliable soon. :)


>> > I doubt if it'll help, but the `HZ * 5' in journal_init_common()
>> > is what to tweak.
>> I expect it will, actually, because it's going to increase the length
>> of time between ext3 writing data to the journal and the kjournald
>> thread kicking off the commit that forces those blocks to their final
>> location...
> The five second timeout doesn't come into play if someone is
> generating lots of writes - we run out of journal space first.
> And then there's fsync().

I suspect that increasing the flush delay will help smooth the load /I/
see, but that it's relevant only because I have a specific case that it
helps: enough RAM that it's better to buffer until the 30 second write
burst is done before forcing some (lazy) writeback...


The past is a foreign country: they do things differently there.
        -- L P Hartley, _The Go-Between_
