
Re: how to counteract slowdown

Daniel Pittman wrote:
> On Mon, 12 Nov 2001, Andrew Morton wrote:
> > Daniel Pittman wrote:
> >>
> >> ...
> >> > Postfix does lots of fsync()s.
> >>
> >> Yup. I thought one of the features of the journaled data mode was
> >> that it helped here, because the fsync() only needed to hit the
> >> journal. Can that actually block on write-out?
> >
> > It has to - that's the nature of fsync() - it can't return
> > until the file is written to non-volatile storage.
> Duh. Let me fetch that brown paper bag...
> [...]
> > So a 20-second burst of write activity will seem really quick
> > on ext2, because all the heavy lifting starts afterwards.
> Right. Whereas on ext3, the five second commit time causes the journal
> to flush transactions, which cause the hard work of moving data to its
> final location, all in a spurt.

Lots of little spurts, which costs a bit because of lost merging.

> That then makes the latency for the reads and writes that Postfix wants
> to do high, starving it for a moment as the kernel processes that write
> load.
> From reading commit.c, it /looks/ like the journal is unlocked during
> the asynchronous write of the data blocks, so Postfix isn't being
> starved because it's waiting on the journal lock...

It will be, if it does fsync().  That has to wait for a commit
to complete.
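
To see how long that wait is on a given filesystem, one can time the
fsync() call directly. This is a minimal sketch, not anything Postfix or
ext3 actually ships: the helper name timed_fsync_write() and the file path
passed to it are made up for illustration, and it assumes a POSIX system
with clock_gettime().

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <time.h>
#include <unistd.h>

/*
 * Hypothetical helper: append msg to path, then fsync() the file.
 * Returns the seconds spent inside fsync() alone, or -1.0 on any error.
 */
double timed_fsync_write(const char *path, const char *msg)
{
    struct timespec t0, t1;
    size_t len;
    int fd, rc;

    assert(path != NULL && msg != NULL);

    fd = open(path, O_WRONLY | O_CREAT | O_APPEND, 0600);
    if (fd < 0)
        return -1.0;

    len = strlen(msg);
    if (write(fd, msg, len) != (ssize_t)len) {
        close(fd);
        return -1.0;
    }

    /* Time only the fsync(): this is the part that waits on storage. */
    clock_gettime(CLOCK_MONOTONIC, &t0);
    rc = fsync(fd);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    close(fd);

    if (rc != 0)
        return -1.0;
    return (double)(t1.tv_sec - t0.tv_sec) +
           (double)(t1.tv_nsec - t0.tv_nsec) / 1e9;
}
```

Calling it in a loop while something else generates write load should
show the stalls as occasional large return values.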

> ...but, of course, any call to fsync() ends up doing what looks like a
> synchronous wait on the data hitting its final location?

> Did I read the code correctly? It looks like you end up calling
> journal_force_commit for any fsync(), which sets handle->h_sync to 1,
> then does journal_stop.

Yes.  No choice there.  The only optimisation which we can
make is to allow other threads to piggyback onto the same
commit.  We do that aggressively, so with my simplistic
postfix simulator we can beat ext2 by a factor of thirty or
so.  If you choose the benchmark right :).  But it wasn't completely
> This means that journal_stop waits in log_wait_commit for the transaction
> to be flushed out of the journal and onto its final location?


> ...I must have that wrong. It's syncing that to the journal, not to the
> disk, right? So, fsync() results in any outstanding transaction having
> its blocks sent to the journal and waits on that, right?

> Assuming that's so, I must be very dense, as it seems that the kjournald
> code is also writing only to the journal, not to the actual disk.

Yes.  Once the buffers have been written to the journal they
are refiled for lazy writeback by kjournald.  The actual writeback
is performed by kupdate.  Unless we run out of memory, in which
case bdflush will push them to disk.  Or unless we run out of
journal space, in which case log_do_checkpoint() will write them

No matter who writes out the checkpoint-mode buffers, we cannot
reclaim a transaction's journal space until all the buffers for
that transaction are known to have been written to their final

Possibly log_do_checkpoint() could be smarter, and not write
out all dirty buffers.  Or it could start IO on all of them,
but stop waiting on writeout once sufficient journal space has
become available.  Writing them all out gets good clustering
and hence throughput.  But introduces latency for these bursty
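
The "stop once sufficient journal space has become available" idea can
be modelled in a few lines. This is a toy sketch only, not the
log_do_checkpoint() code: the function name txns_to_checkpoint() and the
array representation are invented here, and it assumes journal space is
reclaimed from the oldest transaction forward.

```c
#include <assert.h>
#include <stddef.h>

/*
 * Toy model (hypothetical): txn_blocks[i] is the journal space, in
 * blocks, held by the i-th oldest transaction awaiting checkpoint.
 * Returns how many of the oldest transactions must be fully written
 * back before at least `needed` journal blocks are reclaimable.
 * A "write everything" policy would always return n instead.
 */
size_t txns_to_checkpoint(const unsigned *txn_blocks, size_t n,
                          unsigned needed)
{
    unsigned freed = 0;
    size_t i = 0;

    assert(txn_blocks != NULL || n == 0);

    /* Space is only reclaimed once a whole transaction is on disk. */
    while (i < n && freed < needed) {
        freed += txn_blocks[i];
        i++;
    }
    return i;
}
```

With transactions holding 100, 50, 200 and 25 blocks and 120 blocks
needed, only the two oldest have to be waited on; the rest could still
have IO started on them without anyone blocking.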

> That means that the second or so of write-out every five seconds is
> because I have plenty of spare RAM for buffering output, but my disks
> are poor, slow IDE devices that write slowly.

And the blocks are allocated all over the disk :(

> ...
> >> Maybe I should hack the driver for a longer (or configurable) delay
> >> in flushing and see if I can reproduce the issue...
> >
> > I doubt if it'll help, but the `HZ * 5' in journal_init_common()
> > is what to tweak.
> I expect it will, actually, because it's going to increase the length of
> time between ext3 writing data to the journal and the kjournald thread
> kicking off the commit that forces those blocks to their final
> location...

The five second timeout doesn't come into play if someone is
generating lots of writes - we run out of journal space first.
And then there's fsync().
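
Put another way, the timer is only one of three things that can start
a commit. A rough sketch of that decision, with every name here
(why_commit() and the enum) made up for illustration rather than taken
from the jbd source, and the ordering of the checks an assumption:

```c
#include <assert.h>

/* Hypothetical labels for what triggered a commit. */
enum commit_reason {
    COMMIT_NONE,         /* nothing forces a commit yet */
    COMMIT_FSYNC,        /* someone is waiting in fsync() */
    COMMIT_JOURNAL_FULL, /* not enough free journal space */
    COMMIT_TIMEOUT       /* transaction older than the interval */
};

/*
 * Toy model: decide why (if at all) the running transaction would be
 * committed now. interval_secs plays the role of the five second
 * timeout; free_blocks and needed_blocks describe journal space.
 */
enum commit_reason why_commit(int fsync_pending,
                              unsigned free_blocks,
                              unsigned needed_blocks,
                              unsigned age_secs,
                              unsigned interval_secs)
{
    assert(interval_secs > 0);

    if (fsync_pending)
        return COMMIT_FSYNC;
    if (free_blocks < needed_blocks)
        return COMMIT_JOURNAL_FULL;
    if (age_secs >= interval_secs)
        return COMMIT_TIMEOUT;
    return COMMIT_NONE;
}
```

Under a heavy write load the second branch fires long before the
transaction is five seconds old, so a longer interval changes little.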

