
Re: Ordered Mode vs Journaled Mode



On Oct 02, 2001  17:05 -0700, Mike Fedyk wrote:
> On Tue, Oct 02, 2001 at 04:59:32PM -0600, Andreas Dilger wrote:
> > With ordered or writeback mode, it is possible that you are in the middle
> > of writing data into your file when you get a crash.  What is in the file?
> > Half new data and half old data.  
> 
> SCT has described in a recent LKML thread that during a truncate-write
> transaction the deleted blocks won't be written to until the
> transaction has completed.  This seems to imply that the previous data
> would be recoverable if the transaction didn't complete.

This is only true if you actually truncate the file first.  Yes, that is
normally the case, so you are pretty safe with either ordered or journaled
data.  With writeback, you just don't know if you will get garbage after
a truncate+write.
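
To make the case concrete, the pattern in question is the usual
truncate-then-rewrite idiom (a made-up sketch of mine, not anything from
sct's post):

    #include <fcntl.h>
    #include <unistd.h>

    /* The truncate+write case discussed above.  Roughly: with
     * data=ordered or data=journal a crash leaves either the old
     * contents or whatever new data had committed; with
     * data=writeback the file may contain stale garbage blocks. */
    int rewrite(const char *path, const void *buf, size_t len)
    {
            int fd = open(path, O_WRONLY | O_TRUNC);  /* truncate first */

            if (fd < 0)
                    return -1;
            if (write(fd, buf, len) != (ssize_t)len) {  /* then write */
                    close(fd);
                    return -1;
            }
            return close(fd);
    }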

> > While this isn't 100% true (i.e. if you have very large writes they will
> > be split into multiple transactions) it is mostly true.
> 
> Hmm, this may be a candidate for the ordered mode:
> Keep track of transactions that are for the same file and make sure that the
> deleted blocks aren't used until the entire set of transactions have
> completed...

For journaled data mode it is not possible, because you could make a
single write request larger than the entire size of the journal.  For
ordered data, there is very little need to break a single write over
multiple transactions, because only the metadata goes into the journal.
Note that I may be wrong on this with 2.4 because of oddities relating
to writepage.
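
(Made-up numbers to illustrate: with a 32MB journal, a single 128MB
write() in data=journal mode must be split across several commits, since
the data itself has to fit in the journal; in data=ordered mode the same
128MB goes straight to the fs and only a few KB of metadata passes
through the journal, so one transaction covers it.)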

> > You are also protected from disk hardware problems, where they write
> > garbage to the disk in a powerfail situation.  If garbage goes to disk
> > either it is in the journal (where it is discarded) or it will be written
> > to the filesystem again at journal recovery time.
> 
> Or the garbage is written again?

No, if garbage goes to the journal, you will not get an "end of transaction"
marker in the journal.  That marker is not even _started_ being written
until the rest of the transaction is _finished_.  If there is garbage in
the journal, it
will be discarded because it is not part of a complete breakfast, err,
transaction.  If the garbage is being written to the filesystem, then it
will be re-written with good data at journal recovery time because the
journal transaction will not be invalidated until the filesystem writes
are complete.
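
In pseudo-code, recovery at mount time does roughly this (my sketch of
the logic, not the actual jbd code, and the helper names are made up):

    #include <stddef.h>

    struct block;                                 /* opaque here */
    void write_to_home_location(struct block *);  /* hypothetical */

    struct txn {
            int committed;          /* commit record found in journal? */
            size_t nblocks;
            struct block **blocks;  /* copies logged in the journal */
    };

    void replay(struct txn *log, size_t ntxns)
    {
            for (size_t i = 0; i < ntxns; i++) {
                    if (!log[i].committed)
                            break;  /* no commit record: garbage or an
                                     * interrupted transaction, discard
                                     * it and everything after it */
                    for (size_t j = 0; j < log[i].nblocks; j++)
                            write_to_home_location(log[i].blocks[j]);
            }
            /* Journal space is only reclaimed after the fs writes
             * complete, so a crash during replay just replays again. */
    }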

> What happens if the drive decides to write garbage to completed transaction
> areas?  What about surrounding blocks?

Tell it "don't do that".  The journal itself is _pretty_ safe, and blocks
being written to the filesystem have "backups" in the journal that will
be rewritten in case of a crash.  Blocks "nearby" are unprotected, but there
is not much to be done in that case except throw the disk out.  It shouldn't
be writing to blocks it wasn't supposed to touch in the first place.

> > Finally, data journalling is _way_ faster if you are doing synchronous I/O
> > like for a mail server.  Both the data and metadata are written in one
> > pass to the journal, so no seeking, and you can return control to the
> > application knowing the data is safe.  It can then re-order writes to the
> > fs safely, since it will be recovered in case of a crash.
> 
> How long do you get before it has to start seeking? (that may be what the
> flush time means...)  If so, you could end up with a journal flush happening
> just as a new email has come in...  Hopefully the VM would let the flush
> complete before writing to the journal again...  But then again, the
> elevator doesn't seem to work very well at balancing multiple requests...

time = min(journal size / IO rate, flush interval), I believe.  Yes, you
can still have timing problems, but you have a chance to avoid the
seeking, since the elevator is free to reorder the writes to the fs, even
though they are "synchronous".  If you _really_ have seek problems, you
can always put the journal on another disk/NVRAM.
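
Made-up numbers to illustrate: a 32MB journal filling at 8MB/s of
synchronous mail traffic gives 32/8 = 4 seconds between forced flushes,
so with a 5 second flush interval the journal fills first and you
checkpoint every ~4 seconds; drop the load to 2MB/s and the 5 second
interval dominates instead.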

> > Obviously, data journaling adds a lot more overhead to a system, so it
> > isn't for everyone.  In most cases, however, you don't use your disks at
> > full speed anyways, so there is little real impact.
> 
> It would be nice to do what software raid does during a resync:
> Slowly write data out to the FS from the journal during idle, and not
> waiting until the journal fills up to flush it...

I don't _think_ it waits until the journal is full.  That's what the
flush interval is for: to start I/O into the fs for previously committed
transactions.  Andrew would know the exact details.  Possibly using some
knowledge of how busy the disk is could optimize this, by flushing
aggressively when the disk is idle to free up journal space (Daniel was
working on this a bit, although not in conjunction with ext3).
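
If someone did wire that up, the trigger might look something like this
(pure guesswork on my part, not the real kjournald logic):

    #include <stdbool.h>
    #include <stddef.h>

    /* Hypothetical checkpoint policy matching the description above. */
    bool should_checkpoint(double secs_since_commit, double flush_interval,
                           size_t journal_free, size_t low_water,
                           bool disk_idle)
    {
            return secs_since_commit >= flush_interval /* timed flush */
                || journal_free < low_water   /* journal nearly full */
                || disk_idle;                 /* flush aggressively when
                                               * idle to free journal
                                               * space early */
    }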

Cheers, Andreas
--
Andreas Dilger  \ "If a man ate a pound of pasta and a pound of antipasto,
                 \  would they cancel out, leaving him still hungry?"
http://www-mddsp.enel.ucalgary.ca/People/adilger/               -- Dogbert




