Re: Ordered Mode vs Journaled Mode



Hi,

On Tue, Oct 02, 2001 at 05:05:12PM -0700, Mike Fedyk wrote:

> SCT has described in a recent LKML thread that during a truncate-write
> transaction the deleted blocks won't be written to until the
> transaction has completed.  This seems to imply that the previous data
> would be recoverable if the transaction didn't complete.

Not really.  Think about the syscalls involved: the application is
probably going to do something like

	fd = open(filename, O_RDWR | O_TRUNC);
	for (...) {
		write(fd, buffer, pagesize);
	}

Now, each syscall is a separate transaction as far as ext3's atomicity
guarantees are concerned.   So, the initial truncate of the file will
either discard all of the data, or none of it, and on recovery, if the
truncate is wound back then all of the data is still intact.

However, if the truncate completes but you crash before the writes
commit, you still get an incomplete file after the crash.  To avoid
that, you'd have to do something a little different:

	fd = open(new_filename, O_RDWR | O_CREAT | O_TRUNC, 0666);
	for (...) {
		write(fd, buffer, pagesize);
	}
	close(fd);
	rename(new_filename, old_filename);

*Now* you are guaranteed that old_filename is intact in all
circumstances (except in the fast-and-loose data=writeback mode, in
which it's possible that the new data is not intact on disk after a
crash).
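
To make that concrete, here is roughly how you might package the
pattern as a helper.  This is a minimal sketch, not code from ext3 or
any real application: the function name is made up, and the fsync()
before the rename is an extra step beyond the snippet above (rename()
makes the name switch atomic, but on its own it does not force the new
file's data blocks out to the platter).

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	/* Sketch of atomic file replacement via rename().  Error
	 * handling is compressed for brevity. */
	int atomic_replace(const char *old_name, const char *tmp_name,
	                   const void *buf, size_t len)
	{
		int fd = open(tmp_name, O_RDWR | O_CREAT | O_TRUNC, 0666);

		if (fd < 0)
			return -1;
		if (write(fd, buf, len) != (ssize_t) len || fsync(fd) != 0) {
			close(fd);
			unlink(tmp_name);
			return -1;
		}
		close(fd);
		/* After a crash, old_name is either the old file or the
		 * complete new one -- never a mixture of the two. */
		return rename(tmp_name, old_name);
	}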

> > While this isn't 100% true (i.e. if you have very large writes they will
> > be split into multiple transactions) it is mostly true.
> >
> 
> Hmm, this may be a candidate for the ordered mode:
> Keep track of transactions that are for the same file and make sure that the
> deleted blocks aren't used until the entire set of transactions has
> completed...

What's a "set of transactions"?  Do you really expect a write of a
100MB file to be done atomically by the kernel?!

> Hmm, the data could even be flushed out of the journal in *journaled*
> mode, leaving the meta-data in the journal for the related
> transactions, while still keeping the previous blocks from being
> overwritten...
> 
> What do you think?

Not a chance.  There are already perfectly good ways of achieving
large-scale atomicity in user-space.  The rename() mechanism is one.
Databases are another.  Remember, a transaction-in-progress needs to
maintain both the old and the new state until it commits, so that an
interrupted partial commit can roll back entirely to the old state.
Keeping both sets of state for huge files basically requires the
journal to grow arbitrarily large, and you also get complications in
that other users on the system might be wanting their data to be
synced to disk faster than your 5-minute megatransaction would allow.

> > You are also protected from disk hardware problems, where they write
> > garbage to the disk in a powerfail situation.  If garbage goes to disk
> > either it is in the journal (where it is discarded) or it will be written
> > to the filesystem again at journal recovery time.
> 
> Or the garbage is written again?

No, we order writes to the journal very, very carefully, and we only
send a commit block to the disk once the disk has confirmed that all
of the previous journal writes are safe on oxide.  If a write to the
journal is interrupted, then we never send a commit block, and at
recovery time the incomplete transaction is simply discarded rather
than replayed.
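
In user-space terms the discipline looks something like the sketch
below.  The block size and the magic number are purely illustrative,
not ext3's actual on-disk journal format; the point is the fsync()
barrier between the journal blocks and the commit record.

	#include <stdint.h>
	#include <unistd.h>

	#define BLOCK_SIZE   4096
	#define COMMIT_MAGIC 0xc03b3998u	/* illustrative signature */

	/* Write a transaction's blocks to the journal, wait for the
	 * disk to confirm them, and only then emit the commit block. */
	int journal_commit(int jfd, const void *blocks, size_t nblocks)
	{
		uint32_t commit[BLOCK_SIZE / sizeof(uint32_t)] = { COMMIT_MAGIC };

		if (write(jfd, blocks, nblocks * BLOCK_SIZE) < 0)
			return -1;
		if (fsync(jfd) != 0)	/* wait: journal blocks safe on oxide */
			return -1;
		/* If we crash before this write completes, recovery finds
		 * no commit record and discards the whole transaction. */
		if (write(jfd, commit, BLOCK_SIZE) < 0)
			return -1;
		return fsync(jfd);
	}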

If the commit block itself is corrupt, that's fine: we only care about
the signature at the start of that block.  If the signature is intact
then we assume the rest of the journal is intact too, but we don't
care what is in the rest of the commit block.  If the commit block
doesn't look right then the transaction is thrown away entirely.
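
So the recovery-time test only needs to look at the front of the
block.  A minimal sketch, again with an illustrative signature rather
than the real journal format (byte-order handling omitted for
brevity):

	#include <stdint.h>
	#include <string.h>

	#define COMMIT_MAGIC 0xc03b3998u	/* illustrative signature */

	/* Trust only the signature at the start of the commit block;
	 * garbage in the remainder of the block is ignored. */
	int commit_block_valid(const unsigned char *block)
	{
		uint32_t magic;

		memcpy(&magic, block, sizeof(magic));
		return magic == COMMIT_MAGIC;
	}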

> What happens if the drive decides to write garbage to completed transaction
> areas?  What about surrounding blocks?

If the drive randomly scribbles over other data then you are screwed
and should send it back to the manufacturer.

> > Obviously, data journaling adds a lot more overhead to a system, so it
> > isn't for everyone.  In most cases, however, you don't use your disks at
> > full speed anyways, so there is little real impact.
 
> It would be nice to do what software raid does during a resync:
> Slowly write data out to the FS from the journal during idle time,
> and not wait until the journal fills up to flush it...

That's what we do --- there's a timeout which controls how long we
wait after a transaction starts before beginning the background
commit.  If the journal is large enough, all the commits will be
happening in this lazy background manner, without any journal stalls
due to the journal filling up.
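
As a toy illustration (not kjournald itself), imagine the journal as a
plain file descriptor and the commit as an fsync(); the interval below
is just a stand-in for the real commit timer:

	#include <unistd.h>

	#define COMMIT_INTERVAL_SECS 5	/* stand-in for the commit timer */

	/* Background flusher: wake up on a timer and push whatever has
	 * accumulated to disk, rather than waiting for the journal to
	 * fill up. */
	void background_commit_loop(int journal_fd)
	{
		for (;;) {
			sleep(COMMIT_INTERVAL_SECS);
			fsync(journal_fd);
		}
	}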

Cheers,
 Stephen




