
Possible features for next major ext3 release [was: Ordered Mode vs Journaled Mode]



Hi Stephen,

On Wed, Oct 03, 2001 at 01:43:46PM +0100, Stephen C. Tweedie wrote:
> Hi,
> 
> On Tue, Oct 02, 2001 at 05:05:12PM -0700, Mike Fedyk wrote:
> > > While this isn't 100% true (i.e. if you have very large writes they will
> > > be split into multiple transactions) it is mostly true.
> > >
> > 
> > Hmm, this may be a candidate for the ordered mode:
> > Keep track of transactions that are for the same file and make sure that the
> > deleted blocks aren't used until the entire set of transactions have
> > completed...
> 
> What's a "set of transactions"?  

I'm sure I'm not taking into account all of the possible concurrent
operations that could happen, but let's take this example:

truncate $file (keep track which blocks it had before the truncate)
write 100MB to $file 
(write still in progress in journaled mode)
power off

restarting and recovering:
find the uncompleted write to $file, notice that $file was truncated
just before it, and revert the truncate transaction too.

This is possible because the old blocks stay reserved until a write to the
file has completed.

Question: do you have to write 100MB in chunks of several KB with multiple
write calls?  If so, then this would be much more complicated.

Hmm, it looks like what I'm asking for is combining a truncate and write
transaction into one...
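
A toy sketch of the truncate-then-write idea above, in Python.  This is
not ext3 or JBD code: the ToyFS class and all of its method names are
made up for illustration, and it only models which blocks are free,
reserved, or in use.  It assumes a write is tracked as one unit from
begin_write() to commit_write().

```python
class ToyFS:
    """In-memory toy model of block reservation; not how ext3 works."""

    def __init__(self, nblocks):
        self.free = set(range(nblocks))  # blocks available for allocation
        self.files = {}                  # name -> list of block numbers
        self.reserved = {}               # name -> old blocks held after truncate
        self.pending = {}                # name -> blocks of an uncommitted write

    def _alloc(self, count):
        if count > len(self.free):
            raise OSError("no space left")
        blocks = sorted(self.free)[:count]
        self.free.difference_update(blocks)
        return blocks

    def create(self, name, count):
        self.files[name] = self._alloc(count)

    def truncate(self, name):
        # Hold the old blocks back instead of returning them to the free pool.
        self.reserved[name] = self.files.get(name, [])
        self.files[name] = []

    def begin_write(self, name, count):
        new = self._alloc(count)
        self.pending[name] = new
        self.files[name] = self.files.get(name, []) + new

    def commit_write(self, name):
        # The write is complete, so the old blocks can finally be reused.
        self.pending.pop(name, None)
        self.free.update(self.reserved.pop(name, []))

    def recover(self):
        # After a crash: drop each incomplete write and, if a truncate
        # came just before it, revert that truncate too.
        for name, new in self.pending.items():
            kept = [b for b in self.files[name] if b not in new]
            self.free.update(new)
            self.files[name] = self.reserved.pop(name, kept)
        self.pending.clear()
```

Calling create(), truncate(), begin_write() and then recover() without
commit_write() leaves the file with its original blocks, which is the
"revert the truncate transaction too" behaviour described above.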

> Do you really expect a write of a
> 100MB file to be done atomically by the kernel?!
>

It seems possible if that is the only load, or most of it, especially for a
400MB journal (the maximum with 4KB blocks).

A certain small percentage of the journal could be used for keeping track of
reserved blocks.

> > Hmm, the data could even be flushed out of the journal in *journaled* mode
> > leaving the meta-data in the journal for the related transactions while
> > still keeping the previous blocks from being overwritten...
> > 
> > What do you think?
> 
> Not a chance.  There are already perfectly good ways of achieving
> large-scale atomicity in user-space.  The rename() mechanism is one.
> Databases are another.  Remember, a transaction-in-progress needs to
> maintain both the old and the new state until it commits, so that an
> interrupted partial commit can roll back entirely to the old state.
> Keeping both sets of state for huge files basically requires the
> journal to grow arbitrarily large, and you also get complications in
> that other users on the system might be wanting their data to be
> synced to disk faster than your 5-minute megatransaction would allow.
>

True.
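
For reference, a minimal sketch of the rename() approach Stephen mentions,
in Python.  The helper name atomic_replace is made up for illustration.
It assumes a POSIX system, where rename() atomically replaces an existing
target, and it assumes the temporary file can live in the same directory
as the target so that both are on the same filesystem.

```python
import os
import tempfile


def atomic_replace(path, data):
    """Replace the contents of `path` with `data` (bytes) so that a crash
    leaves either the complete old file or the complete new file."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file is created with mode 0600; copy permissions if needed.
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "wb") as f:
            f.write(data)
            f.flush()
            os.fsync(f.fileno())   # new data is on disk before the rename
        os.rename(tmp, path)       # atomic switch from old to new contents
    except BaseException:
        try:
            os.unlink(tmp)
        except FileNotFoundError:
            pass
        raise
    # Sync the directory so the rename itself survives a power failure.
    dfd = os.open(directory, os.O_RDONLY)
    try:
        os.fsync(dfd)
    finally:
        os.close(dfd)
```

The old file stays untouched until the new one is completely written, so
the "old and new state" both exist on disk without the journal having to
hold them.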

Though, it would be very nice to be able to try things like this when there
isn't any contention...

All you need to do (for truncate; write) is keep the data blocks that were
allocated before the truncate reserved until so many new transactions have
completed on that file...  The reservation would of course be dropped when
free space runs low.

> > > You are also protected from disk hardware problems, where they write
> > > garbage to the disk in a powerfail situation.  If garbage goes to disk
> > > either it is in the journal (where it is discarded) or it will be written
> > > to the filesystem again at journal recovery time.
> > 
> > Or the garbage is written again?
> 
> No, we order writes to the journal very, very carefully, and we only
> send a commit block to the disk once the disk has confirmed that all
> of the previous journal writes are safe on oxide.  If we get an
> interrupted write to the journal, then we'll never send a commit block
> and the data will never be recovered.
> 
> If the commit block itself is corrupt, that's fine: we only care about
> the signature at the start of that block.  If it is intact then the
> rest of the journal we assume to be intact too, but we don't care what
> is in the rest of the commit block.  If the commit block doesn't look
> right then it is thrown away entirely.
>

Great.
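
A rough sketch of the recovery rule Stephen describes, in Python, as I
understand it.  This is not the real JBD on-disk format: COMMIT_SIG, the
dict layout of a transaction, and the recover() function are all invented
for illustration.  It also assumes that recovery stops at the first
transaction without a valid commit block.

```python
COMMIT_SIG = b"CMIT"  # made-up signature standing in for the real header


def recover(journal, disk):
    """Replay committed transactions from `journal` into `disk`.

    journal: list of transactions in log order; each is a dict with
        "writes": list of (block_number, data) pairs logged for it
        "commit": bytes of the commit block, or None if never written
    disk: dict mapping block_number -> data, updated in place
    Returns the number of transactions replayed.
    """
    replayed = 0
    for tx in journal:
        commit = tx["commit"]
        # Only the signature at the start of the commit block matters;
        # the rest of that block is ignored, even if it is garbage.
        if commit is None or not commit.startswith(COMMIT_SIG):
            break  # no valid commit: this transaction is thrown away
        for blockno, data in tx["writes"]:
            disk[blockno] = data
        replayed += 1
    return replayed
```

Garbage in an uncommitted transaction never reaches the filesystem, and a
committed transaction is only trusted because its journal blocks were
confirmed on disk before the commit block was sent.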

> 
> Cheers,
>  Stephen
> 

Obviously these features would only be entertained in a 1.1.0 release
(assuming that you keep to the odd/even dev/stable concept).

Mike




