Frequent metadata corruption with ext3 + hard power-off

Mats Ahlgren mats_a at MIT.EDU
Sat May 19 01:42:56 UTC 2007


On Friday 18 May 2007 15:03:46 Andreas Dilger wrote:
> On May 18, 2007  09:48 -0400, Mats Ahlgren wrote:
> > Namely, I'm confused: I would guess caching simply delays the time data
> > gets to disk, and perhaps exacerbates data being written in
> > not-the-order it was given? But, how could this cause a problem on a
> > journaled filesystem? if one is (theoretically) only appending to the
> > journal, checksumming/hashing to detect consistent journal entries on
> > failure (since the last checkpoint), and only replaying consistent
> > journal entries (which are idempotent)... then, assuming all those
> > things above work, how could caching cause massive corruption of the
> > directory tree? (Is the above an accurate model for ext3?)
> 
> One issue is that we do not YET have journal checksumming in order to detect
> the case where the commit block is written to the disk but not all of the
> disk-cached blocks in the rest of that transaction have been written yet.
> That is where the big risk comes in for writeback cache in the device.
Yikes... (that was my best guess for what was going on)

> Ideally, the jbd layer could be notified when the transaction blocks are
> flushed from device cache before writing the commit block, but the current
> Linux mechanism to do this (write barriers) sucks performance-wise (it
> sent throughput from 180MB/s to 7MB/s when enabled in our test systems).
> It was better to just turn off write cache entirely than to use barriers.
> 
> We have a patch for journal checksumming that is _right_ on the verge of
> being ready for fixing the "commit-block before transaction blocks" problem.
> In fact, in earlier testing it improved performance in some cases because
> it allows the commit block to always be sent to disk at the same time as the
> transaction blocks because we know the checksum will tell us if there were
> any blocks not written to disk.
Good to hear! It's a pity ext3 didn't have journal checksumming from its 
inception, but I'm glad you guys are fixing it. This seems like a serious 
problem for people who aren't aware of it.
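
To make the failure mode concrete for anyone reading the archive, here is a
minimal Python sketch of the "commit block before transaction blocks" case.
It is a toy model, not jbd code: the transaction layout, the function names
(make_transaction, crash_with_reordering, replay) and the choice of CRC32 are
all assumptions made for illustration. It only shows why a checksum stored in
the commit record lets replay reject a transaction whose blocks never left
the drive's write cache.

```python
import zlib


def checksum(blocks):
    """CRC32 over all journal blocks of one transaction (illustrative choice)."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def make_transaction(blocks):
    """Toy transaction: the logged blocks plus a commit record.

    The commit record carrying a checksum is a hypothetical layout,
    not the real jbd on-disk format.
    """
    return {"blocks": list(blocks), "commit": {"crc": checksum(blocks)}}


def crash_with_reordering(txn, lost_index):
    """Simulate a hard power-off where the drive's write cache persisted the
    commit record but not block `lost_index`, so that journal slot still
    holds stale bytes (modelled here as zeros)."""
    on_disk = list(txn["blocks"])
    on_disk[lost_index] = bytes(len(on_disk[lost_index]))
    return {"blocks": on_disk, "commit": dict(txn["commit"])}


def replay(txn, verify_checksum):
    """Return the blocks that would be copied into the filesystem,
    or None if the transaction is rejected."""
    if verify_checksum and checksum(txn["blocks"]) != txn["commit"]["crc"]:
        return None  # incomplete transaction detected, skip it
    # Without a checksum, the presence of the commit record is trusted.
    return txn["blocks"]


if __name__ == "__main__":
    txn = make_transaction([b"ino1", b"dir2"])
    damaged = crash_with_reordering(txn, lost_index=1)
    print("no checksum  :", replay(damaged, verify_checksum=False))
    print("with checksum:", replay(damaged, verify_checksum=True))
```

Without verification, replay copies the stale block into the filesystem as if
it were valid metadata, which is one way a directory tree could end up
corrupted even though every replayed entry is idempotent. With verification,
the whole transaction is skipped and the filesystem stays at its previous
consistent state.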

Sincerely,
Mats

> Girish, could you post your latest tested patch here for review?


[snip]


> > On Sunday 18 March 2007 09:33:59 Theodore Tso wrote:
> > > It sounds like you have a disk which is doing very aggressive write
> > > caching.  If you are using a new enough kernel (2.6.9 or greater
> > > should have this), adding "barrier=1" to your mount options should
> > > help.  We should probably make this the default at this point...
> > > 
> > > 						- Ted
> 
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
> 
> 




