Frequent metadata corruption with ext3 + hard power-off
Mats Ahlgren
mats_a at MIT.EDU
Sat May 19 01:42:56 UTC 2007
On Friday 18 May 2007 15:03:46 Andreas Dilger wrote:
> On May 18, 2007 09:48 -0400, Mats Ahlgren wrote:
> > Namely, I'm confused: I would guess caching simply delays the time data
> > gets to disk, and perhaps exacerbates data being written in
> > not-the-order it was given? But, how could this cause a problem on a
> > journaled filesystem? if one is (theoretically) only appending to the
> > journal, checksumming/hashing to detect consistent journal entries on
> > failure (since the last checkpoint), and only replaying consistent
> > journal entries (which are idempotent)... then, assuming all those
> > things above work, how could caching cause massive corruption of the
> > directory tree? (Is the above an accurate model for ext3?)
>
> One issue is that we do not YET have journal checksumming in order to detect
> the case where the commit block is written to the disk but not all of the
> disk-cached blocks in the rest of that transaction have been written yet.
> That is where the big risk comes in for writeback cache in the device.
Yikes... (that was my best guess for what was going on)
> Ideally, the jbd layer could be notified when the transaction blocks are
> flushed from device cache before writing the commit block, but the current
> Linux mechanism to do this (write barriers) sucks performance-wise (it
> sent throughput from 180MB/s to 7MB/s when enabled in our test systems).
> It was better to just turn off write cache entirely than to use barriers.
>
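To make the ordering problem concrete, here is a toy model of it. This is only a sketch under my own assumptions, not how jbd or any real drive is implemented: CachingDisk, commit() and crash_after_commit() are hypothetical names, the cache is just a dict, and the "barrier" is simplified to a single flush before the commit block is written.

```python
class CachingDisk:
    """Toy drive: writes land in a volatile cache, not on the platter."""

    def __init__(self):
        self.platter = {}  # blocks that survive a power loss
        self.cache = {}    # blocks that are lost on power-off

    def write(self, lba, data):
        self.cache[lba] = data

    def flush(self):
        """Force every cached block onto the platter."""
        self.platter.update(self.cache)
        self.cache.clear()

    def power_off(self, survivors):
        """Lose power: only the cached blocks listed in `survivors`
        happened to reach the platter; the rest are gone."""
        for lba in survivors:
            if lba in self.cache:
                self.platter[lba] = self.cache[lba]
        self.cache.clear()


def commit(disk, blocks, commit_lba, use_barrier):
    """Write the transaction blocks, then the commit block."""
    for lba, data in blocks.items():
        disk.write(lba, data)
    if use_barrier:
        # Simplified barrier: the transaction blocks are on the platter
        # before the commit block is even handed to the drive.
        disk.flush()
    disk.write(commit_lba, b"COMMIT")


def crash_after_commit(use_barrier):
    """Worst case: the drive wrote the commit block first, then lost power."""
    disk = CachingDisk()
    commit(disk, {10: b"A", 11: b"B"}, commit_lba=12, use_barrier=use_barrier)
    disk.power_off(survivors=[12])
    return disk.platter
```

Without the flush, the platter can end up holding only the commit block at LBA 12, which is exactly the "commit block without its transaction blocks" case. With the flush, blocks 10 and 11 are already safe by the time the commit block exists. The cost is that every commit has to wait for a full cache flush, which fits with the throughput drop described above.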
> We have a patch for journal checksumming that is _right_ at the verge of
> being ready for fixing the "commit-block before transaction blocks" problem.
> In fact, in earlier testing it improved performance in some cases because
> it allows the commit block to always be sent to disk at the same time as the
> transaction blocks because we know the checksum will tell us if there were
> any blocks not written to disk.
Good to hear! It's a pity ext3 didn't have journal checksumming from its
inception, but I'm glad you guys are fixing it. This seems like a serious
problem for people who aren't aware of it.
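For anyone else following along, this is roughly how I picture the checksum helping. It is a sketch under my own assumptions, not the actual patch or the on-disk jbd format: the dict layout, the function names, and the choice of CRC32 are all hypothetical and only there to show the idea.

```python
import zlib


def commit_checksum(blocks):
    """CRC32 over all blocks of a transaction, in order."""
    crc = 0
    for block in blocks:
        crc = zlib.crc32(block, crc)
    return crc


def write_transaction(blocks):
    """Journal contents as they should look once everything is on disk."""
    return {
        "blocks": list(blocks),
        "commit": {"count": len(blocks), "crc": commit_checksum(blocks)},
    }


def replay_naive(txn):
    """Trust the commit block: if it is present, replay every block."""
    if txn["commit"] is None:
        return None
    return txn["blocks"]


def replay_checked(txn):
    """Replay only if the blocks match the checksum in the commit block."""
    commit = txn["commit"]
    if commit is None:
        return None
    if commit_checksum(txn["blocks"]) != commit["crc"]:
        return None  # torn transaction: discard instead of replaying
    return txn["blocks"]


# Simulated power loss: the commit block reached the disk, but the second
# transaction block never left the drive cache, so the journal slot still
# holds whatever was there before.
torn = write_transaction([b"inode table v2", b"dir block v2"])
torn["blocks"][1] = b"stale bytes from an older transaction"
```

With this setup, replay_naive(torn) hands the stale block back to be written over live metadata, while replay_checked(torn) returns None and the transaction is dropped. An intact transaction passes both. That would also explain why the commit block can go out together with the transaction blocks: the order no longer has to be trusted, because a mismatch is detected at replay time.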
Sincerely,
Mats
> Girish, could you post your latest tested patch here for review?
[snip]
> > On Sunday 18 March 2007 09:33:59 Theodore Tso wrote:
> > > It sounds like you have a disk which is doing very aggressive write
> > > caching. If you are using a new enough kernel (2.6.9 or greater
> > > should have this), adding "barrier=1" to your mount options should
> > > help. We should probably make this the default at this point...
> > >
> > > - Ted
>
> Cheers, Andreas
> --
> Andreas Dilger
> Principal Software Engineer
> Cluster File Systems, Inc.
>
>