Fwd: ext3 for 2.4
- From: "Douglas J. Hunley" <root hunley homeip net>
- To: ext3-users redhat com
- Subject: Fwd: ext3 for 2.4
- Date: Thu, 17 May 2001 12:38:37 -0400
---------- Forwarded Message ----------
Subject: ext3 for 2.4
Date: Thu, 17 May 2001 21:20:38 +1000
From: Andrew Morton <andrewm uow edu au>
To: ext2-devel lists sourceforge net, "Peter J. Braam"
<braam mountainviewdata com>, Andreas Dilger <adilger turbolinux com>,
"Stephen C. Tweedie" <sct redhat com>
Cc: linux-fsdevel vger kernel org
Summary: ext3 works, page_launder() doesn't :)
The tree is based on the porting work which Peter Braam did. It's
in cvs in Jeff Garzik's home on sourceforge. Info on CVS is at
http://sourceforge.net/cvs/?group_id=3242 - the module name
is `ext3'. There's a README there which describes how to
apply the patchset.
Current status is: quite solid. Stress testing on x86/SMP
passes and performance in ordered data and writeback data
mode is good. Journalled data performance is, of course, so-so.
The only big issue of which I am aware is a VM livelock
on SMP, discussed below.
The patch is against 2.4.4-ac9.
- quotas appear to work OK. I'll leave them turned on
as I test things, and watch out for oddities.
It's hard to find working quota tools. Most of them
either won't compile or don't understand
ext3. Jan Kara is maintaining a set of quota tools
at http://www.sourceforge.net/projects/linuxquota/ which
work well. The current CVS tree from there seems to be
under XFS development at present and needs a couple of
patches to work against ext3 (and even ext2). I can send them
to whoever needs them.
- Recovery works fine now. The bug was that I was splicing new
blocks into a file in ext3_splice_branch() *before* doing a
journal_get_write_access() on its parent's buffer. Duh.
- Four debugging fields (b_alloc_transaction, etc.) have been
removed from buffer_head. I couldn't find a use for them
in 2.4. In 2.2, these were set in ext3_new_block() when we
do a getblk() on the new block.
In 2.4, we don't do the getblk() any more...
- Some tightening of the way commit feeds buffers into the
request queues. At present, 256 buffers are fed into
ll_rw_block() before we run tq_disk. I *was* pushing
thousands down. It doesn't seem to make much difference.
Overall throughput with some benchmarks in ordered data
mode has been significantly improved by this change.
ext3 in general seems faster in 2.4 than in 2.2, presumably
because of better request merging.
Much more work needs to go into benchmarking and performance
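As an illustration only (this is not code from the patch), the batching described in the commit item above can be modelled in user space. The names commit_buffers, submit_batch and run_queue are hypothetical stand-ins for the commit loop, ll_rw_block() and running tq_disk; only the figure of 256 comes from the text.

```python
def commit_buffers(buffers, submit_batch, run_queue, batch_size=256):
    """Feed buffers to the block layer in fixed-size batches.

    submit_batch stands in for ll_rw_block(); run_queue stands in for
    running tq_disk. Both are hypothetical callables supplied by the caller.
    """
    batch = []
    for bh in buffers:
        batch.append(bh)
        if len(batch) == batch_size:
            submit_batch(batch)
            run_queue()          # kick the request queue after each full batch
            batch = []
    if batch:                    # flush whatever is left over
        submit_batch(batch)
        run_queue()
```

With 600 buffers this submits batches of 256, 256 and 88, and kicks the queue three times.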
- There's an issue with page_launder():
This is bad. It will cause ext3 to be reentered while it
has a transaction open against a different fs. This will
corrupt filesystems and can deadlock.
Making ext3_file_write() set PF_MEMALLOC wasn't suitable. It
easily causes 0-order allocation failures within generic_file_write().
The current approach to this is, in ext3_writepage(), to detect
when ext3 is being reentered and to simply *return* without
writing the page at all.
This is kludgy but should work - the only place where the fs can
be reentered via writepage() is from page_launder(), and
page_launder() doesn't wait on the page. Quotas don't use
writepage(), and reentry there is OK.
If Marcelo's `priority' argument to writepage() goes in,
this can be used in a more sensible manner.
Note that this return-if-reentered code is not related to
the VM livelock. It has a big printk in it at present.
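The return-if-reentered idea can be sketched as a small user-space model. This is an assumption-laden toy, not the ext3 code: Task, open_fs and the list standing in for the disk are all hypothetical, and the real check in ext3_writepage() may look quite different.

```python
class Task:
    """Stand-in for a kernel task. open_fs names the filesystem this task
    currently has a transaction open against, or None."""
    def __init__(self):
        self.open_fs = None


def writepage(task, fs, page, disk):
    """Write `page` to `fs` unless the task is already inside a transaction.

    Returns True if the page was written, False if it was skipped.
    """
    if task.open_fs is not None:
        # Reentered (for example via page_launder()): return without
        # writing. The page simply stays dirty for a later attempt.
        return False
    task.open_fs = fs            # open a transaction, in spirit
    try:
        disk.append((fs, page))
    finally:
        task.open_fs = None      # close the transaction, in spirit
    return True
```

A task with no transaction open writes the page as usual. A task that already holds a transaction against another filesystem gets False back and nothing reaches the disk, which is acceptable only because the caller in question does not wait on the page.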
- Some new test tools:
To simulate crashes I have added a new mount option:
mount /dev/foo /mnt/bar -t ext3 -o ro-after=NNN
When the fs is mounted this way a timer will fire after
NNN jiffies and will turn the underlying device immutable.
It does this by setting a flag which is tested in submit_bh().
For WRITE requests submit_bh() will simply call
bh_end_io(uptodate=1) and return.
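A toy model of that behaviour, for illustration: writes to a device that has gone immutable are dropped, but still reported as successful. Device, BufferHead and end_io are hypothetical stand-ins, and the timer that sets the flag is not modelled.

```python
READ, WRITE = 0, 1


class Device:
    """Toy block device: block number -> data, plus the 'crashed' flag."""
    def __init__(self):
        self.blocks = {}
        self.read_only = False   # in the patch, a timer sets the real flag


class BufferHead:
    def __init__(self, blocknr, data=None):
        self.blocknr = blocknr
        self.data = data
        self.uptodate = False


def end_io(bh, uptodate):
    """Completion callback: record whether the I/O 'succeeded'."""
    bh.uptodate = bool(uptodate)


def submit_bh(rw, bh, dev):
    if rw == WRITE:
        if not dev.read_only:
            dev.blocks[bh.blocknr] = bh.data
        # Either way the caller is told the write completed fine, so the
        # filesystem cannot tell that the data never reached the device.
        end_io(bh, uptodate=1)
        return
    bh.data = dev.blocks.get(bh.blocknr)
    end_io(bh, uptodate=1)
```

Because the write path reports success after the flag is set, the filesystem keeps running exactly as it would up to a real power cut, and the on-disk image is frozen at the moment the timer fired.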
There's a new ext3 ioctl() which will block the caller until the
device has gone readonly. I semi-randomly chose
#define EXT3_WAIT_FOR_READONLY _IOR('w', 1, long)
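For anyone driving this from a script rather than from C, the request number can be computed by hand. The sketch below assumes the ioctl encoding used on x86 (8 bits of number, 8 bits of type, 14 bits of size, 2 bits of direction, with read = 2); some other architectures lay the bits out differently. The helper name ior is hypothetical.

```python
import struct

_IOC_READ = 2          # direction value, assuming the x86 layout
_IOC_NRSHIFT = 0
_IOC_TYPESHIFT = 8
_IOC_SIZESHIFT = 16
_IOC_DIRSHIFT = 30


def ior(type_char, nr, size):
    """Encode a request number as _IOR(type, nr, T) would for sizeof(T) == size."""
    return ((_IOC_READ << _IOC_DIRSHIFT)
            | (size << _IOC_SIZESHIFT)
            | (ord(type_char) << _IOC_TYPESHIFT)
            | (nr << _IOC_NRSHIFT))


# Size of a native C long on the machine running this script.
SIZEOF_LONG = struct.calcsize("l")
EXT3_WAIT_FOR_READONLY = ior("w", 1, SIZEOF_LONG)
```

With a 4-byte long this gives 0x80047701, and with an 8-byte long 0x80087701, so the value has to be computed for the kernel actually under test.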
The intent here is that a controlling script will:
1: Mount the fs with ro-after=1000 (ten seconds)
2: Start a test script (e.g. dbench)
3: Block on the wait-for-readonly ioctl
4: Wake up when the disk has "crashed"
5: Kill off the test script
6: Unmount the fs
7: Mount the fs (let recovery run)
8: Unmount the fs
9: Run e2fsck to check that the fs is sane
10: Modify the ro-after parameter
11: Do it all again
Scripts which do all this are in the testing/ and tools/
directories. I've been happily simulating crashes in
the middle of `dbench 12' runs for an hour now. All is well.
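The real scripts are the ones in the tree. Purely as an illustration of the control loop above, one cycle might look like the following sketch. Everything here is an assumption except the mount line and `dbench 12`, which come from this message: the function names are hypothetical, HZ=100 is inferred from "1000 (ten seconds)", the e2fsck flags (-f to force a check, -n to open read-only) are one plausible choice, and wait_readonly is left to the caller because the details of invoking the ioctl are not given here.

```python
import subprocess

HZ = 100  # assumption: 100 jiffies per second


def seconds_to_jiffies(seconds, hz=HZ):
    return int(seconds * hz)


def run_checked(cmd):
    """Run a command and raise if it exits non-zero."""
    subprocess.run(cmd, check=True)


def start_dbench(mnt):
    """Start the workload in the mounted fs. A real script would also need
    to kill dbench's child processes, which this sketch does not handle."""
    return subprocess.Popen(["dbench", "12"], cwd=mnt)


def crash_cycle(dev, mnt, ro_after, run, start_workload, wait_readonly):
    """One simulated crash and recovery. Numbers refer to the steps above."""
    run(["mount", dev, mnt, "-t", "ext3", "-o", "ro-after=%d" % ro_after])  # 1
    workload = start_workload(mnt)            # 2
    wait_readonly(mnt)                        # 3, 4: blocks until the "crash"
    workload.kill()                           # 5
    workload.wait()
    run(["umount", mnt])                      # 6
    run(["mount", dev, mnt, "-t", "ext3"])    # 7: recovery runs here
    run(["umount", mnt])                      # 8
    run(["e2fsck", "-f", "-n", dev])          # 9: raises if fsck reports errors
```

Steps 10 and 11 are then just a loop over different ro_after values, calling crash_cycle(dev, mnt, n, run_checked, start_dbench, wait_readonly) for each.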
I think this covers everything except for verifying that the
data content of the files is sane. That can be handled with
test tools. Special code will probably be needed to simulate
crashes during truncate - with this shotgun approach the fs
tends to go immutable before *any* of the truncate has committed,
and it's as if nothing ever happened.
The `ro-after' code and submit_bh() changes are conditional
So. The big outstanding issue is the VM livelock. It only
happens on SMP. If you have a huge amount of dirty data
in the system (a big `dd if=/dev/zero of=foo' will do it),
everything goes happily for 10-20 seconds and then it freezes.
The write drop-behind code should be cutting in here.
Running dbench, same problem. We have seven runnable tasks, *all*
of them chugging away in page_launder(), none of them making any
appreciable progress. Disk throughput falls to about one LED-flicker
per three seconds.
After 10-30 seconds, things magically clear themselves and
it starts back up. And then after ten seconds it chokes again.
It could be something silly I've done. It doesn't seem to affect
other filesystems. Tomorrow will tell.
Douglas J. Hunley (Linux User #174778)
http://hunley.homeip.net/ http://linux.nf/ http://www.kicq.org/
Linux isn't at war. War involves large numbers of people making losing
decisions that harm each other in a vain attempt to lose last.
Linux is about winning.