mke2fs options for very large filesystems (and corruption!)

Damian Menscher menscher at uiuc.edu
Wed Feb 16 21:02:50 UTC 2005


[sorry if this isn't threaded right... I just subscribed]

Theodore Ts'o wrote:
> 
> There are two reasons for the reserve.  One is to reserve space on the
> partition containing /var and /etc for log files, etc.  The other is
> to avoid the performance degradation when the last 5-10% of the disk
> space is used.  (BSD actually reserves 10% by default.)  Given that
> a 200 GB disk costs under $100 USD, the cost of reserving 10GB is
> approximately $5.00 USD, or about the cost of a large cup of coffee
> at Starbucks.
> 
> Why does filesystem performance degrade?  Because as the last
> blocks of a filesystem are consumed, the probability of filesystem
> fragmentation goes up dramatically.  Exactly when this happens depends
> on the write patterns of the workload and the distribution of the file
> sizes on the filesystem, but on average this tends to happen when
> utilization rises to 90-95%.

I'm working on something similar -- we're hoping to create a single
3.5 TB ext3 filesystem.

For us, 5% is 175 GB.  Since we're using hot-swap SCSI disks instead
of IDE, that comes out to well over $100, so we'll be going with
1-2%.  Even then, we're leaving several GB available, which will
probably be fine.  When we fill the disk, I expect we'll be more
concerned about having no free space than about its performance.
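
(For reference, the reserve is tunable: mke2fs takes -m to set the
reserved percentage at mkfs time, and tune2fs -m adjusts it on an
existing filesystem.  A sketch of what we're planning -- the device
name is just our example:)

  # create an ext3 filesystem with 1% reserved instead of the default 5%
  mke2fs -j -m 1 /dev/sdb1

  # or change the reserve later, without reformatting
  tune2fs -m 1 /dev/sdb1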

More importantly, I have some questions about whether ext3 can
realistically handle a filesystem this large.  Since fdisk creates a
DOS partition table, which can't handle a partition this large, I
used parted to create a GPT partition of 3.5TB.  We were then able
to mkfs and mount the filesystem.  As a test, I filled it with:
  cp 1_gig_file 1_gig_file.###    (where ### ranged from 000 to 999)
and
  dd if=/dev/zero of=bigfile bs=10M
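
(In case anyone wants to reproduce this, the cp step was effectively
the following loop -- just a sketch of our test, with seq -w padding
the suffix to three digits:)

  for i in `seq -w 0 999`; do
    cp 1_gig_file 1_gig_file.$i
  done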

I expected the dd to fail once bigfile hit 2TB, but it actually
filled the filesystem (creating a 2.7TB file).  The next bit is
slightly muddled:
  09:50 deleted the bigfile (this took a LONG time)
  10:04 pulled a drive on the raid array (hardware raid5)
  10:12 errors started appearing in the logs, and the filesystem
        remounted itself read-only

Here are some sample errors:

Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 1027000215, count = 1
Feb 16 10:12:05 hera kernel: Aborting journal on device sdb1.
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 3696983906, count = 1
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 2229194010, count = 1
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 1172112249, count = 1
Feb 16 10:12:05 hera kernel: EXT3-fs error (device sdb1): ext3_free_blocks: Freeing blocks not in datazone - block = 3315908307, count = 1
Feb 16 10:12:05 hera kernel: ext3_free_blocks_sb: aborting transaction: Journal has aborted in __ext3_journal_get_undo_access<2>EXT3-fs error (device sdb1) in ext3_free_blocks_sb: Journal has aborted
...
Feb 16 10:12:10 hera kernel: EXT3-fs error (device sdb1) in ext3_delete_inode: Journal has aborted
Feb 16 10:12:10 hera kernel: __journal_remove_journal_head: freeing b_committed_data
Feb 16 10:12:10 hera last message repeated 5 times 
Feb 16 10:12:10 hera kernel: ext3_abort called.
Feb 16 10:12:10 hera kernel: EXT3-fs error (device sdb1): ext3_journal_start_sb: Detected aborted journal
Feb 16 10:12:10 hera kernel: Remounting filesystem read-only

My best guess at what happened here is that "bigfile" grew past the
maximum file size (it was a 2.7TB file, but the max file size for
ext3 is 2.0TB).  So when we deleted it, the delete probably freed
data blocks belonging to lots of other files.  Or at least tried
to... maybe these logs indicate failure.
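
(My understanding -- hedged, since I haven't checked this against the
kernel source -- is that the 2TB figure comes from the inode's 32-bit
block count, which is kept in 512-byte sectors:

  2^32 sectors * 512 bytes/sector = 2^41 bytes = 2 TiB

The indirect-block tree with 4KB blocks can address somewhat more
than that, which would explain how dd got to 2.7TB before anything
visibly broke.)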

In any case, when we ran fsck, it had plenty of complaints.  Here's a
sample session:

[root@hera /]# fsck /dev/sdb1
fsck 1.35 (28-Feb-2004)
e2fsck 1.35 (28-Feb-2004)
/dev/sdb1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Inode 17072 has illegal block(s).  Clear<y>? yes

Illegal block #69645 (3666359908) in inode 17072.  CLEARED.
Illegal block #69646 (3602993593) in inode 17072.  CLEARED.
Illegal block #69647 (3034171662) in inode 17072.  CLEARED.
Illegal block #69649 (1280104119) in inode 17072.  CLEARED.
Illegal block #69650 (1548422666) in inode 17072.  CLEARED.
Illegal block #69651 (2791173364) in inode 17072.  CLEARED.
Illegal block #69652 (2279495497) in inode 17072.  CLEARED.
Illegal block #69653 (3744042344) in inode 17072.  CLEARED.
Illegal block #69655 (2639235121) in inode 17072.  CLEARED.
Illegal block #69656 (3190320101) in inode 17072.  CLEARED.
Illegal block #69657 (1758759856) in inode 17072.  CLEARED.
Too many illegal blocks in inode 17072.
Clear inode<y>? yes

Inode 17124 has illegal block(s).  Clear<y>? yes

Note that for each inode we cleared, we lost one of the 1GB files
we'd created.  For example, check out this directory listing:

[root@hera newprojects]# ls -lart | more
total 835484832
?---------  ? ?    ?             ?            ? rand_gig.797
?---------  ? ?    ?             ?            ? rand_gig.788
?---------  ? ?    ?             ?            ? rand_gig.786
?---------  ? ?    ?             ?            ? rand_gig.785
?---------  ? ?    ?             ?            ? rand_gig.733
?---------  ? ?    ?             ?            ? rand_gig.653
?---------  ? ?    ?             ?            ? rand_gig.628
?---------  ? ?    ?             ?            ? rand_gig.627
?---------  ? ?    ?             ?            ? rand_gig.621
?---------  ? ?    ?             ?            ? rand_gig.583
?---------  ? ?    ?             ?            ? rand_gig.577
?---------  ? ?    ?             ?            ? rand_gig.559
?---------  ? ?    ?             ?            ? rand_gig.405
?---------  ? ?    ?             ?            ? rand_gig.393
drwxr-xr-x  3 root root       4096 Feb 14 15:29 ..
drwx------  2 root root      16384 Feb 15 15:25 lost+found
-rw-r--r--  1 root root 1073741824 Feb 15 16:37 rand_gig.3
-rw-r--r--  1 root root 1073741824 Feb 15 16:41 rand_gig.4
-rw-r--r--  1 root root 1073741824 Feb 15 16:44 rand_gig.5
...

We never let fsck finish completely, though perhaps it would have
just moved all of these to lost+found.
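
(If we re-run it, an unattended pass would presumably be something
like the following -- -n first for a read-only look at the damage,
then -y to accept every fix:)

  e2fsck -f -n /dev/sdb1    # preview the damage without writing
  e2fsck -f -y /dev/sdb1    # answer yes to every repair prompt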


Anyway, there are obviously some serious problems with this setup,
and I suspect they're on the software side.  I'd appreciate hearing
any thoughts people might have on this.  Are we safe if we just avoid
creating files larger than 2TB in the future?  Or did something else
go wrong?
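
(One untested stopgap we could try: cap file sizes with the shell's
ulimit so nothing grows past 2TB.  In bash the -f limit is counted in
1024-byte blocks, so:

  ulimit -f 2147483648    # 2^31 KB = 2 TiB; writes past it get SIGXFSZ

That only covers processes started from that shell, of course.)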

If creating a >2TB file really is possible at the user level, and can
really trash a filesystem, then this would appear to be a serious bug.

Damian Menscher
-- 
-=#| Physics Grad Student & SysAdmin @ U Illinois Urbana-Champaign |#=-
-=#| 488 LLP, 1110 W. Green St, Urbana, IL 61801 Ofc:(217)333-0038 |#=-
-=#| 4602 Beckman, VMIL/MS, Imaging Technology Group:(217)244-3074 |#=-
-=#| <menscher at uiuc.edu> www.uiuc.edu/~menscher/ Fax:(217)333-9819 |#=-
-=#| The above opinions are not necessarily those of my employers. |#=-



