large ext3 filesystem consistently locking itself read-only

Tue Jul 17 18:07:38 UTC 2007

   We have several large ext3 file system partitions.  One of them sets 
itself to read-only after hitting journal problems.  I understand that's 
a good thing, but obviously I need to correct the problem so that it 
will stop locking itself.  Here are some details:

OS is Red Hat EL4 x86_64 running on a SunFire v40z, kernel is 
2.6.9-42.0.2.ELsmp.  The disk storage in question is external, via fiber 
cable.  The fiber HBA is a QLogic ISP2312 connected to a QLogic SAN 
switch connected to four Apple Xserve RAIDs.  There are 8 individual 
LUNs coming from the four XRaids; they appear on the host as 
/dev/sd[cdefghij].  Those LUNs are put into two LVM volume groups and 
then mounted from logical volumes.

   The partition in question is 8TB, about 92% full at the moment.  One 
oddity about this partition is that it has a subdirectory which contains 
over 2700 symbolic links to other partitions.  Here is the output from 
/var/adm/messages the last time the file system locked itself:

Jul 17 09:01:06  kernel: Info fld=0x0, Current sdd: sense key No Sense
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856796
Jul 17 09:01:06  kernel: Aborting journal on device dm-3.
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3) in start_transaction: Readonly filesystem
Jul 17 09:01:06  kernel: Aborting journal on device dm-3.
Jul 17 09:01:06  kernel: ext3_abort called.
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3): ext3_journal_start_sb: Detected aborted journal
Jul 17 09:01:06  kernel: Remounting filesystem read-only
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3) in start_transaction: Journal has aborted
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856797
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856798
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856799
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3): ext3_free_blocks_sb: bit already cleared for block 786856800
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:06  kernel: EXT3-fs error (device dm-3) in ext3_truncate: Journal has aborted
Jul 17 09:01:07  kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:07  kernel: EXT3-fs error (device dm-3) in ext3_orphan_del: Journal has aborted
Jul 17 09:01:07  kernel: EXT3-fs error (device dm-3) in ext3_reserve_inode_write: Journal has aborted
Jul 17 09:01:07  kernel: EXT3-fs error (device dm-3) in ext3_delete_inode: Journal has aborted
Jul 17 09:01:07  kernel: __journal_remove_journal_head: freeing b_committed_data
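
   To make it easier to see where the complaints land, here is a rough 
Python sketch that pulls the block numbers out of the "bit already 
cleared" messages and works out which block group each one falls in.  
The 4 KiB block size (and so 32768 blocks per group, first data block 0) 
is an assumption on my part, not something read from this file system; 
the real figures should be taken from the "Blocks per group" line of 
dumpe2fs -h.  The names cleared_blocks and block_group are made up for 
this illustration.

```python
import re

# ASSUMPTION: 4 KiB blocks, which gives 32768 blocks per group and a
# first data block of 0. Check `dumpe2fs -h` on the real device first.
BLOCKS_PER_GROUP = 32768
FIRST_DATA_BLOCK = 0

# Matches the tail of the ext3_free_blocks_sb messages shown above.
BLOCK_RE = re.compile(r"bit already cleared for block (\d+)")


def cleared_blocks(log_text):
    """Return the sorted, de-duplicated block numbers named in the log."""
    return sorted({int(n) for n in BLOCK_RE.findall(log_text)})


def block_group(block):
    """Map a block number to its block group under the assumed geometry."""
    return (block - FIRST_DATA_BLOCK) // BLOCKS_PER_GROUP


# The five relevant message tails, copied from the log excerpt above.
SAMPLE = """\
ext3_free_blocks_sb: bit already cleared for block 786856796
ext3_free_blocks_sb: bit already cleared for block 786856797
ext3_free_blocks_sb: bit already cleared for block 786856798
ext3_free_blocks_sb: bit already cleared for block 786856799
ext3_free_blocks_sb: bit already cleared for block 786856800
"""

if __name__ == "__main__":
    for blk in cleared_blocks(SAMPLE):
        print(blk, "-> group", block_group(blk))
```

   Under that assumed geometry all five blocks are consecutive and sit 
in the same block group (24012), which would suggest one damaged region 
of the block bitmap rather than scattered corruption.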

   If I run fsck, it does seem to repair bad blocks and clear inodes, 
but of course for 8TB it takes a long time to run, and the corruption 
comes back later anyway.

   I have considered upgrading the kernel, which could be done.  I think 
part of the problem is the large number of symbolic links on that 
partition, but without evidence it will be difficult to get people to 
change it.  I also don't like the first line in the messages, where 
device sdd gets a "No Sense" response to a SCSI sense key request.

   Any good advice on how to proceed would be appreciated.  I have 
looked at the dumpe2fs and debugfs tools but I don't see how to put them 
to good use in this case.

   Thomas Walker