
Re: Problems with ext3 fs



Following up on this for the sake of posterity. I know what it's like to
follow a promising thread only to see it dead-end without a result :P

I can consistently and immediately reproduce filesystem corruption with
my setup (dual Xeon, ext3, RH 2.4.9-31smp) when the root partition uses
md raid-1 and I run fsck -f on it. I can completely stop all previous
raid devices and reformat the drives, do a fresh Red Hat 7.2 install
with the root drive mirrored, touch "/forcefsck", and on the very next
reboot fsck will fail hard enough to dump into single-user mode. All the
files that were deleted (including temp files) during the previous
uptime show up with various inode problems. Just for kicks, if I run
"fsck -f -n" on the individual partitions rather than /dev/md*, they
show up completely error-free. The fsck on the mirrored root drive
appears to be destructive. After running it 2-4 times, bad things
happen: X doesn't start, logs don't open, etc. If you do not run fsck -f
on the mirrored root filesystem, it appears to work fine.
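
For anyone who wants to repeat the read-only comparison between the md
device and its member partitions, here is a rough sketch of how it could
be scripted. It is not the exact procedure I ran: the function names and
the device names in the usage note are made up for illustration. The
exit-status meanings are the ones listed in the e2fsck(8) man page. Keep
in mind that a read-only check of a mounted filesystem may report
problems that are not really there.

```python
import subprocess

# Exit-status bits as documented in the e2fsck(8) man page.
E2FSCK_STATUS_BITS = {
    1: "errors corrected",
    2: "errors corrected, reboot recommended",
    4: "errors left uncorrected",
    8: "operational error",
    16: "usage or syntax error",
    32: "cancelled by user",
    128: "shared library error",
}


def decode_e2fsck_status(status):
    """Return the conditions encoded in an e2fsck exit status (a bitmask)."""
    if status == 0:
        return ["no errors"]
    return [text for bit, text in sorted(E2FSCK_STATUS_BITS.items())
            if status & bit]


def readonly_check_command(device):
    """Build a forced, read-only check: -f forces it, -n answers 'no'."""
    return ["e2fsck", "-f", "-n", device]


def compare_devices(devices, run=subprocess.run):
    """Check each device read-only and collect the decoded results.

    `run` is injectable so the logic can be exercised without touching
    a real disk; by default it is subprocess.run.
    """
    results = {}
    for device in devices:
        proc = run(readonly_check_command(device))
        results[device] = decode_e2fsck_status(proc.returncode)
    return results
```

Something like compare_devices(["/dev/md0", "/dev/hda1", "/dev/hdc1"])
(hypothetical names for the mirror and its two halves) run as root would
then show whether the md device and the raw partitions disagree.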

This whole investigation actually started because I fielded a mirrored
root system at a location with bad power and a bad UPS, and the machine
went down a few times. When a Red Hat machine shuts down uncleanly it
asks if you want to run a manual fsck (i.e., fsck -f). The person on
site said yes, which killed the box, which left me scratching my head
for weeks over why a power outage would cause filesystem corruption
under ext3.

So, this has been a real eye-opening experience for me. Is there any
way to detect ext3 filesystem corruption on a mirrored root other than
using e2fsck? Should I be confident enough to trust the journal to do
its job?

Thanks for everyone's advice and patience,
-D

On Thu, 2002-04-25 at 16:49, Andreas Dilger wrote:
> On Apr 25, 2002  14:45 -0400, Darrell Michaud wrote:
> > I did not spend too much time dumbing down the DMA, mostly because the
> > consequences of that, even if it worked, would be unacceptable. 
> 
> But at least it would tell you (us) if that is the source of the
> corruption, and then you could start looking for a fix for the DMA
> problem.  If you _do_ have the opportunity to test this, please
> let us know.  In most cases, you probably won't notice the change
> in performance between DMA and non-DMA anyways.  It is only really
> noticeable with very large files (sustained throughput and increased
> CPU usage), because I/O time is dominated by seeking anyways.
> 
> > I did try using a "normal" partition instead of the software raid,
> > however, and that has resolved the corruption completely. I'm still
> > hoping that there's a workaround that would allow the use of software
> > raid with these drives on this chipset (i860) with ext3, but for the 
> > time being a non-raid setup is ok.
> 
> Hmm, that is the first time that I have heard it reported that MD RAID
> is a source of problems on 2.4 kernels.  It is true that on 2.2 kernels
> you should not mix MD RAID and ext3 (or reiserfs), but I _thought_ it
> was OK on 2.4.  Maybe a bug has crept in somewhere.
> 
> > I don't *think* that the corruption was related to the use of fsck,
> > because I only started checking the other 7 systems after extensively
> > testing one. There's always the chance that it could be a cruel truth,
> > however :P
> 
> Well, the reason fsck might cause problems is if you run it on the
> root filesystem (which is mounted at the time fsck is run) and for
> some reason there is a discrepancy between what the kernel sees,
> and what e2fsck reads/writes to the device (i.e. cache coherency
> issues).  Over time you would get pages dropped from RAM and re-read
> from disk, and what e2fsck put on the disk is not what the kernel
> was expecting.  Again, this is just a possible cause of this problem...
> 
> > Yes, you're right. e2fsck does run all the time and "recover the
> > journal" without mention of any problems. However, if I boot from a
> > different raid/ext3-aware boot/root source and perform a full, manual
> > fsck on the root filesystem it will tell me that there are errors that
> > need to be repaired. The amount and severity of the errors grow with
> > time, but are not detected by the "journal" fsck.  
> 
> Well a "journal" fsck is not an fsck at all, really.  It's just e2fsck
> seeing that the journal has data, writing the data to the filesystem,
> and then checking the superblock and finding it marked "clean" (as it
> always is with ext3, unless there is an error).
> 
> > > Both of you are using MD RAID.  Is there a possibility to disable MD
> > > raid on the root device and see if this fixes things?  This is obviously
> > > a lot easier to do on the mirrored root filesystem (change back to using
> > > one of the raw devices instead of the MD device, and disable MD for that
> > > device, including RAID autostart where you need to change the partition
> > > type).
> > 
> > This was going to be my next troubleshooting step.. Wish me luck :P
> 
> Well, you just said above that using a "normal" partition resolves the
> problem completely...
> 
> Cheers, Andreas
> --
> Andreas Dilger
> http://www-mddsp.enel.ucalgary.ca/People/adilger/
> http://sourceforge.net/projects/ext2resize/





