ext3 filesystem corruption - more info (in text)
Sev Binello
sev at bnl.gov
Thu Apr 13 20:40:40 UTC 2006
Sorry about all the HTML.
Resending the last message in plain text.
Sev Binello wrote:
> Andreas Dilger wrote:
>
>>On Apr 13, 2006 10:40 -0400, Sev Binello wrote:
>>[ still HTML-only email, extracting text from HTML is getting dull ]
>>
>>
>>>Since it seemed to mount okay only 3 minutes earlier,
>>>can we assume that it was initially uncorrupted?
>>>Or is that not a valid assumption?
>>>
>>>
>>
>>No, at mount time there is only very cursory checking done of the group
>>descriptors and superblock. The corruption reported appears to be from
>>bad indirect blocks.
>>
>>
>>
>>>Is there anything that we can check, test, etc.?
>>>Any advice or action at this point is better than waiting for the next
>>>filesystem disaster to occur.
>>>
>>>
>>
>>Do you run with write cache enabled on your device? That can potentially
>>cause filesystem corruption even in the face of ext3 journaling, because
>>the journal atomicity guarantees are lost when the device reports a write
>>is complete on disk when it really isn't.
>>
>>
> The RAID system does run with write-back cache enabled.
> I don't believe the actual drives have this enabled, but I'd have to check.
>
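For the "I'd have to check" part, here is a minimal sketch of one way to see what cache mode each SCSI disk reports. It assumes a kernel that exposes sysfs with the `/sys/class/scsi_disk/*/cache_type` attribute, which may well not be the case on the hosts discussed here. The helper names `is_write_back` and `scan` are made up for illustration. Note that this only shows what the device presented to the host advertises; for a RAID LUN it says nothing certain about the caches on the physical drives behind the controller.

```python
from pathlib import Path


def is_write_back(cache_type: str) -> bool:
    """True if a sysfs cache_type string indicates write-back caching."""
    return cache_type.strip().lower().startswith("write back")


def scan(root: str = "/sys/class/scsi_disk") -> dict:
    """Map each SCSI disk address (H:C:T:L) to its reported cache_type.

    Returns an empty dict if the sysfs directory is not present
    (assumption: older kernels or non-Linux hosts).
    """
    result = {}
    base = Path(root)
    if not base.is_dir():
        return result
    for entry in sorted(base.iterdir()):
        try:
            result[entry.name] = (entry / "cache_type").read_text().strip()
        except OSError:
            # Attribute missing or unreadable for this device; skip it.
            continue
    return result


if __name__ == "__main__":
    for addr, ctype in scan().items():
        flag = "WRITE-BACK" if is_write_back(ctype) else "ok"
        print(f"{addr}: {ctype} [{flag}]")
```

Run as root on each host that mounts the filesystem; any device flagged WRITE-BACK is a candidate for the scenario Andreas describes.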
> But we didn't actually lose power on the RAID or the hosts,
> just on the connecting switches, so we lost all communication.
> Presumably, in this situation the controller cache should have been emptied.
> Is my reasoning correct here?
>
> Either way, you are saying it is best to avoid write caching in the future.
>
> Also, in comparing the error messages in the log files,
> I noticed that on the host where the corruption occurred,
> the call to abort the journal didn't seem to actually happen for an hour.
> Does that have any significance?
>
> Mar 25 14:38:52 acnlin83 kernel: Error (-5) on journal on device 08:21
> Mar 25 14:38:52 acnlin83 kernel: Aborting journal on device sd(8,33).
>
> 1hr gap
> Mar 25 15:39:19 acnlin83 kernel: ext3_abort called.
> Mar 25 15:39:19 acnlin83 kernel: EXT3-fs abort (device sd(8,33)): ext3_journal_start: Detected aborted journal
> Mar 25 15:39:19 acnlin83 kernel: Remounting filesystem read-only
> Mar 25 15:39:19 acnlin83 kernel: EXT3-fs error (device sd(8,33)) in start_transaction: Journal has aborted
>
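To put an exact number on the gap, and to confirm that "device 08:21" and "sd(8,33)" name the same device, here is a small sketch that works only from the log lines quoted above. The year 2006 is an assumption (syslog lines carry no year), and the helper names `stamp` and `decode_dev` are made up for illustration.

```python
from datetime import datetime

# The two log lines that bracket the gap, copied from the excerpt above.
LOG = [
    "Mar 25 14:38:52 acnlin83 kernel: Aborting journal on device sd(8,33).",
    "Mar 25 15:39:19 acnlin83 kernel: ext3_abort called.",
]


def stamp(line, year=2006):
    """Parse the leading syslog timestamp; the year is assumed, not logged."""
    return datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")


def decode_dev(hexpair):
    """Turn a hex 'major:minor' pair such as '08:21' into decimal numbers."""
    major, minor = hexpair.split(":")
    return int(major, 16), int(minor, 16)


gap = stamp(LOG[1]) - stamp(LOG[0])
print(gap)                  # 1:00:27
print(decode_dev("08:21"))  # (8, 33)
```

So the two events are 1 hour and 27 seconds apart, and both device notations refer to major 8, minor 33.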
> Thanks again
> -Sev
>
>>Cheers, Andreas
>>--
>>Andreas Dilger
>>Principal Software Engineer
>>Cluster File Systems, Inc.
>>
>>
>>
>
>
> --
>
> Sev Binello
> Brookhaven National Laboratory
> Upton, New York
> 631-344-5647
> sev at bnl.gov
>
>
--
Sev Binello
Brookhaven National Laboratory
Upton, New York
631-344-5647
sev at bnl.gov