ext3 filesystem corruption - more info (in text)

Sev Binello sev at bnl.gov
Thu Apr 13 20:40:40 UTC 2006


Sorry about all the html
Resending last message in text

Sev Binello wrote:
> Andreas Dilger wrote:
> 
>>On Apr 13, 2006  10:40 -0400, Sev Binello wrote:
>>[ still HTML-only email, extracting text from HTML is getting dull ]
>>  
>>
>>>Since it seemed to mount okay only 3 minutes earlier,
>>>can we assume that it was initially uncorrupted?
>>>Or is that not a valid assumption?
>>>    
>>>
>>
>>No, at mount time there is only very cursory checking done of the group
>>descriptors and superblock.  The corruption reported appears to be from
>>bad indirect blocks.
>>
>>  
>>
>>>Is there anything that we can check, test, etc.?
>>>Any advice or action at this point is better than waiting for the next
>>>filesystem disaster to occur.
>>>    
>>>
>>
>>Do you run with write cache enabled on your device?  That can potentially
>>cause filesystem corruption even in the face of ext3 journaling, because
>>the journal atomicity guarantees are lost when the device reports a write
>>is complete on disk when it really isn't.
>>  
>>
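(A side note for anyone wanting to check this from the host: the snippet
below is only a rough sketch. It assumes a Linux kernel that exposes a
queue/write_cache attribute under /sys/block, which newer kernels do and
the kernels discussed in this thread may not. The function name
write_cache_modes is made up for illustration. It reports only what the
kernel believes about each block device; a RAID controller's own
write-back cache has to be checked with the controller's tools.)

```python
from pathlib import Path


def write_cache_modes(sys_block="/sys/block"):
    """Map each block device name to its reported write cache mode.

    Assumption: the kernel provides <sys_block>/<dev>/queue/write_cache,
    containing "write back" or "write through". Devices without that
    attribute are skipped.
    """
    root = Path(sys_block)
    if not root.is_dir():
        return {}
    modes = {}
    for dev in sorted(root.iterdir()):
        attr = dev / "queue" / "write_cache"
        if attr.is_file():
            modes[dev.name] = attr.read_text().strip()
    return modes


if __name__ == "__main__":
    for name, mode in write_cache_modes().items():
        print(f"{name}: {mode}")
```

A device reported as "write back" is one where the kernel expects to
issue cache flushes; "write through" means it does not.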
> The RAID system does run with write-back cache enabled.
> I don't believe the actual drives have this enabled, but I'd have to check.
> 
> But we didn't actually lose power on the RAID or the hosts,
> just the connecting switches, so we lost all communication.
> Presumably, in this situation the controller cache should have been emptied.
> Is my reasoning correct here?
> 
> Either way, you are saying it is best to avoid write caching in the future.
> 
> Also, in comparing the error messages in the log files,
> I noticed that on the host where the corruption occurred,
> the call to abort the journal didn't seem to actually happen for an hour.
> Does that have any significance?
> 
>     Mar 25 14:38:52 acnlin83 kernel: Error (-5) on journal on device 08:21
>     Mar 25 14:38:52 acnlin83 kernel: Aborting journal on device sd(8,33).
> 
>     1hr gap
>     Mar 25 15:39:19 acnlin83 kernel: ext3_abort called.
>     Mar 25 15:39:19 acnlin83 kernel: EXT3-fs abort (device sd(8,33)): ext3_journal_start: Detected aborted journal
>     Mar 25 15:39:19 acnlin83 kernel: Remounting filesystem read-only
>     Mar 25 15:39:19 acnlin83 kernel: EXT3-fs error (device sd(8,33)) in start_transaction: Journal has aborted
>  
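To spell out the arithmetic behind the log excerpt above: assuming the
08:21 in the first message is a hexadecimal major:minor pair, it names
the same device as sd(8,33), and the gap between the two sets of
messages is just over an hour. A small Python sketch follows; the helper
names are made up for illustration, and the year 2006 is an assumption
because the syslog lines do not carry one.

```python
from datetime import datetime


def decode_dev(hex_pair):
    """Turn a hex 'major:minor' string such as '08:21' into decimal (major, minor)."""
    major, minor = (int(part, 16) for part in hex_pair.split(":"))
    return major, minor


def log_gap(first, second, year=2006):
    """Elapsed time between two syslog-style timestamps.

    The year is an assumption (syslog omits it); both stamps are taken
    to fall in the same year.
    """
    fmt = "%Y %b %d %H:%M:%S"
    start = datetime.strptime(f"{year} {first}", fmt)
    end = datetime.strptime(f"{year} {second}", fmt)
    return end - start


print(decode_dev("08:21"))                            # (8, 33)
print(log_gap("Mar 25 14:38:52", "Mar 25 15:39:19"))  # 1:00:27
```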
> Thanks again
> -Sev
> 
>>Cheers, Andreas
>>--
>>Andreas Dilger
>>Principal Software Engineer
>>Cluster File Systems, Inc.
>>
>>  
>>
> 
> 
> -- 
> 
> Sev Binello
> Brookhaven National Laboratory
> Upton, New York
> 631-344-5647
> sev at bnl.gov


-- 

Sev Binello
Brookhaven National Laboratory
Upton, New York
631-344-5647
sev at bnl.gov