bad inode number followed by ext3_abort and remount readonly

Wed Jun 15 13:19:43 UTC 2005

On Tue, Jun 14, 2005 at 10:26:52PM -0400, David Shaw wrote:
> On Tue, Jun 14, 2005 at 07:19:23PM -0400, Andreas Dilger wrote:
> > On Jun 14, 2005  17:14 -0400, David Shaw wrote:
> > > Jun 13 13:58:16 n202 kernel: EXT3-fs error (device sda5): ext3_get_inode_block: bad inode number: 9
> > > 
> > > This particular example is a SATA disk, but it has happened to a
> > > regular old IDE disk as well.  It is always the root partition.  The
> > > bad inode number varies (but is always either 3 or 9).  There are no
> > > other errors about the disk in the log.
> > 
> > The "bad inode number" check is only for inodes inside the "reserved inode"
> > area, namely inum < 12.  The only commonly used (=valid) inode numbers in
> > this range are the root inode (=2) and the journal inode (=8), so I suspect
> > you are getting single-bit memory errors in bit 1, or if the controller
> > is the same that would also be viewed with suspicion.  It is very likely
> > that you are getting other single-bit errors elsewhere but they are harder
> > to notice.
> 
> This is an interesting idea.  Is there any simple way this sort of bit
> flip problem could happen outside of the hardware?  I've had this
> happen on 4 different machines from 3 different vendors, 3 SATA, and 1
> IDE.  It seems almost impossible that it's a memory or controller
> error.
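
A quick illustration of the single-bit theory, as a sketch only.  It
assumes the bad numbers 3 and 9 started out as the valid inodes 2 and
8, which is the pairing Andreas' explanation implies but which the
logs do not prove.  XOR then shows which bit changed.  Counting from
zero, that is bit 0 in both cases (Andreas' "bit 1" presumably counts
from one).  The helper name flipped_bits is made up for this example.

```python
def flipped_bits(expected: int, observed: int) -> list[int]:
    """Return the zero-based positions of the bits that differ."""
    diff = expected ^ observed
    return [i for i in range(diff.bit_length()) if (diff >> i) & 1]


# Assumed pairing: valid reserved inode -> bad inode number seen in the log.
for valid, bad in [(2, 3), (8, 9)]:
    print(f"inode {valid} -> {bad}: flipped bits {flipped_bits(valid, bad)}")
```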

I have to agree with Andreas' analysis.  If you could, please send
some compressed raw e2image dump files (see the man page for e2image,
but basically what we need is: "e2image -r /dev/sda5 - | bzip2 >
sda5.e2i.bz2"), taken after the disk is remounted read-only.  Then
take another e2image dump after the system has rebooted in single user
mode, but *before* running e2fsck on the filesystem.  (That way we can
check to see if the filesystem has changed between reboots --- that
could indicate hardware problems, or in-memory corruption of the
buffer cache due to some kernel bug.)  The e2fsck transcript would
also be useful, of course.
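
If you want a first look yourself before sending the dumps, something
along these lines could compare the two images.  This is a rough
sketch, not an e2fsprogs tool.  It assumes both dumps have already
been decompressed with bunzip2, and it assumes a 4 KiB block size.
The function name differing_blocks is invented for this example.

```python
BLOCK_SIZE = 4096  # assumption: 4 KiB filesystem blocks (dumpe2fs -h reports the real value)


def differing_blocks(path_a, path_b, block_size=BLOCK_SIZE):
    """Yield (block_number, differing_bit_count) for every block that differs."""
    with open(path_a, "rb") as a, open(path_b, "rb") as b:
        block_no = 0
        while True:
            buf_a = a.read(block_size)
            buf_b = b.read(block_size)
            if not buf_a and not buf_b:
                break  # both images exhausted
            if buf_a != buf_b:
                # Treat bytes missing from a shorter image as zeros.
                width = max(len(buf_a), len(buf_b))
                buf_a = buf_a.ljust(width, b"\0")
                buf_b = buf_b.ljust(width, b"\0")
                # Count the individual bits that differ within this block.
                bits = sum(bin(x ^ y).count("1") for x, y in zip(buf_a, buf_b))
                yield block_no, bits
            block_no += 1
```

Calling differing_blocks() on the image taken after the read-only
remount and the one taken after the reboot lists the blocks that
changed in between.  A block that differs by exactly one bit would
fit the bit-flip theory; large differences point elsewhere.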

The only other possible explanation I can imagine, beyond a hardware
problem or some strange kernel bug that no one else is seeing, is a
bug in some program that was directly accessing the disk drive: for
example, a bootloader that attempted to update some state and wrote
it to the wrong place on disk, or some other program doing direct
disk accesses that was always corrupting the same block(s) in the
same way.

Good luck,

						- Ted