[Linux-cluster] GFS 1.04: fatal: assertion "x <= length" failed

Mon Oct 8 20:26:56 UTC 2007

On Mon, 2007-10-08 at 21:33 +0200, Frederik Schueler wrote:
> Hello,
> 
> I just got a crash on a gfs share:
> 
> GFS: fsid=beta:helium.1: fatal: assertion "x <= length" failed
> GFS: fsid=beta:helium.1:   function = blkalloc_internal
> GFS: fsid=beta:helium.1:   file = /usr/src/modules/redhat-cluster/gfs/gfs/rgrp.c, line = 1458
> GFS: fsid=beta:helium.1:   time = 1191842568
> GFS: fsid=beta:helium.1: about to withdraw from the cluster
> GFS: fsid=beta:helium.1: waiting for outstanding I/O
> GFS: fsid=beta:helium.1: telling LM to withdraw
> lock_dlm: withdraw abandoned memory
> GFS: fsid=beta:helium.1: withdrawn
> 
> 
> the system is running gfs 1.04 with linux 2.6.21.
> 
> after the crash, I rebooted the concerned node and run an fsck on
> another node to check the filesystem in question, and now it has a dozen
> of lost files in l+f.
> 
> How can I debug the issue? 
> 
> Best regards
> Frederik Schüler

Hi Frederik,

This is odd.  What it means is this:  GFS was searching for a free block
to allocate.  The resource group ("RG"--not to be confused with
rgmanager's resource groups) indicated at least one free block for that
section of the file system, but there were no free blocks to be found in
the bitmap for that section (a direct contradiction).  Therefore, the
file system was determined to be corrupt.
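
To illustrate the contradiction, here is a simplified sketch in Python.  It
is not the actual GFS code (that is C, in rgrp.c); the names ResourceGroup,
alloc_block and FilesystemInconsistent are made up for this example, and the
bitmaps are modelled as plain lists with one entry per block, which is an
assumption and not the on-disk layout.

```python
FREE, USED = 0, 1


class FilesystemInconsistent(Exception):
    """Raised when the free-block counter and the bitmaps disagree."""


class ResourceGroup:
    def __init__(self, free_count, bitmaps):
        self.free_count = free_count  # what the RG header claims
        self.bitmaps = bitmaps        # list of bitmaps, one state per block


def alloc_block(rg):
    """Return (bitmap_index, offset) of a newly allocated block, or None."""
    if rg.free_count == 0:
        return None  # RG is honestly full; the caller tries another RG

    # The header claims there is a free block, so one of the bitmaps
    # must contain a FREE entry.
    for bitmap_index, bitmap in enumerate(rg.bitmaps):
        for offset, state in enumerate(bitmap):
            if state == FREE:
                bitmap[offset] = USED
                rg.free_count -= 1
                return (bitmap_index, offset)

    # Every bitmap was searched and nothing was free: the counter and
    # the bitmaps contradict each other.
    raise FilesystemInconsistent(
        "RG claims %d free block(s) but the bitmaps have none" % rg.free_count
    )
```

In this sketch the exception plays the role of the failed assertion: the
search ran past the last bitmap without finding the free block that the
counter promised.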

It's nearly impossible to say how this could have happened.  Here are a
few possibilities:

(1) It's possible that some rogue kernel module overwrote the bitmap memory.

(2) This can also happen if gfs_fsck is run on a file system that is already
mounted from another node.

(3) Another possibility is a hardware problem with your media--the hard
drives, FC switch, HBAs, etc.  This could happen, for example, if GFS read
the bitmap(s) from disk and the disk returned the wrong information.  We've
seen a lot of that, and the best thing to do is test the media (but it's a
tedious and sometimes destructive task).  For more information on that, see:

http://sources.redhat.com/cluster/faq.html#gfs_corruption

(4) It's also possible, although unlikely, that this is a GFS bug.  As far
as I know, you're the only person to report such a thing.
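
For possibility (3), a first, non-destructive pass at testing the media could
look something like the Python sketch below.  It writes a known pattern to a
scratch file and reads it back, reporting any block that differs.  The
function names and the 4096-byte block size are assumptions made for this
example, not part of any GFS tool.  Note that a read straight after a write
may be served from the page cache, so a meaningful test would read the file
back from a different node or after the cache has been dropped.

```python
import hashlib
import os

BLOCK_SIZE = 4096  # assumed block size, for illustration only


def make_block(index, block_size=BLOCK_SIZE):
    """Build a deterministic block whose contents depend on its index."""
    seed = hashlib.sha256(str(index).encode()).digest()
    repeats = block_size // len(seed) + 1
    return (seed * repeats)[:block_size]


def write_pattern(path, num_blocks, block_size=BLOCK_SIZE):
    """Write num_blocks pattern blocks to path and flush them to storage."""
    with open(path, "wb") as f:
        for index in range(num_blocks):
            f.write(make_block(index, block_size))
        f.flush()
        os.fsync(f.fileno())


def find_bad_blocks(path, num_blocks, block_size=BLOCK_SIZE):
    """Return the indexes of blocks whose contents differ from the pattern."""
    bad = []
    with open(path, "rb") as f:
        for index in range(num_blocks):
            if f.read(block_size) != make_block(index, block_size):
                bad.append(index)
    return bad
```

Run write_pattern() against a scratch file on the suspect storage, then
find_bad_blocks() with the same arguments; an empty list means every block
came back as written.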

If it really is a GFS bug, the best way to solve it (and sometimes the
only way to solve it) is to give us a way to recreate the corruption
using a clean file system and a program that reproduces it.

If that's not possible, you could describe what was happening to the
file system at the time of failure, in as much detail as possible, and
we can do some experiments here.  For example: were there lots of
file renames going on?  Directory renames?  File creates?  What kind of
I/O was happening to the file system at the time?  Be aware, though, that
without a way to reproduce the problem, these experiments often turn up
nothing.

If you had not run gfs_fsck, we might have been able to tell a little
bit more about what happened from the contents of the journals.
For example, in RHEL5 and equivalent, you can use gfs2_edit to save
off the file system metadata and send it in for analysis
(gfs2_edit can operate on gfs1 file systems as well as gfs2).  However,
since gfs_fsck clears the journals, that information is now long gone.

I hope this helps.

Regards,

Bob Peterson
Red Hat Cluster Suite