[Linux-cluster] 'ls' makes GFS2 withdraw

Theophanis Kontogiannis theophanis_kontogiannis at yahoo.gr
Tue Mar 17 14:31:11 UTC 2009


Hello Steve and All,

Running gfs2_fsck produced thousands upon thousands of errors like the ones below.
However, the files that previously looked corrupted in the 'ls' output are now shown with correct properties and details (size, date, etc.). The file system also looks stable, with no further errors.

.......................
Ondisk and fsck bitmaps differ at block 86432027 (0x526d91b)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 86432028 (0x526d91c)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 86432029 (0x526d91d)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 86432030 (0x526d91e)
Ondisk status is 1 (Data) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
Ondisk and fsck bitmaps differ at block 86440684 (0x526faec)
Ondisk status is 2 (Invalid) but FSCK thinks it should be 0 (Free)
Metadata type is 0 (free)
Succeeded.
RG #86376522 (0x526004a) free count inconsistent: is 45385 should be 48898
Inode count inconsistent: is 3 should be 1
Resource group counts updated
Pass5 complete
Writing changes to disk
gfs2_fsck complete
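
For reference, the repair pass was along these lines (the volume and mount
point names below are illustrative, not my real ones):

  # Unmount on BOTH nodes first; gfs2_fsck must never run on a mounted fs.
  umount /mnt/gfs2-00
  gfs2_fsck -y /dev/mapper/vg0-gfs2--00    # -y answers "yes" to every fix
  mount -t gfs2 /dev/mapper/vg0-gfs2--00 /mnt/gfs2-00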


Well, I guess the root cause of the corruption was that the nodes were fencing each other (i.e. power-cycling each other) while the file system was mounted and some services had already started (maybe due to <fence_daemon clean_start="1" post_fail_delay="0" post_join_delay="3"/> in cluster.conf?).
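
For comparison, something like the following fragment might avoid that
crossfire; the delay values here are illustrative examples, not tested
recommendations:

  <!-- clean_start="0" re-enables fencing of unclean nodes at startup;
       the delays give a rebooted node time to rejoin the fence domain
       before anything gets fenced again.  Values are examples only. -->
  <fence_daemon clean_start="0" post_fail_delay="5" post_join_delay="30"/>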

It took me some time to figure out how to debug the cluster without any services starting (including the mounting of the file system).
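
In case it saves someone else the search: the usual way on CentOS 5 is to
disable the cluster services at boot and then bring them up by hand while
watching the logs, roughly like this (service names as shipped there):

  chkconfig cman off        # cluster manager and fencing
  chkconfig clvmd off       # clustered LVM
  chkconfig gfs2 off        # init script that mounts GFS2 fstab entries
  chkconfig rgmanager off   # cluster resource/service manager
  # ...reboot, then start them one at a time:
  service cman start
  service clvmd start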

Thank you all for your time.

Theophanis Kontogiannis


> -----Original Message-----
> From: linux-cluster-bounces at redhat.com [mailto:linux-cluster-
> bounces at redhat.com] On Behalf Of Steven Whitehouse
> Sent: Monday, March 16, 2009 5:24 PM
> To: linux clustering
> Subject: Re: [Linux-cluster] 'ls' makes GFS2 withdraw
> 
> Hi,
> 
> Please do not use GFS2 on CentOS 5.2; it is rather old. Did you try
> running fsck.gfs2?
> 
> The results you see look like the readdir() call has worked, but the
> stat() call on the directory entry has failed. I'd suggest using
> Fedora, at least until CentOS 5.3 is available.
> 
> Steve.
> 
> On Mon, 2009-03-16 at 17:23 +0200, Theophanis Kontogiannis wrote:
> > Hello all,
> >
> > I have CentOS 5.2, kernel 2.6.18-92.1.22.el5.centos.plus,
> > gfs2-utils-0.1.44-1.el5_2.1.
> >
> > The cluster is two nodes, using DRBD 8.3.2 as the shared block
> > device, with CLVM on top of it and GFS2 on top of that.
> >
> > After an 'ls' in a directory within the GFS2 file system I got the
> > following errors:
> >
> > …………………
> >
> > GFS2: fsid=tweety:gfs2-00.0: fatal: invalid metadata block
> > GFS2: fsid=tweety:gfs2-00.0:   bh = 522538 (magic number)
> > GFS2: fsid=tweety:gfs2-00.0:   function = gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line = 332
> > GFS2: fsid=tweety:gfs2-00.0: about to withdraw this file system
> > GFS2: fsid=tweety:gfs2-00.0: telling LM to withdraw
> > GFS2: fsid=tweety:gfs2-00.0: withdrawn
> >
> > Call Trace:
> >  [<ffffffff885c2146>] :gfs2:gfs2_lm_withdraw+0xc1/0xd0
> >  [<ffffffff800639de>] __wait_on_bit+0x60/0x6e
> >  [<ffffffff80014f46>] sync_buffer+0x0/0x3f
> >  [<ffffffff80063a58>] out_of_line_wait_on_bit+0x6c/0x78
> >  [<ffffffff8009d0ca>] wake_bit_function+0x0/0x23
> >  [<ffffffff885d3f7f>] :gfs2:gfs2_meta_check_ii+0x2c/0x38
> >  [<ffffffff885c5a06>] :gfs2:gfs2_meta_indirect_buffer+0x104/0x15e
> >  [<ffffffff885c095a>] :gfs2:gfs2_inode_refresh+0x22/0x2ca
> >  [<ffffffff8009d0ca>] wake_bit_function+0x0/0x23
> >  [<ffffffff885bfd9c>] :gfs2:inode_go_lock+0x29/0x57
> >  [<ffffffff885bef04>] :gfs2:glock_wait_internal+0x1d4/0x23f
> >  [<ffffffff885bf11d>] :gfs2:gfs2_glock_nq+0x1ae/0x1d4
> >  [<ffffffff885cb053>] :gfs2:gfs2_lookup+0x58/0xa7
> >  [<ffffffff885cb04b>] :gfs2:gfs2_lookup+0x50/0xa7
> >  [<ffffffff800226dd>] d_alloc+0x174/0x1a9
> >  [<ffffffff8000cbff>] do_lookup+0xe5/0x1e6
> >  [<ffffffff80009fac>] __link_path_walk+0xa01/0xf42
> >  [<ffffffff800c4fe7>] zone_statistics+0x3e/0x6d
> >  [<ffffffff8000e7cd>] link_path_walk+0x5c/0xe5
> >  [<ffffffff885bdd6f>] :gfs2:gfs2_glock_put+0x26/0x133
> >  [<ffffffff8000c99e>] do_path_lookup+0x270/0x2e8
> >  [<ffffffff80012336>] getname+0x15b/0x1c1
> >  [<ffffffff80023741>] __user_walk_fd+0x37/0x4c
> >  [<ffffffff8003ed91>] vfs_lstat_fd+0x18/0x47
> >  [<ffffffff8002a9d3>] sys_newlstat+0x19/0x31
> >  [<ffffffff8005d229>] tracesys+0x71/0xe0
> >  [<ffffffff8005d28d>] tracesys+0xd5/0xe0
> > …………………….
> >
> > Obviously 'ls' was not the cause of the problem, but it triggered
> > the events.
> >
> > From the other node I can still access the directory on which the
> > 'ls' triggered the above. The directory is full of files like this:
> >
> > ?--------- ? ?     ?          ?            ? sched_reply
> >
> > Almost 50% of the files are shown like that by 'ls'.
> >
> > The questions are:
> >
> > 1.      Is this a (new) GFS2 bug?
> > 2.      Is this a recoverable problem (and how)?
> > 3.      After a GFS2 file system gets withdrawn, how do we make the
> > node use it again, without rebooting?
> >
> > Thank you all for your time.
> >
> > Theophanis Kontogiannis
> 
> --
> Linux-cluster mailing list
> Linux-cluster at redhat.com
> https://www.redhat.com/mailman/listinfo/linux-cluster