[Linux-cluster] gfs2.fsck bug

Wed Dec 22 19:15:43 UTC 2010

My responses inline:

> hi,
> 
> our gfs2 datasets are down; when i try to do a mount i get:
> 
> [root@DBT1 ~]# mount -a
> /sbin/mount.gfs2: node not a member of the default fence domain
> /sbin/mount.gfs2: error mounting lockproto lock_dlm
> /sbin/mount.gfs2: node not a member of the default fence domain
> /sbin/mount.gfs2: error mounting lockproto lock_dlm
> /sbin/mount.gfs2: node not a member of the default fence domain
> /sbin/mount.gfs2: error mounting lockproto lock_dlm
> /sbin/mount.gfs2: node not a member of the default fence domain
> /sbin/mount.gfs2: error mounting lockproto lock_dlm
> /sbin/mount.gfs2: node not a member of the default fence domain
> /sbin/mount.gfs2: error mounting lockproto lock_dlm
> /sbin/mount.gfs2: node not a member of the default fence domain
> /sbin/mount.gfs2: error mounting lockproto lock_dlm

This makes me think the node trying to mount your GFS2 file system has not joined the fence domain, which usually means it is not currently a full member of the cluster.  Run cman_tool services on all nodes; every group should be in the state "none".  If any is not, there is probably a membership issue.
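
If you want to script that check rather than eyeball it, here is a rough Python sketch. It assumes the usual column layout of cman_tool services (type, level, name, id, state, with a bracketed member list on the following row). The function name and the sample text are made up for illustration and are not output from your cluster.

```python
def groups_not_settled(services_text):
    """Return (type, name, state) for every group whose state is not 'none'."""
    pending = []
    for line in services_text.splitlines():
        fields = line.split()
        # Skip blank lines, the header row, and the "[1 2 3]" member-list rows.
        if len(fields) < 5 or fields[0] == "type" or fields[0].startswith("["):
            continue
        gtype, _level, name, _gid, state = fields[:5]
        if state.lower() != "none":
            pending.append((gtype, name, state))
    return pending


# Hypothetical sample in the assumed layout, NOT taken from the poster's nodes.
SAMPLE = """\
type             level name       id       state
fence            0     default    00010001 none
[1 2 3]
dlm              1     clvmd      00020001 FAIL_ALL_STOPPED
[1 2 3]
"""

if __name__ == "__main__":
    for gtype, name, state in groups_not_settled(SAMPLE):
        print(f"{gtype} group {name!r} is in state {state}")
```

On a real node you would feed it the captured output of cman_tool services instead of SAMPLE. An empty result means every group has settled.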

> 
> our cluster.conf is consistent across all devices (listed below).
> 
> so i thought an fsck would fix this, then i get:
> 
> [root@DBT1 ~]# fsck.gfs2 -fnp /dev/NEWvg/NEWlvTemp
> (snippage)
> RG #4909212 (0x4ae89c) free count inconsistent: is 16846 should be 17157
> Resource group counts updated
> Unlinked block 8639983 (0x83d5ef) bitmap fixed.
> RG #8639976 (0x83d5e8) free count inconsistent: is 65411 should be 65412
> Inode count inconsistent: is 20 should be 19
> Resource group counts updated
> Pass5 complete
> The statfs file is wrong:
> 
> Current statfs values:
> blocks: 43324224 (0x2951340)
> free: 38433917 (0x24a747d)
> dinodes: 21085 (0x525d)
> 
> Calculated statfs values:
> blocks: 43324224 (0x2951340)
> free: 38466752 (0x24af4c0)
> dinodes: 21083 (0x525b)
> The statfs file was fixed.
> 
> gfs2_fsck: bad write: Bad file descriptor on line 44 of file buf.c
> 
> i read in https://bugzilla.redhat.com/show_bug.cgi?id=457557 that there
> is some way of fixing this with gfs2_edit - are there docs available?

There is a development version of fsck.gfs2 that I have used successfully to fix several issues.  It can be found at:

http://people.redhat.com/rpeterso/Experimental/RHEL5.x/gfs2/

I can't comment on the gfs2_edit procedure.  Maybe someone else on the list can say whether it is a better option than the experimental fsck.gfs2.

> 
> as we've been having fencing issues, i removed two servers (DBT2/DBT3)
> from the cluster fencing, and they are not active at this time. would
> this cause the mount issues?

I see you removed the fence devices from:

 <clusternode name="DBT3" nodeid="5" votes="1">
   <fence>
     <method name="1"/>
   </fence>
 </clusternode>

If there was a fence event for this node, I could see that as a cause for not being able to mount GFS2.  Any time a heartbeat is lost, all cluster resources remain frozen until there is a successful fence.  Without a fence device the fence cannot succeed, so you should see failed fence messages all through the logs.
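
To spot nodes in that situation without reading the whole file, something like the following Python sketch could work. It lists every clusternode whose fence methods reference no device defined under fencedevices. The function name is my own, and CONF is a trimmed-down copy of two nodes from your posted cluster.conf, used only as test input.

```python
import xml.etree.ElementTree as ET


def nodes_without_fencing(conf_xml):
    """Return names of cluster nodes that have no usable fence device."""
    root = ET.fromstring(conf_xml)
    # Names of all fence devices actually defined in the config.
    defined = {d.get("name") for d in root.findall("./fencedevices/fencedevice")}
    unfenced = []
    for node in root.findall("./clusternodes/clusternode"):
        # Device names referenced by any fence method of this node.
        used = [d.get("name") for d in node.findall("./fence/method/device")]
        if not any(name in defined for name in used):
            unfenced.append(node.get("name"))
    return unfenced


# Trimmed excerpt of the posted cluster.conf (two nodes only), for illustration.
CONF = """\
<cluster alias="DBT0_DBT1_HA" config_version="85" name="DBT0_DBT1_HA">
  <clusternodes>
    <clusternode name="DBT1" nodeid="2" votes="3">
      <fence><method name="1"><device name="DBT1_ILO2"/></method></fence>
    </clusternode>
    <clusternode name="DBT3" nodeid="5" votes="1">
      <fence><method name="1"/></fence>
    </clusternode>
  </clusternodes>
  <fencedevices>
    <fencedevice agent="fence_ilo" hostname="192.168.200.141" login="foo"
                 name="DBT1_ILO2" passwd="foo"/>
  </fencedevices>
</cluster>
"""

if __name__ == "__main__":
    for name in nodes_without_fencing(CONF):
        print(f"{name} has no fence device configured")
```

Run against the full file (read from /etc/cluster/cluster.conf, assuming the default location), I would expect it to report DBT2 and DBT3, since both have an empty method element.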

> tia for any advice / guidance.
> 
> yvette
> 
> our cluster.conf:
> 
> <?xml version="1.0"?>
> <cluster alias="DBT0_DBT1_HA" config_version="85" name="DBT0_DBT1_HA">
> <fence_daemon clean_start="0" post_fail_delay="0"
> post_join_delay="1"/>
> <clusternodes>
> <clusternode name="DBT0" nodeid="1" votes="3">
> <fence>
> <method name="1">
> <device name="DBT0_ILO2"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="DBT1" nodeid="2" votes="3">
> <fence>
> <method name="1">
> <device name="DBT1_ILO2"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="DEV" nodeid="3" votes="3">
> <fence>
> <method name="1">
> <device name="DEV_ILO2"/>
> </method>
> </fence>
> </clusternode>
> <clusternode name="DBT2" nodeid="4" votes="1">
> <fence>
> <method name="1"/>
> </fence>
> </clusternode>
> <clusternode name="DBT3" nodeid="5" votes="1">
> <fence>
> <method name="1"/>
> </fence>
> </clusternode>
> </clusternodes>
> <cman/>
> <fencedevices>
> <fencedevice agent="fence_ilo" hostname="192.168.200.140" login="foo"
> name="DBT0_ILO2" passwd="foo"/>
> <fencedevice agent="fence_ilo" hostname="192.168.200.150" login="foo"
> name="DEV_ILO2" passwd="foo"/>
> <fencedevice agent="fence_ilo" hostname="192.168.200.141" login="foo"
> name="DBT1_ILO2" passwd="foo"/>
> </fencedevices>
> <rm>
> <failoverdomains/>
> <resources>
> <clusterfs device="/dev/foo0vg/foo0vol002" force_unmount="1"
> fsid="19150" fstype="gfs2" mountpoint="/foo0vol002" name="foo0vol002"
> options="data=writeback" self_fence="0"/>
> <clusterfs device="/dev/foo0vg/foo0lvvol003" force_unmount="1"
> fsid="51633" fstype="gfs2" mountpoint="/foo0vol003" name="foo0vol003"
> options="data=writeback" self_fence="0"/>
> <clusterfs device="/dev/foo0vg/foo0lvvol004" force_unmount="1"
> fsid="36294" fstype="gfs2" mountpoint="/foo0vol004" name="foo0vol004"
> options="data=writeback" self_fence="0"/>
> <clusterfs device="/dev/foo0vg/foo0vol005" force_unmount="1"
> fsid="48920" fstype="gfs2" mountpoint="/foo0vol005" name="foo0vol005"
> options="noatime,noquota,data=writeback" self_fence="0"/>
> <clusterfs device="/dev/foo1vg/foo1lvvol000" force_unmount="1"
> fsid="24235" fstype="gfs2" mountpoint="/foo0vol000" name="foo0vol000"
> options="data=ordered" self_fence="0"/>
> <clusterfs device="/dev/foo1vg/foo1lvvol001" force_unmount="1"
> fsid="34088" fstype="gfs2" mountpoint="/foo0vol001" name="foo0vol001"
> options="data=ordered" self_fence="0"/>
> </resources>
> </rm>
> <totem consensus="4800" join="60" token="10000"
> token_retransmits_before_loss_const="20"/>
> <dlm plock_ownership="1" plock_rate_limit="0"/>
> <gfs_controld plock_rate_limit="0"/>
> </cluster>
> 