[Linux-cluster] GFS2 fatal: invalid metadata block

Steven Whitehouse swhiteho at redhat.com
Mon Sep 28 09:15:44 UTC 2009


Hi,

You seem to have a number of issues here....

On Fri, 2009-09-25 at 12:31 -0600, Kai Meyer wrote:
> Sorry for the slow response. We ended up (finally) purchasing some RHEL 
> licenses to try and get some phone support for this problem, and came up 
> with a plan to salvage what we could. I'll try to offer a brief history 
> of the problem in the hope that you can help me understand this issue a 
> little
> better.
> I've posted the relevant logfile entries to the events described here : 
> http://kai.gnukai.com/gfs2_meltdown.txt
> All the nodes send syslog to a remote server named pxe, so the combined 
> syslog for all the nodes plus the syslog server is here: 
> http://kai.gnukai.com/messages.txt
> We started with a 4 node cluster (nodes 1, 2, 4, 5). The GFS2 filesystem 
> was created with the latest that CentOS 5.3 had to offer when it was 
> released. Node 3 was off at the time the errors occurred, and not part 
> of the cluster.
> The first issue I can recover from syslog is from node 5 
> (192.168.100.105) on Sep 8 14:11:27: a 'fatal: invalid metadata block' 
> error that resulted in the file system being withdrawn.

Ok. So let's start with that message. Once that message has appeared, it
means that something on disk has been corrupted. The only way in which
that can be fixed is to unmount on all nodes and run fsck.gfs2 on the
filesystem. The other nodes will only carry on working until they too
read the same erroneous block.
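
For anyone following along, the procedure is roughly as follows. This is
only a sketch: the mount point and device path below are examples rather
than your actual ones, and the filesystem must be unmounted on every node
before fsck.gfs2 is started.

    # On each node that has the filesystem mounted:
    umount /mnt/xenclusterfs1

    # Then, from one node only, check and repair the filesystem:
    fsck.gfs2 -y /dev/sanvg/xenclusterfs1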

These issues are usually very tricky to track down. The main reason is
that the event which caused the corruption usually lies long in the past
by the time the issue is discovered. Often there has been so much
activity in the meantime that it's impossible to attribute it to any
particular event.

That said, we are very interested to receive reports of such corruption
in case we can figure out the common factors between such reports.

The current behaviour of withdrawing a node in the event of a disk error
is not ideal. In reality there is often little other choice though, as
letting the node continue to operate risks even greater corruption, since
the node may well be working from data that is already incorrect.

On recent upstream kernels we've tried to be a bit better about handling
such errors by turning off use of individual resource groups in some
cases, so that at least some filesystem activity can carry on.

> Next, node 4 (192.168.100.104) hit a 'fatal: filesystem consistency 
> error' that also resulted in the file system being withdrawn. On the 
> systems themselves, any attempt to access the filesystem would result in 
> an I/O error response. At the prospect of rebooting 2 of the 4 nodes in 
> my cluster, I brought node 3 (192.168.100.103) online first. Then I 
> power cycled nodes 4 and 5 one at a time and let them come back online. 
> These nodes are running Xen, so I started to bring the VMs that were on 
> nodes 4 and 5 online on nodes 3-5 after all 3 had joined the cluster.
> Shortly thereafter, node 3 encountered the 'fatal: invalid metadata 
> block' error and withdrew the file system. Then node 2 (.102) encountered 
> 'fatal: invalid metadata block' as well, and withdrew the filesystem. So 
> I rebooted them.
> During their reboot, nodes 1 (.101) and 5 hit the same 'fatal: invalid 
> metadata block' error. I waited for nodes 2 and 3 to come back online to 
> preserve the cluster. At this point, node 4 was the only node that still 
> had the filesystem mounted. After I had rebooted the other 4 nodes, none 
> of them could mount the filesystem after joining the cluster, and node 4 
> was spinning on the error:
> Sep  8 16:54:22 192.168.100.104 kernel: GFS2: 
> fsid=xencluster1:xenclusterfs1.0: jid=4: Trying to acquire journal lock...
> Sep  8 16:54:22 192.168.100.104 kernel: GFS2: 
> fsid=xencluster1:xenclusterfs1.0: jid=4: Busy
> It wasn't until this point that we suspected the SAN. We discovered that 
> the SAN had marked a drive as "failed" but did not remove it from the 
> array and begin to rebuild on the hot spare. When we physically removed 
> the failed drive, the hot spare was picked up and put into the array.
> The VMs on node 4 were the only ones "running" but they had all crashed 
> because their disk was unavailable. I decided to reboot all the nodes to 
> try and re-establish the cluster. We were able to get all the VMs turned 
> back on, and we thought we were out of the dark, with the exception of 
> the high level of filesystem corruption we caused inside 30% of the VMs' 
> filesystems. We ran them through their ext3 filesystem checks, and got 
> them all running again.
> 
ext3 or gfs2? I assume you mean the latter.

> Then, at the time I sent the original email, we were encountering the 
> same invalid metadata block errors on the VMs at different points.
> 
> With Red Hat on the phone, we decided to migrate as much data as we could 
> from the original production SAN to a new SAN, and bring the VMs online 
> on the new SAN. There were a total of 3 VM disk images that would not 
> copy because they would trigger the invalid metadata block error every 
> time. After the migration, we tried 3 filesystem checks, all of which 
> failed, leaving the fsck_dlm mechanism configured on the filesystem. We 
> were able to override the lock with the instructions here:
> http://kbase.redhat.com/faq/docs/DOC-17402
> 
Was that reported as a bugzilla? fsck.gfs2 should certainly not fail in
that way. Although, bearing in mind what you've said about bad hardware,
that might be the reason.
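
For the archives: if what was left behind is the usual fsck_dlm lock-out,
then (if I remember the procedure in that document correctly) the override
boils down to resetting the locking protocol recorded in the superblock.
A rough sketch only; the KB article above is the authoritative reference,
the device path is an example, and the filesystem must be unmounted on
all nodes first:

    # Show the locking protocol currently recorded in the superblock:
    gfs2_tool sb /dev/sanvg/xenclusterfs1 proto

    # If an interrupted fsck has left it set to fsck_dlm, put it back:
    gfs2_tool sb /dev/sanvg/xenclusterfs1 proto lock_dlm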

> We were able to remount the gfs2 filesystem, but without any 
> improvement. If we tried to copy those same files, it would withdraw the 
> filesystem. We were able to mount the disk images and recover some 
> files, and we feel like we got lucky that none of those files also 
> triggered the filesystem withdraw.
> 
> Today, we feel fairly confident that the underlying issue began with the 
> disk being marked as failed, but not being removed from the array. We've 
> contacted the hardware vendor of the SAN, but the only response they 
> offered was, "That shouldn't have happened, and it shouldn't happen again."
> 
> 
> I am very interested in your response, but at this point there isn't any 
> urgency. The old production SAN is in the lab, and we are running SMART 
> tests on each of the disks to see if any of the disks are salvageable. 
> The new SAN we put into production has all new drives (different make and 
> model), and we hope we don't encounter any further issues.
> 
> The last issue I want to investigate is phasing out all my CentOS 5.3 
> servers, and installing RHEL 5.4 servers one at a time, so I don't have 
> to take down the cluster. The intent is to live migrate all the VMs to 
> the RHEL 5.4 servers to keep the availability on the VMs as high as 
> possible, but that's another topic I'll probably post about later after 
> some testing in the lab.
> 
> -Kai Meyer
> 
In theory that should work. In reality I'd suggest testing carefully
before attempting it. Using RHEL 5.4 is certainly a better option; I've
no idea how many of our bug fixes CentOS has picked up.
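
If you do go the live-migration route, the per-guest step on Xen is along
these lines. Just a sketch: the domain and destination host names are made
up, and xend on the destination has to be configured to accept relocation
requests.

    # Move a running guest off a CentOS 5.3 node onto a new RHEL 5.4 node;
    # the disk image stays put on the shared GFS2 storage:
    xm migrate --live guest01 node6.example.com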

Steve.





> Steven Whitehouse wrote:
> > Hi,
> >
> > On Sat, 2009-09-19 at 05:16 -0600, Kai Meyer wrote:
> >   
> >> I have a 5 node cluster running kernel 2.6.18-128.1.6.el5xen and 
> >> gfs2-utils-0.1.53-1.el5_3.3 . Twice in 10 days, each node in my cluster 
> >> has failed with the same message in /var/log/messages. dmesg reports the 
> >> same errors, and on some nodes there are no other entries previous to 
> >> the invalid metadata block error.
> >>
> >> I would like to know what issues can trigger such an event. If it is 
> >> more helpful for me to provide more information, I will be happy to, I'm 
> >> just not sure what other information you would consider relevant.
> >>
> >> Thank you for your time,
> >> -Kai Meyer
> >>
> >>     
> > It means that the kernel was looking for an indirect block, but instead
> > found something that was not an indirect block. The only way to fix this
> > is with fsck (after unmounting on all nodes) otherwise the issue is
> > likely to continue to occur each time you access the particular inode
> > with the problem.
> >
> > There have been a couple of reports of this (or very similar) issues
> > recently. The problem in each case is that the original issue probably
> > happened some time before it triggered the message which you've seen.
> > That means that it is very tricky to figure out exactly what the cause
> > is.
> >
> > I'd be very interested to know whether this filesystem was a newly
> > created gfs2 filesystem or an upgraded gfs1 filesystem. Also, whether
> > there have been any other issues, however minor, which might have caused
> > a node to be rebooted or fenced since the filesystem was created? Also,
> > any other background information about the type of workload that was
> > being run on the filesystem would be helpful too.
> >
> > Steve.
> >
> >
> >   
> >> Sep 19 02:02:06 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.1: fatal: invalid metadata block
> >> Sep 19 02:02:06 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.1:   bh = 567447963 (magic number)
> >> Sep 19 02:02:06 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.1:   function = 
> >> gfs2_meta_indirect_buffer, file = fs/gfs2/meta_io.c, line
> >> = 334
> >> Sep 19 02:02:06 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.1: about to withdraw this file system
> >> Sep 19 02:02:06 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.1: telling LM to withdraw
> >> Sep 19 02:02:07 192.168.100.104 kernel: GFS2: 
> >> fsid=xencluster1:xenclusterfs1.1: withdrawn
> >> Sep 19 02:02:07 192.168.100.104 kernel: 
> >> Sep 19 02:02:07 192.168.100.104 kernel: Call Trace:
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff885154ce>] 
> >> :gfs2:gfs2_lm_withdraw+0xc1/0xd0
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff80262907>] 
> >> __wait_on_bit+0x60/0x6e
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff80215788>] 
> >> sync_buffer+0x0/0x3f
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff80262981>] 
> >> out_of_line_wait_on_bit+0x6c/0x78
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8029a01a>] 
> >> wake_bit_function+0x0/0x23
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8021a7f1>] 
> >> submit_bh+0x10a/0x111
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff885284a7>] 
> >> :gfs2:gfs2_meta_check_ii+0x2c/0x38
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff88518d30>] 
> >> :gfs2:gfs2_meta_indirect_buffer+0x104/0x160
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff88509fc3>] 
> >> :gfs2:gfs2_block_map+0x1dc/0x33e
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8021a821>] 
> >> poll_freewait+0x29/0x6a
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8850a199>] 
> >> :gfs2:gfs2_extent_map+0x74/0xac
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8850a2ce>] 
> >> :gfs2:gfs2_write_alloc_required+0xfd/0x122
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff885128d5>] 
> >> :gfs2:gfs2_glock_nq+0x248/0x273
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8851a27c>] 
> >> :gfs2:gfs2_write_begin+0x99/0x36a
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8851bd1b>] 
> >> :gfs2:gfs2_file_buffered_write+0x14b/0x2e5
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8020d3a5>] 
> >> file_read_actor+0x0/0xfc
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8851c151>] 
> >> :gfs2:__gfs2_file_aio_write_nolock+0x29c/0x2d4
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8851c2f4>] 
> >> :gfs2:gfs2_file_write_nolock+0xaa/0x10f
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8022eca0>] 
> >> __wake_up+0x38/0x4f
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff80299fec>] 
> >> autoremove_wake_function+0x0/0x2e
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8022fbe4>] 
> >> pipe_readv+0x38e/0x3a2
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff80263bce>] 
> >> lock_kernel+0x1b/0x32
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8851c444>] 
> >> :gfs2:gfs2_file_write+0x49/0xa7
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff80216da9>] 
> >> vfs_write+0xce/0x174
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff802175e1>] 
> >> sys_write+0x45/0x6e
> >> Sep 19 02:02:07 192.168.100.104 kernel:  [<ffffffff8025f2f9>] 
> >> tracesys+0xab/0xb6
> >>



