[Cluster-devel] GFS2: Umount recovery race fix

Tue Aug 11 08:42:39 UTC 2009

Hi,

On Mon, 2009-08-10 at 17:31 -0500, David Teigland wrote:
> On Thu, May 14, 2009 at 02:13:17PM +0100, Steven Whitehouse wrote:
> > 
> > This patch fixes a race condition where we can receive recovery
> > requests part way through processing a umount. This was causing
> > problems since the recovery thread had already gone away.
> 
> Do you have some logs showing specifically what happened in both kernel and
> userland?
> 
Yes, the one you sent to me on Fri, 8 May 2009 11:34:54 -0500 (17:34
BST). Next time please file a bugzilla so that we have a proper record
of the issues.

> > Looking in more detail at the recovery code, it was really trying
> > to implement a slight variation on a work queue, and that happens to
> > align nicely with the recently introduced slow-work subsystem. As a
> > result I've updated the code to use slow-work, rather than its own
> > home-grown variety of work queue.
> > 
> > When using the wait_on_bit() function, I noticed that the wait function
> > that was supplied as an argument was appearing in the WCHAN field, so
> > I've updated the function names in order to produce more meaningful
> > output.
> 
> That description doesn't explain how the specific bug was fixed.
> 
The bug was fixed by not allowing recovery on a filesystem after a
umount has occurred.
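
To illustrate the idea only (the patch itself is not reproduced in this
message), here is a minimal user-space sketch in Python. The class and
method names are hypothetical and this is not GFS2 code; it assumes the
fix amounts to checking an "unmounting" flag under the same lock that
queues recovery work:

```python
import threading


class Filesystem:
    """Hypothetical stand-in for a mounted filesystem; not GFS2 code."""

    def __init__(self):
        self._lock = threading.Lock()
        self._unmounting = False
        self.recovered = []  # journal ids accepted for recovery

    def request_recovery(self, jid):
        # Check the flag and queue the work under one lock, so a request
        # cannot slip in after umount has started.
        with self._lock:
            if self._unmounting:
                return False
            self.recovered.append(jid)
            return True

    def begin_umount(self):
        # From this point on, every recovery request is refused.
        with self._lock:
            self._unmounting = True


fs = Filesystem()
accepted_before = fs.request_recovery(3)  # accepted: still mounted
fs.begin_umount()
accepted_after = fs.request_recovery(4)   # refused: umount in progress
```

The point of holding one lock for both the check and the queueing is
that there is no window in which a request is accepted after the
recovery machinery has been torn down.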

> I'm guessing that this is the patch that broke gfs2 recovery, although there
> are others that muck around with the sysfs control files.
> 
> This is what appears in /var/log/messages,
> 
> gfs_controld[7901]: start_journal_recovery 3 error -1
> 
> And from the daemon debug log,
> 
> 1249942342 foo start_journal_recovery jid 3
> 1249942342 foo set /sys/fs/gfs2/bull:foo/lock_module/recover to 3
> 1249942342 foo set open /sys/fs/gfs2/bull:foo/lock_module/recover error -1 13
> 1249942342 start_journal_recovery 3 error -1
> 
> Dave
> 
I'll have a look - EACCES (errno 13) is not one of the errno values
which the recover code returns though,
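
For reference, a quick way to decode the number from the daemon log,
assuming the trailing 13 in "error -1 13" is the errno left by the
failed open(). This is only a sketch using the Python standard library;
on Linux, 13 is EACCES ("Permission denied"), while EPERM is 1:

```python
import errno
import os

code = 13  # second number in "error -1 13" from the debug log

# Map the numeric value back to its symbolic name and message.
name = errno.errorcode[code]
message = os.strerror(code)

print(code, name, message)
print(errno.EPERM, "EPERM", os.strerror(errno.EPERM))
```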

Steve.