Re: [Cluster-devel] [GFS2] Git tree update

Steven Whitehouse wrote:
> Hi,
> On Mon, 2007-10-29 at 11:25 +0100, Fabio Massimo Di Nitto wrote:
>> Steven Whitehouse wrote:
>>> Hi,
>>> Since Linus released -rc1 when I was on holiday last week, I've just
>>> updated the git tree. There have been quite a lot of changes in the core
>>> kernel which affect GFS2 but which were not part of the GFS2 -nmw git
>>> tree in this last merge window, so that's another reason for updating so
>>> that we have all those changes in the -nmw tree now,
>>> Steve.
>> Hi Steve,
>> I am still getting a bunch of OOPS at mount time with this very latest tree.
>> Last commit in -nmw: 9904cd0ecaf9c72de837214fa1b359d4c87220a8
>> dmesg in attachment.
>> Cheers
>> Fabio
> This is a bit odd. I've been trying to track this down, but I've not
> been able to reproduce it here. The odd thing is that it appears from
> the context that the spin lock has been unlocked by something inside the
> call to run_queue() (from gfs2_glmutex_unlock()) but I can't spot where
> that has happened.

Could it be related to the setup I am using or the kernel config?

This is a testbed so I can do virtually everything to debug and test.

The device is one 5 GB aoe disk. The cluster is 3 nodes, all VMware Workstation
machines. The config is as simple as it can be to run such a cluster.

> Even more odd, the original report hit a different area of the code: the
> check for the spinlock in gfs2_demote_wake(). The reason that particular
> one stood out is that according to the stack trace it was called from
> drop_bh() where the preceding function call was to lock that very lock.
> So in that case it looks like a race with something else getting to the
> glock's spin lock in the space of the call to that function. If you are
> running with spin lock debugging, that means that the spin lock must
> have been initialised correctly, otherwise that would have flagged up
> earlier, so it looks like it's been overwritten, or that the glock has
> been freed underneath it, both of which events should be impossible.

I did try to look at the code, with very little luck. I could smell the race,
as I found the lock being taken right before entering the code that OOPSes, but
I have no idea what could have unlocked it. I am traveling until the 11th of Nov
with very little network connectivity, but the next steps are to try a clean
2.6.24-* without -nmw and start bisecting the tree to see when this problem was
introduced. Assuming it's not aoe causing it.

> So I'm still looking into this, but slightly confused at the moment. Has
> anybody else seen anything similar? I still can't figure out why you see
> this and I don't,

Not that I know of. I doubt there are that many people testing gfs2 "crack of
the day", as the only reason I was there was to test gfs1 and I told myself:
what the heck, let's give it a spin. Speaking of which, gfs1 works fine, so at
least the locking code in gfs2 that is shared between the two looks good.


