[Linux-cluster] GFS 2 node hang in rm test

Fri Dec 10 09:14:53 UTC 2004

On Thu, Dec 09, 2004 at 04:52:42PM -0500, Daniel Phillips wrote:
> On Tuesday 07 December 2004 04:38, Patrick Caulfield wrote:
> > On Mon, Dec 06, 2004 at 04:13:50PM -0800, Daniel McNeil wrote:
> > > On Mon, 2004-12-06 at 11:45, Ken Preslan wrote:
> > > > On Fri, Dec 03, 2004 at 03:08:00PM -0800, Daniel McNeil wrote:
> > >
> > > Looking at the stack trace above and disassembling dlm.ko,
> > > it looks like dlm_lock+0x319 is the call to dlm_lock_stage1().
> > > Looking at dlm_lock_stage1(), it looks like it is sleeping on
> > >   down_write(&rsb->res_lock)
> > >
> > > So now I have to find who is holding the res_lock.
> >
> > That's consistent with the hang you reported before - in fact it's
> > almost certainly the same thing. My guess is that there is a deadlock
> > on res_lock somewhere. In which case I suspect it's going to be
> > easier to find that one by reading code rather than running tests.
> > res_lock should never be held for any extended period of time, but in
> > your last set of tracebacks there was nothing obviously holding it -
> > so I suspect something is sleeping with it.
> 
> Hi Patrick,
> 
> Last week I had a bug in the cluster snapshot failover code that exposed 
> a bug in dlm or libdlm I think.  My code inadvertently acquired a lock 
> twice, first in PW mode, then later in CR mode (because I wasn't 
> checking to see if it already had the PW lock).  This caused 
> dlm_unlock_wait to wait forever.  Are these locks supposed to be 
> recursive or not?  In any event, waiting forever has got to be a bug. 
> 
> It might have something to do with a lkid tangle, since I never provided 
> separate lkids for the unlock.
> 
> This should be easily reproducible.

Thanks, I'll have a look at that. It's certainly a bug in your code that's
causing it but, as you say, a hang in the library is still a bug in the library.

If you call dlm_lock() with the same parameters twice then you will get two
separate locks - in that case the CR will wait for the PW to get out of the
way. What is really confusing the issue is that both locks are sharing a lock
status block, so it's possible that one lock is being marked INPROGRESS
immediately after the first has completed. In that case the library will have
a lot of difficulty in digging you out of your own hole, I'm afraid, as it
can't disentangle the status of your two locks.

Anyway, I'll see if I can see what's happening.

-- 

patrick