
Re: [Linux-cluster] gfs2 lockup



Hi,

On Mon, 2008-12-01 at 16:46 -0600, Brian Kroth wrote:
> Given the recent discussion of GFS2's stability I thought I'd chime in
> with a problem test case.
> 
> I've noticed a deadlock in the following situation:
> 
> 3 node Debian (Lenny) cluster of esx based vm nodes using either fibre
> channel or open-iscsi based storage.  Version 2.03.06 on the
> redhat-cluster-suite software, 0.80.3 openais, and 2.6.26 on the kernel.
> 
I'm not that familiar with the Debian kernel, so I don't know what fixes
might have been added recently. You might find that the problem goes
away if you upgrade to a more recent kernel, however...

> cssh node1 node2 node3
> cd /gfs2/
> mkdir $HOSTNAME
> echo $HOSTNAME > $HOSTNAME/test
> rm -rf *
> 
> The last command generally deadlocks at least one of the machines.  Any
> access attempts to the /gfs2 volume simply hang.  No logs in dmesg,
> messages, etc.  On a few occasions about 24 hours later it'll get
> fenced, but usually it's just stuck indefinitely.  I haven't had a
> chance to look into this in much more depth since I had to get something
> running so I just went back to OCFS2.  I now have an opportunity to test
> with things again, so if someone would like more information or could
> possibly tell me what's wrong that would be nice.
> 
> Thanks,
> Brian
> 
The first thing to check is that you have debugfs mounted on each node.
You can then look at the glock dumps which are located
under /sys/kernel/debug/gfs2/<fsname>/glocks. There are a number of
lines in this file, each relating to a particular glock.
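A minimal Python sketch of that first check, assuming the standard
/proc/mounts layout (device, mount point, filesystem type, ...). The
function names are hypothetical helpers, and the default root is simply
the path given above:

```python
def debugfs_mount_points(mounts_text):
    """Return the mount points whose filesystem type is debugfs."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # /proc/mounts columns: device, mount point, fstype, options, ...
        if len(fields) >= 3 and fields[2] == "debugfs":
            points.append(fields[1])
    return points


def glock_dump_path(fsname, debug_root="/sys/kernel/debug"):
    """Build the glock dump path for one filesystem (fsname is site-specific)."""
    return f"{debug_root}/gfs2/{fsname}/glocks"
```

On a node, pass the contents of /proc/mounts to debugfs_mount_points();
an empty list means debugfs still needs to be mounted before the glock
dump can be read.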

Lines starting G: relate to a glock, and lines below that, indented by a
single space also relate to that same glock. H: lines relate to the
holders of that glock, and if you look at the flags field which starts
f: then you can see if any of the holders are waiting for a lock (look
for the W (wait) flag). The holders are listed in order, granted holders
first (if any) and then waiting holders (if any). So the only
interesting holder in this case will be one with a W flag set that's
nearest to its associated glock.
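The search described above can be sketched in Python roughly as
follows. This assumes each field is a whitespace-separated key:value
token, as in the description; the SAMPLE text is made up for
illustration and real dumps carry more fields and line types:

```python
# Made-up dump text for illustration only; not taken from a real node.
SAMPLE = """\
G:  s:UN n:2/1a2b f:l t:EX d:EX/0
 H: s:EX f:W
G:  s:SH n:5/75320 f:I t:SH d:EX/0
 H: s:SH f:H
 H: s:EX f:W
"""


def parse_glocks(dump_text):
    """Group a dump into (G: line, [H: lines]) pairs, keeping holder order."""
    glocks = []
    for raw in dump_text.splitlines():
        line = raw.strip()
        if line.startswith("G:"):
            glocks.append((line, []))
        elif line.startswith("H:") and glocks:
            glocks[-1][1].append(line)
    return glocks


def field(line, key):
    """Return the value of a 'key:value' token, or '' if it is absent."""
    for token in line.split()[1:]:
        if token.startswith(key + ":"):
            return token[len(key) + 1:]
    return ""


def blocked_glocks(dump_text):
    """Return (G: line, H: line) pairs whose first holder has the W flag."""
    blocked = []
    for g_line, holders in parse_glocks(dump_text):
        if holders and "W" in field(holders[0], "f"):
            blocked.append((g_line, holders[0]))
    return blocked
```

Only the first glock in SAMPLE is reported: the second also has a
waiter, but a granted holder is queued ahead of it.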

Looking back at the associated G: line, there are various lock modes
listed. The s: field shows the current state of the glock. The t: field
shows the target state. The target state is only of interest if the l
(locked) flag is set on the glock itself (again f: is the flags field).
In that case it tells you that there is a remote lock request in
progress (i.e. a request has been sent to the DLM) to convert from the
current lock mode (s:) to the target lock mode (t:). Demote requests are
issued from the DLM when it receives a lock request which conflicts with
an existing holder. In that case, the D flag is set on the glock and the
d: field shows the state which has been requested along with the time
(in jiffies) since the demote request was received.
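As a rough Python sketch of reading a G: line this way, assuming
key:value tokens and a d: field of the form state/jiffies (the example
line used in the note below is made up, and describe_glock is a
hypothetical helper):

```python
def parse_fields(g_line):
    """Turn 'G:  s:EX n:2/1a2b f:lD ...' into a dict of key -> value."""
    fields = {}
    for token in g_line.split()[1:]:
        key, sep, value = token.partition(":")
        if sep:
            fields[key] = value
    return fields


def describe_glock(g_line):
    """Summarise the state, any remote request, and any demote request."""
    fields = parse_fields(g_line)
    flags = fields.get("f", "")
    notes = [f"glock {fields.get('n', '?')} is in state {fields.get('s', '?')}"]
    if "l" in flags:
        # l (locked): a request to the DLM is in progress, from s: to t:
        notes.append(
            f"remote request in progress: {fields.get('s')} -> {fields.get('t')}"
        )
    if "D" in flags:
        # D: a demote request is pending; d: is assumed to be state/jiffies
        state, _, age = fields.get("d", "").partition("/")
        notes.append(f"demote to {state} requested {age} jiffies ago")
    return notes
```

Calling describe_glock("G:  s:EX n:2/1a2b f:lD t:UN d:UN/1500") would
list the current state, the EX to UN conversion in progress, and the
pending demote request.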

I know all that sounds quite complicated, but in fact it's usually pretty
easy to find the cause of deadlocks. It is usually just a matter of
first tracking down holders (H:) which are first in the queue (i.e.
immediately after a G:) with the W flag set, and then looking at the
lock with the same number (the n: field of the G: line) across the
cluster to see which node is still holding that lock (i.e. s: is not UN)
and then checking the remaining flags to see why that is the case.
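The cross-cluster step could be sketched in Python like this, assuming
you have already copied each node's glock dump into a string. The node
names and dump contents used in the note below are placeholders, and
holders_of is a hypothetical helper:

```python
def holders_of(lock_number, dumps_by_node):
    """Return {node: G: line} for nodes where the lock's s: is not UN."""
    result = {}
    for node, dump_text in dumps_by_node.items():
        for raw in dump_text.splitlines():
            line = raw.strip()
            if not line.startswith("G:"):
                continue
            tokens = line.split()
            # Match on the n: field, then skip nodes that hold nothing (s:UN)
            if f"n:{lock_number}" in tokens and "s:UN" not in tokens:
                result[node] = line
    return result
```

For example, holders_of("2/1a2b", {"node1": dump1, "node2": dump2})
returns the G: line from each node that still holds lock 2/1a2b, and the
f: field of that line is the next thing to inspect.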

There is a tool which does some of this automatically, although I've not
tried it myself as I tend to use the manual method still. If you get
stuck then please file a bug (just file it against Fedora/rawhide and
mark it as Debian in the comments somewhere, so we know which kernel it
is) and attach the glock dumps to it and then we can take a look at it.

I have it on my TODO list to write this up properly at some stage and
turn it into a GFS2 debugging FAQ or something like that. At the moment
the only documentation on glocks is the
linux-2.6/Documentation/filesystems/gfs2-glocks.txt file, although that's
aimed more at developers than users, I'm afraid.

Steve.



