[Linux-cluster] GFS2/DLM deadlock

Bob Peterson rpeterso at redhat.com
Fri Sep 7 20:57:51 UTC 2012


----- Original Message -----
| I have a 2-node cluster (two HP DL360G7 servers) with a shared gfs2
| file system located on an HP Modular Smart Array.
| Node1 is the 'active' server and performs almost all gfs2 access.
| Node2 is a 'passive' backup and rarely accesses the shared file
| system.
| 
| Both nodes are currently running kernel-PAE-2.6.18-274.17.1.el5.i686.
| 
| I am aware of the kernel updates available in the Red Hat 5.8 release
| and have reviewed the change logs and the associated bug reports I
| have access to, to determine whether the handful of gfs2 changes
| might apply to this situation. They do not seem to, but we plan on
| upgrading our production servers when we can to rule out that
| possibility.
| 
| Intermittently (3-4 times a month) the gfs2 file system appears to
| lock up and any processes attempting to access it enter D state.
| Networking continues to function and openais is happy, so no fencing
| occurs. Power cycling the passive node breaks the deadlock, and
| processing on the active node continues.
| 
| During the last hang we ran the gfs2_hangalyzer tool, suggested in
| some older threads on the deadlock subject, to capture the dlm and
| glock info.
| 
| I can't find explanations of what some of the fields mean, so I'm
| hoping someone can help me interpret the results, confirm whether my
| understanding of the output is correct, or offer suggestions on how
| to debug further when it happens again. So far we haven't been able
| to come up with a reproduction scenario.
| 
| I have attached the gfs2_hangalyzer summary output as hangalyzer.txt.
| I have the raw lock data as well if required.
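| 
| (For reference, my understanding is that the tool mostly gathers the
| glock and dlm dumps from debugfs on both nodes, i.e. roughly:
| 
|     cat /sys/kernel/debug/gfs2/*/glocks   # per-filesystem glock dump
|     cat /sys/kernel/debug/dlm/FS1         # dlm lock dump for the FS1
|                                           # lockspace
| 
| so please correct me if it does more than that.)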
| 
| The tool reports that there are two glocks on which processes are
| waiting, but no other process holds them. That looks like a
| deadlock: if no process owned them, they should have been released.
| 
| The tool also reports that the two glocks were granted to two process
| IDs.
| 
| This is an excerpt from the hangalyzer output:
| 
| --------------------------------------------
| There are 2 glocks with waiters.
| node1, pid 5380 is waiting for glock 2/85187, but no holder was found.
|          The dlm has granted lkb "       2           85187" to pid 5021
| 
| 
|                 lkb_id N RemoteID  pid exflg lkbflgs stat gr rq waiting n ln   resource name
| node1  : FS1:  3e00003 2  10c0002 5021     0   10000 grnt  5 -1  0 0 24 "       2           85187"
| node1  : FS1:  1501c6a 0        0 5380     0       0 wait -1  3  0 0 24 "       2           85187"
| node2  : FS1: G:  s:EX n:2/85187 f:dyq t:EX d:SH/0 l:0 a:0 r:4 m:150
| node2  :                         (pending demote, dirty, holder queued)
| node2  : FS1:  I: n:1711/545159 t:8 f:0x10 d:0x00000000 s:957/957
| 
|                 lkb_id N RemoteID  pid exflg lkbflgs stat gr rq waiting n ln   resource name
| node2  : FS1:  10c0002 1  3e00003 5021     0       0 grnt  5 -1  0 1 24 "       2           85187"
| --------------------------------------------
| 
| As I understand this, on node1 the resource name "2 85187" is granted
| (grnt) to process 5021 on node2 while process 5380 is in wait mode on
| it.
| At the same time, node2 sees that resource name "2 85187" is granted
| (grnt) to process 5021 on node1.
| On node1, process ID 5021 is [glock_workqueue].
| From 'ps axl':
| 1     0  5021    67  10  -5      0     0 worker S<   ?          0:07 [glock_workqueue]
| 
| A similar thing occurs for resource name "2 81523".
| 
| --------------------------------------------
|                 lkb_id N RemoteID  pid exflg lkbflgs stat gr rq waiting n ln   resource name
| node1  : FS1:  2f20002 2  2970001 5021    44   10000 grnt  3 -1  0 0 24 "       2           81523"
| node1  : FS1:  3961d2b 0        0 5022     0       0 wait -1  5  0 0 24 "       2           81523"
| node2  : FS1: G:  s:SH n:2/81523 f:dq t:SH d:UN/0 l:0 a:0 r:4 m:100
| node2  :                         (pending demote, holder queued)
| node2  : FS1:  I: n:126/529699 t:4 f:0x10 d:0x00000001 s:3864/3864
| 
|                 lkb_id N RemoteID  pid exflg lkbflgs stat gr rq waiting n ln   resource name
| node2  : FS1:  2970001 1  2f20002 5029    44       0 grnt  3 -1  0 1 24 "       2           81523"
| --------------------------------------------
| 
| On node1 the resource "2 81523" is granted to process 5021 on node2,
| while local process 5022 waits on it.
| On node2, the lock appears to be granted to process 5029 from node1.
| On node1, process ID 5029 is [delete_workqueu].
| From 'ps axl':
| 1     0  5029    67  10  -5      0     0 worker S<   ?          0:00 [delete_workqueu]
| 
| Is my understanding of this output correct?
| Is there more info I should try to gather to diagnose the issue when
| it happens again?
| 
| --
| Linux-cluster mailing list
| Linux-cluster at redhat.com
| https://www.redhat.com/mailman/listinfo/linux-cluster

Yes, it sounds like you have the basics right.

The question is: what happened to process 5021, and how did it
dequeue the glock without granting it to one of the waiters?
Did process 5021 show up in ps? If so, I'd dump its call trace
to see what it's doing. In RHEL6 that's a bit easier, for example,
cat /proc/5021/stack or some such. In RHEL/CentOS 5 you can always
echo t > /proc/sysrq-trigger and check the console, although if you
don't have your post_fail_delay set high enough, it can cause your node
to get fenced during the output.
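
In other words, something along these lines the next time it hangs
(just a sketch; adjust the pid, and check that your kernel log buffer
is big enough to hold the full sysrq-t output):

  # RHEL6/CentOS 6: per-process kernel stack
  cat /proc/5021/stack

  # RHEL5/CentOS 5: dump all task states to the kernel log, then save it.
  # Bump <fence_daemon post_fail_delay="..."/> in cluster.conf first so
  # the node isn't fenced while the output is being written.
  echo t > /proc/sysrq-trigger
  dmesg > /tmp/sysrq-t.txt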

At a quick glance, I can't really see any critical patches missing
from that kernel, although there are a few possibilities;
a lot of work has been done since that version. Any chance of moving
to RHEL or CentOS 6.3? Debugging these kinds of issues is easier with
RHEL6 because we have gfs2 kernel-level tracing and such, which doesn't
exist in the 2.6.18 kernels.
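
For example, with a 6.x kernel you can watch glock activity live
through the standard ftrace interface, roughly like this (assuming
debugfs is mounted in the usual place):

  mount -t debugfs none /sys/kernel/debug        # if not already mounted
  echo 1 > /sys/kernel/debug/tracing/events/gfs2/enable
  cat /sys/kernel/debug/tracing/trace_pipe > /tmp/gfs2-trace.txt

That would show events such as gfs2_glock_state_change and
gfs2_demote_rq for the glocks you're interested in.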

Regards,

Bob Peterson
Red Hat File Systems



