
Re: [Linux-cluster] gfs deadlock situation



Hi Wendy,

thanks for your answer!
The system is still in the deadlocked state, so hopefully I can collect all
the information you need :-) (you'll find the crash output below).

Thanks,

Mark

> > we have the following deadlock situation:
> >
> > 2 node cluster consisting of node1 and node2.
> > /usr/local is placed on a GFS filesystem mounted on both nodes.
> > Lockmanager is dlm.
> > We are using RHEL4u4
> >
> > An strace of ls -l /usr/local/swadmin/mnx/xml hangs in
> > lstat("/usr/local/swadmin/mnx/xml",
> >
> > This happens on both cluster nodes.
> >
> > All processes trying to access the directory /usr/local/swadmin/mnx/xml
> > are in "Waiting for IO (D)" state. I.e. system load is at about 400 ;-)
> >
> > Any ideas ?
>
> Quickly browsing this, it looks to me as if the process with pid=5856 got
> stuck. That process held the exclusive lock on the file or directory
> (inode number 627732, probably /usr/local/swadmin/mnx/xml), so everyone
> was waiting for it. The faulty process was apparently in the middle of
> obtaining another exclusive lock (and almost got it). We need to know
> where pid=5856 was stuck at that time. If this occurs again, could you
> use "crash" to backtrace that process and show us the output?

Here's the crash output:

crash> bt 5856
PID: 5856   TASK: 10bd26427f0       CPU: 0   COMMAND: "java"
 #0 [10bd20cfbc8] schedule at ffffffff8030a1d1
 #1 [10bd20cfca0] wait_for_completion at ffffffff8030a415
 #2 [10bd20cfd20] glock_wait_internal at ffffffffa018574e
 #3 [10bd20cfd60] gfs_glock_nq_m at ffffffffa01860ce
 #4 [10bd20cfda0] gfs_unlink at ffffffffa019ce41
 #5 [10bd20cfea0] vfs_unlink at ffffffff801889fa
 #6 [10bd20cfed0] sys_unlink at ffffffff80188b19
 #7 [10bd20cff30] filp_close at ffffffff80178e48
 #8 [10bd20cff50] error_exit at ffffffff80110d91
    RIP: 0000002a9593f649  RSP: 0000007fbfffbca0  RFLAGS: 00010206
    RAX: 0000000000000057  RBX: ffffffff8011026a  RCX: 0000002a9cc9c870
    RDX: 0000002ae5989000  RSI: 0000002a962fa3a8  RDI: 0000002ae5989000
    RBP: 0000000000000000   R8: 0000002a9630abb0   R9: 0000000000000ffc
    R10: 0000002a9630abc0  R11: 0000000000000206  R12: 0000000040115700
    R13: 0000002ae23294b0  R14: 0000007fbfffc300  R15: 0000002ae5989000
    ORIG_RAX: 0000000000000057  CS: 0033  SS: 002b
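
A side note on reading the register dump: on x86_64, ORIG_RAX holds the
syscall number, and 0x57 is 87, which is unlink. That matches the sys_unlink
frame in the backtrace. A minimal Python sketch of the decoding follows; the
helper name and the small lookup table are illustrative only, and just a few
syscall numbers are listed.

```python
# Illustrative helper (not part of crash): decode ORIG_RAX from a "bt" dump.
# The table is deliberately tiny and covers only a few x86_64 syscalls.
X86_64_SYSCALLS = {
    4: "stat",
    6: "lstat",
    82: "rename",
    84: "rmdir",
    87: "unlink",
}


def syscall_from_orig_rax(hex_value):
    """Return (number, name) for a hex ORIG_RAX string such as '0000000000000057'."""
    number = int(hex_value, 16)
    return number, X86_64_SYSCALLS.get(number, "unknown")


if __name__ == "__main__":
    # Value taken from the ORIG_RAX field of the backtrace above.
    print(syscall_from_orig_rax("0000000000000057"))
```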

> > A lockdump analysis with decipher_lockstate_dump and parse_lockdump
> > shows the following output (The whole file is too large for the
> > mailing-list):
> >
> > Entries:  101939
> > Glocks:  60112
> > PIDs:  751
> >
> > 4 chain:
> > lockdump.node1.dec Glock (inode[2], 1114343)
> >   gl_flags = lock[1]
> >   gl_count = 5
> >   gl_state = shared[3]
> >   req_gh = yes
> >   req_bh = yes
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 1
> >   ail_bufs = no
> >   Request
> >     owner = 5856
> >     gh_state = exclusive[1]
> >     gh_flags = try[0] local_excl[5] async[6]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Waiter3
> >     owner = 5856
> >     gh_state = exclusive[1]
> >     gh_flags = try[0] local_excl[5] async[6]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Inode: busy
> > lockdump.node2.dec Glock (inode[2], 1114343)
> >   gl_flags =
> >   gl_count = 2
> >   gl_state = unlocked[0]
> >   req_gh = no
> >   req_bh = no
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 0
> >   ail_bufs = no
> >   Inode:
> >     num = 1114343/1114343
> >     type = regular[1]
> >     i_count = 1
> >     i_flags =
> >     vnode = yes
> > lockdump.node1.dec Glock (inode[2], 627732)
> >   gl_flags = dirty[5]
> >   gl_count = 379
> >   gl_state = exclusive[1]
> >   req_gh = no
> >   req_bh = no
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 58
> >   ail_bufs = no
> >   Holder
> >     owner = 5856
> >     gh_state = exclusive[1]
> >     gh_flags = try[0] local_excl[5] async[6]
> >     error = 0
> >     gh_iflags = promote[1] holder[6] first[7]
> >   Waiter2
> >     owner = none[-1]
> >     gh_state = shared[3]
> >     gh_flags = try[0]
> >     error = 0
> >     gh_iflags = demote[2] alloced[4] dealloc[5]
> >   Waiter3
> >     owner = 32753
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   [...loads of Waiter3 entries...]
> >   Waiter3
> >     owner = 4566
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Inode: busy
> > lockdump.node2.dec Glock (inode[2], 627732)
> >   gl_flags = lock[1]
> >   gl_count = 375
> >   gl_state = unlocked[0]
> >   req_gh = yes
> >   req_bh = yes
> >   lvb_count = 0
> >   object = yes
> >   new_le = no
> >   incore_le = no
> >   reclaim = no
> >   aspace = 0
> >   ail_bufs = no
> >   Request
> >     owner = 20187
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Waiter3
> >     owner = 20187
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   [...loads of Waiter3 entries...]
> >   Waiter3
> >     owner = 10460
> >     gh_state = shared[3]
> >     gh_flags = any[3]
> >     error = 0
> >     gh_iflags = promote[1]
> >   Inode: busy
> > 2 requests
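
To cross-check a dump like the one quoted above, here is a minimal Python
sketch that looks for PIDs that hold one glock while waiting on another. It
assumes the decoded text layout shown in the excerpt (with the "> > " quote
prefixes stripped); it is not part of decipher_lockstate_dump or
parse_lockdump, and the function name and the trimmed sample are made up for
illustration. "Request" entries are ignored here, since in the excerpt the
same owner also shows up as a waiter.

```python
import re
from collections import defaultdict

# Patterns assumed from the excerpt in this thread, not from a format spec.
GLOCK_RE = re.compile(r"^(\S+) Glock \((\w+)\[\d+\], (\d+)\)")
SECTION_RE = re.compile(r"^\s+(Holder|Request|Waiter\d+|Inode)\b")
OWNER_RE = re.compile(r"^\s+owner = (\d+)")

# Trimmed sample built from the two node1 glocks quoted above.
SAMPLE = """\
lockdump.node1.dec Glock (inode[2], 1114343)
  gl_state = shared[3]
  Request
    owner = 5856
    gh_state = exclusive[1]
  Waiter3
    owner = 5856
    gh_state = exclusive[1]
lockdump.node1.dec Glock (inode[2], 627732)
  gl_state = exclusive[1]
  Holder
    owner = 5856
    gh_state = exclusive[1]
  Waiter2
    owner = none[-1]
    gh_state = shared[3]
  Waiter3
    owner = 32753
    gh_state = shared[3]
"""


def holders_that_wait(text):
    """Map each PID that holds a glock and waits on a different one."""
    holds = defaultdict(set)
    waits = defaultdict(set)
    glock = None
    section = None
    for line in text.splitlines():
        m = GLOCK_RE.match(line)
        if m:
            # New glock entry: (dump file, lock type, lock number).
            glock = (m.group(1), m.group(2), int(m.group(3)))
            section = None
            continue
        m = SECTION_RE.match(line)
        if m:
            section = m.group(1)
            continue
        m = OWNER_RE.match(line)  # skips non-numeric owners like none[-1]
        if m and glock and section:
            pid = int(m.group(1))
            if section == "Holder":
                holds[pid].add(glock)
            elif section.startswith("Waiter"):
                waits[pid].add(glock)
    result = {}
    for pid, held in holds.items():
        blocked_on = waits.get(pid, set()) - held
        if blocked_on:
            result[pid] = {"holds": sorted(held), "waits_for": sorted(blocked_on)}
    return result


if __name__ == "__main__":
    for pid, info in holders_that_wait(SAMPLE).items():
        print(pid, info)
```

On the trimmed sample this reports pid 5856 as holding inode glock 627732
while waiting for 1114343, which is the chain described in the reply.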

-- 
Gruss / Regards,

Mark Hlawatschek
http://www.atix.de/               http://www.open-sharedroot.org/

ATIX - Ges. fuer Informationstechnologie und Consulting mbH
Einsteinstr. 10 - 85716 Unterschleissheim - Germany

