[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]

Re: [Linux-cluster] gfs deadlock situation



Mark Hlawatschek wrote:
Hi,

we have the following deadlock situation:

2 node cluster consisting of node1 and node2. /usr/local is placed on a GFS filesystem mounted on both nodes. Lockmanager is dlm.
We are using RHEL4u4

a strace to ls -l /usr/local/swadmin/mnx/xml ends up in
lstat("/usr/local/swadmin/mnx/xml",
This happens on both cluster nodes.

All processes trying to access the directory /usr/local/swadmin/mnx/xml are in "Waiting for IO (D)" state. I.e. system load is at about 400 ;-)

Any ideas ?
Quickly browsing this, look to me that process with pid=5856 got stuck. That process had the file or directory (ino number 627732 - probably /usr/local/swadmin/mnx/xml) exclusive lock so everyone was waiting for it. The faulty process was apparently in the middle of obtaining another exclusive lock (and almost got it). We need to know where pid=5856 was stuck at that time. If this occurs again, could you use "crash" to back trace that process and show us the output ?

-- Wendy
a lockdump analysis with the decipher_lockstate_dump and parse_lockdump shows the following output (The whole file is too large for the mailing-list):

Entries:  101939
Glocks:  60112
PIDs:  751

4 chain:
lockdump.node1.dec Glock (inode[2], 1114343)
  gl_flags = lock[1]
  gl_count = 5
  gl_state = shared[3]
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 1
  ail_bufs = no
  Request
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
lockdump.node2.dec Glock (inode[2], 1114343)
  gl_flags =
  gl_count = 2
  gl_state = unlocked[0]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Inode:
    num = 1114343/1114343
    type = regular[1]
    i_count = 1
    i_flags =
    vnode = yes
lockdump.node1.dec Glock (inode[2], 627732)
  gl_flags = dirty[5]
  gl_count = 379
  gl_state = exclusive[1]
  req_gh = no
  req_bh = no
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 58
  ail_bufs = no
  Holder
    owner = 5856
    gh_state = exclusive[1]
    gh_flags = try[0] local_excl[5] async[6]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Waiter2
    owner = none[-1]
    gh_state = shared[3]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter3
    owner = 32753
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  [...loads of Waiter3 entries...]
  Waiter3
    owner = 4566
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
lockdump.node2.dec Glock (inode[2], 627732)
  gl_flags = lock[1]
  gl_count = 375
  gl_state = unlocked[0]
  req_gh = yes
  req_bh = yes
  lvb_count = 0
  object = yes
  new_le = no
  incore_le = no
  reclaim = no
  aspace = 0
  ail_bufs = no
  Request
    owner = 20187
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 20187
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  [...loads of Waiter3 entries...]
  Waiter3
    owner = 10460
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy
2 requests



[Date Prev][Date Next]   [Thread Prev][Thread Next]   [Thread Index] [Date Index] [Author Index]