
[Linux-cluster] GFS 2 node hang in rm test



I ran my test script
(http://developer.osdl.org/daniel/gfs_tests/test.sh) overnight.

It completed 17 test runs before hanging in an rm during a 2-node test.
The /gfs_stripe5 is mounted on cl030 and cl031.

process 28723 (rm) on cl030 is hung.
process 29693 (updatedb) is also hung on cl030.

process 29537 (updatedb) is hung on cl031.

I have stack traces, lockdump output, and lock debug output
from both nodes here:

http://developer.osdl.org/daniel/GFS/gfs_2node_rm_hang/


gfs_tool/decipher_lockstate_dump cl030.lockdump shows:

Glock (inode[2], 39860)
  gl_flags =
  gl_count = 6
  gl_state = shared[3]
  lvb_count = 0
  object = yes
  aspace = 2
  reclaim = no
  Holder
    owner = 28723
    gh_state = shared[3]
    gh_flags = atime[9]
    error = 0
    gh_iflags = promote[1] holder[6] first[7]
  Waiter2
    owner = none[-1]
    gh_state = unlocked[0]
    gh_flags = try[0]
    error = 0
    gh_iflags = demote[2] alloced[4] dealloc[5]
  Waiter3
    owner = 29693
    gh_state = shared[3]
    gh_flags = any[3]
    error = 0
    gh_iflags = promote[1]
  Inode: busy

gfs_tool/decipher_lockstate_dump cl031.lockdump shows:

Glock (inode[2], 39860)
  gl_flags = lock[1]
  gl_count = 5
  gl_state = shared[3]
  lvb_count = 0
  object = yes
  aspace = 1
  reclaim = no
  Request
    owner = 29537
    gh_state = exclusive[1]
    gh_flags = local_excl[5] atime[9]
    error = 0
    gh_iflags = promote[1]
  Waiter3
    owner = 29537
    gh_state = exclusive[1]
    gh_flags = local_excl[5] atime[9]
    error = 0
    gh_iflags = promote[1]
  Inode: busy

Is there any documentation on what these fields are?

What is the difference between Waiter2 and Waiter3?

If I understand this correctly, updatedb (29537) on cl031
is trying to convert the glock from shared -> exclusive, while
rm (28723) on cl030 is holding the glock shared and
updatedb (29693) on cl030 is waiting to acquire the glock shared.
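To sanity-check that reading, here is a small model of standard DLM lock-mode compatibility (shared/PR vs. exclusive/EX). This is an illustration only, not GFS or DLM source code; the mode names follow the glock dump and the compatibility table is the usual DLM matrix:

```python
# Hypothetical model of DLM lock-mode compatibility for the scenario above.
# COMPATIBLE[(requested, held)] -> can both be granted at the same time?
COMPATIBLE = {
    ("shared", "shared"): True,
    ("shared", "exclusive"): False,
    ("exclusive", "shared"): False,
    ("exclusive", "exclusive"): False,
}

def can_grant(requested, held_modes):
    """A request is grantable only if compatible with every current holder."""
    return all(COMPATIBLE[(requested, h)] for h in held_modes)

# cl030: rm (28723) holds the glock shared.
held = ["shared"]

# cl031: updatedb (29537) requests exclusive -> incompatible, must wait
# (and should trigger a blocking callback asking the holder to demote).
assert not can_grant("exclusive", held)

# A second shared request (updatedb 29693 on cl030) is compatible with the
# shared holder by itself, but it queues behind the pending exclusive
# conversion, so it waits too -- which would match the dumps.
assert can_grant("shared", held)
```

On this model, everything stays blocked for as long as the shared holder on cl030 is never asked (or never responds) to demote, which is why the bast question below matters.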

Questions:

How does one know which node is the master for a lock?

Shouldn't cl030 be notified (via a bast) that the updatedb on cl031
is trying to convert from shared to exclusive?

What does the gfs_tool/parse_lockdump script do?

I have included the output from /proc/cluster/lock_dlm/debug,
but I have no idea how to interpret that data.  Any hints?

Anything else I can do to debug this further?

Thanks,

Daniel



