[Linux-cluster] GFS 3 node hang in rm test

Wed Dec 8 00:13:12 UTC 2004

The latest hang is a 3 node remove hang.

I have stack traces, lockdump output from gfs_tool lockdump,
and dlm_locks output from all 3 nodes, except for the lockdump
output on node cl032, where gfs_tool itself is stuck in:

gfs_tool      D 00000008     0 20033   2778                     (NOTLB)
f70c1d90 00000086 f70c1d7c 00000008 00000001 c03d8315 00000008 00000001
       d857ddc0 00001000 f70c1d8c c0180832 f689b2d8 e14890d0 00000000 c170e8c0
       c170df60 00000000 000975b2 6d2af78e 000044d3 e08b6ef0 e08b7050 00000000
Call Trace:
 [<c03d39d4>] wait_for_completion+0xa4/0xe0
 [<f8b3bd8b>] glock_wait_internal+0x3b/0x270 [gfs]
 [<f8b3c2f6>] gfs_glock_nq+0x86/0x130 [gfs]
 [<f8b3cae4>] gfs_glock_nq_init+0x34/0x50 [gfs]
 [<f8b56cda>] gfs_permission+0x4a/0x90 [gfs]
 [<c016c807>] permission+0x47/0x50
 [<c016e45f>] may_open+0x5f/0x220
 [<c016e6c7>] open_namei+0xa7/0x6e0
 [<c015d691>] filp_open+0x41/0x70
 [<c015daf6>] sys_open+0x46/0xa0
 [<c010537d>] sysenter_past_esp+0x52/0x71

The problem looks like it is on cl032, but is a little
different:

dlm_recvd     D C170DF98     0 19721      4         19722 19720 (L-TLB)
c7a3dd30 00000046 eb6f1450 c170df98 0000399e c5cbd712 00000008 0000399e
       f5208dc0 c5d11e5d 0000399e c170df98 0000000a eb6f1450 00000000 c170e8c0
       c170df60 00000000 00000971 c5d17c58 0000399e d50488b0 d5048a10 00000000
Call Trace:
 [<c03d409c>] rwsem_down_write_failed+0x9c/0x18e
 [<f8b7a28d>] .text.lock.locking+0xa6/0x1c9 [dlm]
 [<f8b78c00>] dlm_lock_stage2+0x60/0xd0 [dlm]
 [<f8b7ae7a>] process_lockqueue_reply+0x3aa/0x770 [dlm]
 [<f8b7c286>] process_cluster_request+0x816/0xeb0 [dlm]
 [<f8b80917>] midcomms_process_incoming_buffer+0x167/0x270 [dlm]
 [<f8b7e249>] receive_from_sock+0x189/0x2e0 [dlm]
 [<f8b7f3a6>] process_sockets+0x76/0xc0 [dlm]
 [<f8b7f616>] dlm_recvd+0x86/0xa0 [dlm]
 [<c013426a>] kthread+0xba/0xc0
 [<c0103325>] kernel_thread_helper+0x5/0x10

There are also a bunch of 'df' processes from cron which are
looping forever in the kernel.  They are looping in
stat_gfs_async().

So the problem is similar: a process is stuck on a down_write
of a res_lock.  I'm assuming that is causing all the other
problems.

All the info is available here:
http://developer.osdl.org/daniel/GFS/rm.hang.07dec2004/
I've included the dlm_debug output also, but I do not know
how to read the output.

I'm planning to reboot with a kernel that has more DEBUG options
turned on (DEBUG_SLAB) to be sure that it is not accessing
freed memory.

Any other ideas on debugging?

Daniel