[Linux-cluster] GFS 3 node hang in rm test
Daniel McNeil
daniel at osdl.org
Wed Dec 8 00:13:12 UTC 2004
The latest hang is a 3 node remove hang.
I have stack traces, lockdump output from gfs_tool lockdump,
and dlm_locks output from all 3 nodes, except for the lockdump
output on node cl032, where gfs_tool itself is stuck in:
gfs_tool D 00000008 0 20033 2778 (NOTLB)
f70c1d90 00000086 f70c1d7c 00000008 00000001 c03d8315 00000008 00000001
d857ddc0 00001000 f70c1d8c c0180832 f689b2d8 e14890d0 00000000 c170e8c0
c170df60 00000000 000975b2 6d2af78e 000044d3 e08b6ef0 e08b7050 00000000
Call Trace:
[<c03d39d4>] wait_for_completion+0xa4/0xe0
[<f8b3bd8b>] glock_wait_internal+0x3b/0x270 [gfs]
[<f8b3c2f6>] gfs_glock_nq+0x86/0x130 [gfs]
[<f8b3cae4>] gfs_glock_nq_init+0x34/0x50 [gfs]
[<f8b56cda>] gfs_permission+0x4a/0x90 [gfs]
[<c016c807>] permission+0x47/0x50
[<c016e45f>] may_open+0x5f/0x220
[<c016e6c7>] open_namei+0xa7/0x6e0
[<c015d691>] filp_open+0x41/0x70
[<c015daf6>] sys_open+0x46/0xa0
[<c010537d>] sysenter_past_esp+0x52/0x71
The problem looks like it is on cl032, but is a little
different:
dlm_recvd D C170DF98 0 19721 4 19722 19720 (L-TLB)
c7a3dd30 00000046 eb6f1450 c170df98 0000399e c5cbd712 00000008 0000399e
f5208dc0 c5d11e5d 0000399e c170df98 0000000a eb6f1450 00000000 c170e8c0
c170df60 00000000 00000971 c5d17c58 0000399e d50488b0 d5048a10 00000000
Call Trace:
[<c03d409c>] rwsem_down_write_failed+0x9c/0x18e
[<f8b7a28d>] .text.lock.locking+0xa6/0x1c9 [dlm]
[<f8b78c00>] dlm_lock_stage2+0x60/0xd0 [dlm]
[<f8b7ae7a>] process_lockqueue_reply+0x3aa/0x770 [dlm]
[<f8b7c286>] process_cluster_request+0x816/0xeb0 [dlm]
[<f8b80917>] midcomms_process_incoming_buffer+0x167/0x270 [dlm]
[<f8b7e249>] receive_from_sock+0x189/0x2e0 [dlm]
[<f8b7f3a6>] process_sockets+0x76/0xc0 [dlm]
[<f8b7f616>] dlm_recvd+0x86/0xa0 [dlm]
[<c013426a>] kthread+0xba/0xc0
[<c0103325>] kernel_thread_helper+0x5/0x10
There are also a bunch of 'df' processes from cron which are
looping forever in the kernel. They are looping in
stat_gfs_async().
So the problem is similar: a process is stuck on a down_write
of a res_lock. I'm assuming that is causing all the other
problems.
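
For anyone sifting through the stack traces from all three nodes,
a small script can pick out the tasks in D state that are blocked
in a given function. This is only a minimal sketch, assuming the
traces are saved as plain text in the same layout as the excerpts
above. The names parse_tasks and blocked_in are hypothetical
helpers, not part of any GFS or DLM tooling, and the sample text
is abridged from the traces in this message.

```python
import re

# Task header in the trace output, e.g.
# "dlm_recvd D C170DF98 0 19721 4 19722 19720 (L-TLB)"
# (assumed layout: name, state, 8 hex digits, a number, then the pid)
TASK_RE = re.compile(r"^(\S+)\s+([RSDTZX])\s+[0-9A-Fa-f]{8}\s+\d+\s+(\d+)\s")
# Call trace frame, e.g. "[<c03d409c>] rwsem_down_write_failed+0x9c/0x18e"
FRAME_RE = re.compile(r"^\[<[0-9a-f]+>\]\s+(\S+?)\+0x[0-9a-f]+/0x[0-9a-f]+")


def parse_tasks(text):
    """Return a list of dicts with name, state, pid and frame symbols."""
    tasks = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        header = TASK_RE.match(line)
        if header:
            current = {
                "name": header.group(1),
                "state": header.group(2),
                "pid": int(header.group(3)),
                "frames": [],
            }
            tasks.append(current)
            continue
        frame = FRAME_RE.match(line)
        if frame and current is not None:
            current["frames"].append(frame.group(1))
    return tasks


def blocked_in(tasks, symbol):
    """Tasks in uninterruptible sleep (D) with `symbol` in their call trace."""
    return [t for t in tasks if t["state"] == "D" and symbol in t["frames"]]


# Abridged sample taken from the traces in this message.
SAMPLE = """\
gfs_tool D 00000008 0 20033 2778 (NOTLB)
f70c1d90 00000086 f70c1d7c 00000008 00000001 c03d8315 00000008 00000001
Call Trace:
[<c03d39d4>] wait_for_completion+0xa4/0xe0
[<f8b3bd8b>] glock_wait_internal+0x3b/0x270 [gfs]
dlm_recvd D C170DF98 0 19721 4 19722 19720 (L-TLB)
c7a3dd30 00000046 eb6f1450 c170df98 0000399e c5cbd712 00000008 0000399e
Call Trace:
[<c03d409c>] rwsem_down_write_failed+0x9c/0x18e
[<f8b7a28d>] .text.lock.locking+0xa6/0x1c9 [dlm]
[<f8b78c00>] dlm_lock_stage2+0x60/0xd0 [dlm]
"""

if __name__ == "__main__":
    for task in blocked_in(parse_tasks(SAMPLE), "rwsem_down_write_failed"):
        print(task["name"], task["pid"], "->", task["frames"][0])
```

Running the same filter over the saved traces from each node would
show whether any other task is sitting in the down_write path.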
All the info is available here:
http://developer.osdl.org/daniel/GFS/rm.hang.07dec2004/
I've included the dlm_debug output also, but I do not know
how to read the output.
I'm planning to reboot with a kernel that has more DEBUG options
turned on (DEBUG_SLAB) to be sure that it is not accessing
freed memory.
Any other ideas on debugging?
Daniel