[Linux-cluster] umount hang on 2.6.10 and latest GFS

Fri Jan 28 01:41:10 UTC 2005

I hit a umount hang while running my tests.  The test was running with
2 nodes mounted, cl030 and cl031.  It had finished a test
and was unmounting cl030 when the umount hung.  cl031 seems fine,
with the gfs file system still mounted.

The gfs file system on cl030 is unmounted (not in /proc/mounts), but
the umount is hung trying to stop dlm_astd.

Here are the stack traces for umount and dlm_astd:

umount        D 00000008     0 10453  10447                     (NOTLB)
cdaa8de4 00000082 cdaa8dd4 00000008 00000001 000c0000 00000008 00000002
       c1bce798 00000286 e8f782e0 cdaa8dc4 c0116871 e9db55e0 960546f9 c170ef60
       00000000 0001fba3 0167aae6 00005e6b f74f8080 f74f81ec c170ef60 00000000
Call Trace:
 [<c03ce814>] wait_for_completion+0xa4/0xe0
 [<c01326a5>] kthread_stop+0x85/0xae
 [<f8aca033>] astd_stop+0x13/0x32 [dlm]
 [<f8ad1e51>] dlm_release+0x91/0xa0 [dlm]
 [<f8ad2832>] release_lockspace+0x222/0x2f0 [dlm]
 [<f8c2f22c>] release_gdlm+0x1c/0x30 [lock_dlm]
 [<f8c2f55f>] lm_dlm_unmount+0x4f/0x70 [lock_dlm]
 [<f881242c>] lm_unmount+0x3c/0xa0 [lock_harness]
 [<f8fb60ef>] gfs_lm_unmount+0x2f/0x40 [gfs]
 [<f8fc62ab>] gfs_put_super+0x2fb/0x3a0 [gfs]
 [<c0165d67>] generic_shutdown_super+0x127/0x140
 [<f8fc337e>] gfs_kill_sb+0x2e/0x69 [gfs]
 [<c0165b71>] deactivate_super+0x81/0xa0
 [<c017c4dc>] sys_umount+0x3c/0xa0
 [<c017c559>] sys_oldumount+0x19/0x20
 [<c010323d>] sysenter_past_esp+0x52/0x75

dlm_astd      D 00000008     0 10264      6                3235 (L-TLB)
dc9c3ee8 00000046 dc9c3ed8 00000008 00000002 00000800 00000008 c8cc35e0
       f7bc0568 5f8a4c1c 0179a889 e4676c5a 00004b2d dc9c3f14 c051c000 c1716f60
       00000001 000001b0 0167c50b 00005e6b e9db55e0 e9db574c c1714060 00000000
Call Trace:
 [<c03cef7c>] rwsem_down_read_failed+0x9c/0x190
 [<f8aca119>] .text.lock.ast+0xc7/0x1de [dlm]
 [<f8ac9ea5>] dlm_astd+0x1e5/0x210 [dlm]
 [<c013245a>] kthread+0xba/0xc0
 [<c0101315>] kernel_thread_helper+0x5/0x10

So, it looks like dlm_astd is stuck on a down_read().

The only down_read I see is in process_asts().

	down_read(&ls->ls_in_recovery);

So it looks like dlm_astd is blocked on recovery of the lockspace, but the
DLM is not listed in /proc/cluster/services and
/proc/cluster/dlm_locks shows no locks.
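
To make the suspected wait chain concrete, here is a minimal user-space
sketch in Python.  It is not the DLM code.  It assumes the hang works the
way the traces suggest: the stopper waits for the thread to exit, while
the thread is blocked on a lock that something else still holds.  All
names here (ls_in_recovery as a plain Lock, astd_worker, try_stop, demo)
are made up for illustration, and the join timeout exists only so the
demo terminates; kthread_stop() has no timeout.

```python
import threading
import time

# Hypothetical stand-in for ls->ls_in_recovery.  A plain Lock models the
# semaphore being held by "recovery" and never released.
ls_in_recovery = threading.Lock()
stop_requested = threading.Event()


def astd_worker():
    """Analogue of the dlm_astd loop: take the lock, then check for stop."""
    while True:
        ls_in_recovery.acquire()  # blocks here while "recovery" holds the lock
        try:
            if stop_requested.is_set():
                return
        finally:
            ls_in_recovery.release()
        time.sleep(0.01)


def try_stop(worker, timeout):
    """Analogue of kthread_stop(), with a timeout so the demo can return."""
    stop_requested.set()
    worker.join(timeout)
    return not worker.is_alive()


def demo():
    ls_in_recovery.acquire()  # pretend recovery is holding the lock
    worker = threading.Thread(target=astd_worker, daemon=True)
    worker.start()

    # The worker never gets far enough to see the stop request.
    stopped_while_held = try_stop(worker, timeout=0.2)

    # Once the lock is dropped, the worker sees the request and exits.
    ls_in_recovery.release()
    worker.join(2.0)
    stopped_after_release = not worker.is_alive()
    return stopped_while_held, stopped_after_release
```

Under these assumptions demo() returns (False, True): the stop attempt
fails while the lock is held and succeeds as soon as it is released.
If the real hang has this shape, the question is what still holds
ls_in_recovery for write after the lockspace has gone away.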

Full info available here:
http://developer.osdl.org/daniel/GFS/test.25jan2005/

Ideas?

Daniel